1 Before we start

Course etiquette
Please read the course etiquette, if you haven’t read that yet.

Shared document
We are using shared GoogleDocs documents for each of the main topics covered during the summer school. The document for this section can be found here.

Docker

docker exec -ti -u 0 docker /bin/bash

Software requirements (if not using Docker)
If you want to follow this tutorial using your own machine, you need to install the following command line tools:

You can install the tools one by one, but a very convenient way to manage installed tools/packages and their dependencies is Conda. If you are new to Conda, please follow this tutorial.

Sample dataset
A dataset for this tutorial is available here. Please keep in mind that, the directories and folder structure in your machine may differ from the one we used during the course.

2 Introduction

It is difficult to overestimate the impact that the development of Next Generation Sequencing (NGS), a.k.a high-throughput sequencing, had on modern biology and medicine. Taking a genome-wide perspective in research greatly increased our understanding of the interconnectivity between physiological or pathological processes and greatly accelerated the development of new treatment strategies.

By definition, NGS involves parallel sequencing of millions of DNA or RNA fragments. It is the “catch-all” term used to describe a number of different modern sequencing technologies. Although there are many variants and applications of NGS, the first few steps of data analysis are the same for the vast majority of sequencing techniques.

2.1 Learning objectives

During this tutorial you will learn how to:

perform quality checks of raw sequencing data using FastQC
perform quality trimming of raw sequencing data using Cutadapt
combine many FastQC reports into one using MultiQC

3 Quality check of raw FASTQ files

Dear diary, today I got my data back from sequencing facilities…

Typically, the sequencer output consits of sequenced fragments (reads) and per-base sequencing quality measurments, that are saved as text file in a FASTQ format. More detailed informations about FASTQ format can be found here.

3.1 FastQC

We will use a virtual desktop to run this tutorial. Once logged in we need to open a terminal window and navigate to: /home/ubuntu/Course_Materials/Introduction. Two example samples that we will use for this tutorial are located in the data folder. Let’s check that:

cd /home/ubuntu/Course_Materials/Introduction/practicals
ls data/*.fastq.gz

The FASTQ files were compressed in the GNU zip format (an open source file compression program), as indicated by the .gz file extension. This is a standard format that you are likely to receive from sequencing facilities.

Lets have a quick look at the first two reads in the FASTQ file so we can see how the data are organised. We’ll use a command gunzip to decompress the FASTQ file and pipe | the output directly to another command head -n 8, that will allow us to see the first 8 lines of the files (a.k.a the first 2 reads):

gunzip -c data/raji_rnaseq_rep1.fastq.gz | head -n 8

## @K00252:56:HFVJNBBXX:4:1101:2544:1648 1:N:0:GCCAAT
## GAAAGATTTCAGTTGAGGAACGGGAACAAAGATTATGATAGCTTTCCGAC
## +
## <<AFFJFJJJJJ<JJJ7<F-F7<FJF<FFJJJJJJ-F-A-FFFFJJ7<--
## @K00252:56:HFVJNBBXX:4:1101:2605:1648 1:N:0:GCCAAT
## TGAGGAATGGCTTATTTTCTGATGACCACATGTGGGACTATTTCAACCGC
## +
## AAF<<FJJJFFFJJJJJJJJJJJJJJFJJJ-F-FFJJJFFJJ-F-JJ7A<

Line 4 contains base call quality scores encoded using ASCII characters, see here. It would be impossible to look at the quality of all reads manually, so we will use a command fastqc to generate quality report of our .fastq files. FastQC111 FastQC: A Quality Control Tool for High Throughput Sequence Data [Online]. Available online at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ (2015) is a very popular tool that provides basic quality control metrics for raw NGS data.

NOTE: This is a standard first step in NGS data analysis that should never be skipped!

# Quick look at the documentation of fastqc command
fastqc --help

We’ll save the quality report in a separate folder ‘fastqc’, so let’s create a new directory:

mkdir fastqc

Now run fastqc with -o fastqc option to save the output in the fastqc folder.

fastqc data/raji_rnaseq_rep1.fastq.gz -o fastqc

The output from FastQC is an html file that may be viewed in your browser.

3.2 Exercise 1

Navigate to /home/ubuntu/Course_Materials/Introduction/practicals.

Check if the fastqc folder is present.

Run fastqc command for tp53_chipseq_rep1.fastq.gz and save the output in the fastqc folder.

Open the FastQC report in your browser.

Compare the results of the two files.

4 Trimming and artefact removal (optional)

In order to increase our chances to align sequenced reads accurately, only a small number of mismatched bases is allowed. If a read contains too many mismatches, it is marked as unaligned. If we want to accurately align as many reads as possible, we may remove unwanted/noisy information from our data, eg:

Poor quality bases at read ends (either 3’-end or 5’-end)
Leftover adapter sequences
Known contaminants (strings of As/Ts, other sequences, primers etc.)

NOTE: Some aligners, like STAR, are able to account for low-quality bases at the ends of reads.

NOTE: Aggressive quality-based trimming of RNA-Seq reads was found to affect gene expression estimates. Imposing minimum read length requirements reverts gene expression estimates to values closer to estimates produced from untrimmed reads. Another potential improvement may be to use longer sequencing reads, such as 100 or 150 bases222 Williams, C.R., Baccarella, A., Parrish, J.Z. et al. Trimming of sequence reads alters RNA-Seq gene expression estimates. BMC Bioinformatics 17, 103 (2016).

NOTE: If the source of contamination cannot be identified, FastQScreen, aligns the reads against a standard set of libraries. It was built as a QC check for sequencing pipelines but may also be useful in characterising metagenomic samples.

4.1 Quality based trimming using Cutadapt

Once you have had a closer look at the quality report you can see that the data quality is not too bad, however we still might be able to improve the quality with some quality based trimming, since the quality drops towards the end of the reads.

We will use Cutadapt for trimming, so let’s have a look at its help page:

cutadapt --help

As you can see Cutadapt has many options for:

Trimming based on quality threshold
Trimming some bases from the 5’ or 3’ ends of reads, removing adapter contamination

In our case all we want to do is to remove low quality bases from our reads. We can use the following command to do this:

cutadapt -m 10 -q 20 -j 4 -o data/raji_rnaseq_rep1_trimmed.fastq.gz data/raji_rnaseq_rep1.fastq.gz

Let’s go through the trimming parameters we are using in the command above:

-m 10: will discard all reads that would be shorter than a read length of 10 after the trimming,
-q 20: will trim low-quality bases from the 3’ end of the reads; if two comma-separated cutoffs are given, the 5’ end is trimmed with the first cutoff, the 3’ end with the second.
-j 4: number of cores
-o data/raji_rnaseq_rep1_trimmed.fastq.gz: will specify name of the output file

Now we can check how the trimming affected quality of our reads:

fastqc data/raji_rnaseq_rep1_trimmed.fastq.gz -o fastqc

4.2 Exercise 1

Run cutadapt for tp53_chipseq_rep1.fastq.gz setting the trimming quality to 20 for both read ends and discarding all reads shorter than 10. Save the output as tp53_chipseq_rep1_trimmed.fastq.gz

Run fastqc for the trimmed .fastq.gz files. Compare the files before and after trimming

4.3 MultiQC for aggregating results across many samples

Looking at the FastQC reports one by one may be time-consuming and tiresome. MultiQC searches a given folder for analysis logs and compiles a single HTML report. Full documentation can be viewed here. By now the folder fastqc should contain two FastQC reports. We will run multiqc to generate the summarised report. We’ll save the output in a separate multiqc folder.

mkdir multiqc
multiqc -s -o multiqc fastqc

NOTE: We are using a parameter -s to use unique naming for all analysed samples.

4.4 Exercise 2

Run multiqc to generate a summarised report for all fastqc runs generated so far.

Compare the quality reports before and after trimming.

4.5 Further reading

Video explaining Illumina sequenicng by synthesis

Detailed overview of NGS techniques:
Goodwin S. et al. Nature Reviews Genetics 17, 333–351 (2016)

FastQC has a solid documented manual page with more details about all the plots in the report. Looking at this post for more information on what bad plots look like and what they mean for your data.

5 Acknowledgements

Dora Bihary
VIB Center for Cancer Biology, University of Leuven, BE
MRC Cancer Unit, University of Cambridge, UK

Joanna A. Krupka MRC Cancer Unit, University of Cambridge, UK

Shoko Hirosue MRC Cancer Unit, University of Cambridge, UK

Harvard Chan Bioinformatics Core

Practical 1: Preprocessing of Next Generation Sequencing data

July 2021