Course etiquette
Please read the course etiquette, if you haven’t read that yet.
Shared document
We are using shared GoogleDocs documents for each of the main topics covered during the summer school. The document for this section can be found here.
Docker
docker exec -ti -u 0 docker /bin/bash
Software requirements (if not using Docker)
If you want to follow this tutorial using your own machine, you need to install the following command line tools:
You can install the tools one by one, but a very convenient way to manage installed tools/packages and their dependencies is Conda. If you are new to Conda, please follow this tutorial.
Sample dataset
A dataset for this tutorial is available here. Please keep in mind that, the directories and folder structure in your machine may differ from the one we used during the course.
It is difficult to overestimate the impact that the development of Next Generation Sequencing (NGS), a.k.a high-throughput sequencing, had on modern biology and medicine. Taking a genome-wide perspective in research greatly increased our understanding of the interconnectivity between physiological or pathological processes and greatly accelerated the development of new treatment strategies.
By definition, NGS involves parallel sequencing of millions of DNA or RNA fragments. It is the “catch-all” term used to describe a number of different modern sequencing technologies. Although there are many variants and applications of NGS, the first few steps of data analysis are the same for the vast majority of sequencing techniques.
During this tutorial you will learn how to:
FastQC
Cutadapt
MultiQC
Dear diary, today I got my data back from sequencing facilities…
Typically, the sequencer output consits of sequenced fragments (reads) and per-base sequencing quality measurments, that are saved as text file in a FASTQ
format. More detailed informations about FASTQ
format can be found here.
We will use a virtual desktop to run this tutorial. Once logged in we need to open a terminal window and navigate to: /home/ubuntu/Course_Materials/Introduction
. Two example samples that we will use for this tutorial are located in the data
folder. Let’s check that:
cd /home/ubuntu/Course_Materials/Introduction/practicals
ls data/*.fastq.gz
The FASTQ files were compressed in the GNU zip format (an open source file compression program), as indicated by the .gz
file extension. This is a standard format that you are likely to receive from sequencing facilities.
Lets have a quick look at the first two reads in the FASTQ
file so we can see how the data are organised. We’ll use a command gunzip
to decompress the FASTQ
file and pipe |
the output directly to another command head -n 8
, that will allow us to see the first 8 lines of the files (a.k.a the first 2 reads):
gunzip -c data/raji_rnaseq_rep1.fastq.gz | head -n 8
## @K00252:56:HFVJNBBXX:4:1101:2544:1648 1:N:0:GCCAAT
## GAAAGATTTCAGTTGAGGAACGGGAACAAAGATTATGATAGCTTTCCGAC
## +
## <<AFFJFJJJJJ<JJJ7<F-F7<FJF<FFJJJJJJ-F-A-FFFFJJ7<--
## @K00252:56:HFVJNBBXX:4:1101:2605:1648 1:N:0:GCCAAT
## TGAGGAATGGCTTATTTTCTGATGACCACATGTGGGACTATTTCAACCGC
## +
## AAF<<FJJJFFFJJJJJJJJJJJJJJFJJJ-F-FFJJJFFJJ-F-JJ7A<
Line 4 contains base call quality scores encoded using ASCII characters, see here. It would be impossible to look at the quality of all reads manually, so we will use a command fastqc
to generate quality report of our .fastq
files. FastQC1 FastQC: A Quality Control Tool for High Throughput Sequence Data [Online]. Available online at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ (2015) is a very popular tool that provides basic quality control metrics for raw NGS data.
NOTE: This is a standard first step in NGS data analysis that should never be skipped!
# Quick look at the documentation of fastqc command
fastqc --help
We’ll save the quality report in a separate folder ‘fastqc’, so let’s create a new directory:
mkdir fastqc
Now run fastqc
with -o fastqc
option to save the output in the fastqc
folder.
fastqc data/raji_rnaseq_rep1.fastq.gz -o fastqc
The output from FastQC is an html file that may be viewed in your browser.
- Navigate to
/home/ubuntu/Course_Materials/Introduction/practicals
.- Check if the
fastqc
folder is present.- Run
fastqc
command fortp53_chipseq_rep1.fastq.gz
and save the output in thefastqc
folder.
- Open the FastQC report in your browser.
- Compare the results of the two files.
In order to increase our chances to align sequenced reads accurately, only a small number of mismatched bases is allowed. If a read contains too many mismatches, it is marked as unaligned. If we want to accurately align as many reads as possible, we may remove unwanted/noisy information from our data, eg:
NOTE: Some aligners, like STAR, are able to account for low-quality bases at the ends of reads.
NOTE: Aggressive quality-based trimming of RNA-Seq reads was found to affect gene expression estimates. Imposing minimum read length requirements reverts gene expression estimates to values closer to estimates produced from untrimmed reads. Another potential improvement may be to use longer sequencing reads, such as 100 or 150 bases2 Williams, C.R., Baccarella, A., Parrish, J.Z. et al. Trimming of sequence reads alters RNA-Seq gene expression estimates. BMC Bioinformatics 17, 103 (2016).
NOTE: If the source of contamination cannot be identified, FastQScreen, aligns the reads against a standard set of libraries. It was built as a QC check for sequencing pipelines but may also be useful in characterising metagenomic samples.
Once you have had a closer look at the quality report you can see that the data quality is not too bad, however we still might be able to improve the quality with some quality based trimming, since the quality drops towards the end of the reads.
We will use Cutadapt for trimming, so let’s have a look at its help page:
cutadapt --help
As you can see Cutadapt has many options for:
In our case all we want to do is to remove low quality bases from our reads. We can use the following command to do this:
cutadapt -m 10 -q 20 -j 4 -o data/raji_rnaseq_rep1_trimmed.fastq.gz data/raji_rnaseq_rep1.fastq.gz
Let’s go through the trimming parameters we are using in the command above:
-m 10
: will discard all reads that would be shorter than a read length of 10 after the trimming,
-q 20
: will trim low-quality bases from the 3’ end of the reads; if two comma-separated cutoffs are given, the 5’ end is trimmed with the first cutoff, the 3’ end with the second.
-j 4
: number of cores
-o data/raji_rnaseq_rep1_trimmed.fastq.gz
: will specify name of the output file
Now we can check how the trimming affected quality of our reads:
fastqc data/raji_rnaseq_rep1_trimmed.fastq.gz -o fastqc
- Run
cutadapt
fortp53_chipseq_rep1.fastq.gz
setting the trimming quality to20
for both read ends and discarding all reads shorter than10
. Save the output astp53_chipseq_rep1_trimmed.fastq.gz
- Run
fastqc
for the trimmed.fastq.gz
files. Compare the files before and after trimming
Looking at the FastQC reports one by one may be time-consuming and tiresome. MultiQC searches a given folder for analysis logs and compiles a single HTML report. Full documentation can be viewed here. By now the folder fastqc
should contain two FastQC reports. We will run multiqc
to generate the summarised report. We’ll save the output in a separate multiqc
folder.
mkdir multiqc
multiqc -s -o multiqc fastqc
NOTE: We are using a parameter -s
to use unique naming for all analysed samples.
- Run
multiqc
to generate a summarised report for allfastqc
runs generated so far.
- Compare the quality reports before and after trimming.
Video explaining Illumina sequenicng by synthesis
Detailed overview of NGS techniques:
Goodwin S. et al. Nature Reviews Genetics 17, 333–351 (2016)
FastQC has a solid documented manual page with more details about all the plots in the report. Looking at this post for more information on what bad plots look like and what they mean for your data.
Dora Bihary
VIB Center for Cancer Biology, University of Leuven, BE
MRC Cancer Unit, University of Cambridge, UK
Joanna A. Krupka MRC Cancer Unit, University of Cambridge, UK
Shoko Hirosue MRC Cancer Unit, University of Cambridge, UK