This video gives an overview of the āsequencing-by-synthesisā approach used by Illumina. Other companies will have different techniques, but Illumina is probably the most-popular sequencing technology out there. For most of what we will discuss, it wonāt really matter how your samples were sequenced.
* Other sequencing technologies are available
Images from:- http://www.illumina.com/content/dam/illumina-marketing/documents/products/illumina_sequencing_introduction.pdf
A library is prepared by breaking the DNA of interest into shorter, single-strand fragments. Adapters are each added to each end of the fragments. A flow-cell has already been prepared to have a ālawnā of sequences that are complementary to the adapters.
Bridge amplification forms clusters of identical fragments on the flow-cell surface. This is required as weāre going to be taking images of the flow-cell and need many copies of each fragment so that we get a decent signal. However, as weāll see later, this can introduce errors.
The construction of a sequencing reads involves adding a single, terminated, DNA base (each given a distinct flourescent label) one at a time, simulataneously across the whole flow-cell, and taking an image.
So, we try adding an āAā with a red label and take an image
and then a āTā with a green, and take another image
Analysis of the images can determine which base was succesfully incorporated to each fragment.
The process of added each base repeats for the next cycle, and so-on for āNā cycles. e.g.Ā 100 times for a 100-base sequence.
Although this process is high-optimised and refined, the sheer number of reactions being performed means that errors are inevitable. The signal from a particular cluster could be affected by interference from its neighbours
Or sometimes we get a bit over-excited and add too many bases
Therefore the identification of bases comes with some degree of uncertainty which we must capture. This uncertainty can be incorporated into our analysis, as we will see later.
.TIFF
images; not unlike microarray data.TIFF
image ~ 7Mb = 2,240,000 Mb of data (2.24TB)We donāt need any special software to view these, but bear in mind there can be ~ 250 Million reads (sequences) per Hi-Seq lane. So using Word or Notepad is probably not a great idea.
An example .fastq
file has been provided for you in the folder /data/test
. If you are curious how this fastq file was generated, you can see the Appendix for details.
We can launch the CRUK Docker from the Desktop and navigate to the directory containing the files using the cd
command
cd /data/test
If we wanted to check the directory we are currently looking at, we can use the pwd
command.
pwd
The ls
command will list all the files in this directory
ls
Assuming the files are present, we can print the first few lines using the standard unix command head
-n 12
to print the first 12 lines of the file sample.fq1
head -n 12 sample.fq1
The unix command wc
can count the number of lines in a file with the option -l
wc -l sample.fq1
What would you type to print the first 12 lines of sample.fq2
?
The name of a sequence is unique and can encode some useful information. e.g.
@HWUSI-EAS100R:6:73:941:1973#0/1
However, this depends on instrument setup and processing pipelines. Sometimes the tile and coordinate information is omitted to save space.
As we saw earlier the process of deciding which base is present at each cycle of each fragment comes with some probability (p
) that we make a mistake. The quality score expresses our confidence in a particular base-call; higher quality score, higher confidence
The raw base-calling probabilities are converted to text characters to make it easier to store in a file
N?>:<9>>>:=;>>?<>:@?>;==@@@>?=AAA<>=A@?6>4B=<>>.@>?<@;?#############
First of all, we convert the base-calling probability (p) into a Q
score using the formula
Annoyingly, different sequencing instruments have used different offsets over time. Itās important to check what encoding has been used for your data
Phred+33
This handy graphic from wikipedia compares the different schemes
Given a particular quality string, we have to look-up the ASCII code for each character and subtract the offset to get the Q score. We can then convert to a probability using the formula:-
\[ p = 10^{-Q/10} \]
So for our particular example:
N?>:<9>>>:=;>>?<>:@?>;==@@@>?=AAA<>=A@?6>4B=<>>.@>?<@;?#############
it works out as follows:-
Character Code Minus.Offset..33.. Probability
1 N 78 45 0.00003
2 ? 63 30 0.00100
3 > 62 29 0.00126
4 : 58 25 0.00316
5 < 60 27 0.00200
6 9 57 24 0.00398
7 > 62 29 0.00126
8 > 62 29 0.00126
9 > 62 29 0.00126
10 : 58 25 0.00316
...
...
Character Code Minus.Offset..33.. Probability
58 # 35 2 0.63096
59 # 35 2 0.63096
60 # 35 2 0.63096
61 # 35 2 0.63096
62 # 35 2 0.63096
63 # 35 2 0.63096
64 # 35 2 0.63096
65 # 35 2 0.63096
66 # 35 2 0.63096
67 # 35 2 0.63096
68 # 35 2 0.63096
For the extremely-keen we can do this in R
:-
pq <- "N?>:<9>>>:=;>>?<>:@?>;==@@@>?=AAA<>=A@?6>4B=<>>.@>?<@;?#############"
code <- as.integer(charToRaw(as.character(pq)))
qs <- code -33
qs
## [1] 45 30 29 25 27 24 29 29 29 25 28 26 29 29 30 27 29 25 31 30 29 26 28
## [24] 28 31 31 31 29 30 28 32 32 32 27 29 28 32 31 30 21 29 19 33 28 27 29
## [47] 29 13 31 29 30 27 31 26 30 2 2 2 2 2 2 2 2 2 2 2 2 2
probs <- 10^(unlist(qs)/-10)
round(probs,5)
## [1] 0.00003 0.00100 0.00126 0.00316 0.00200 0.00398 0.00126 0.00126
## [9] 0.00126 0.00316 0.00158 0.00251 0.00126 0.00126 0.00100 0.00200
## [17] 0.00126 0.00316 0.00079 0.00100 0.00126 0.00251 0.00158 0.00158
## [25] 0.00079 0.00079 0.00079 0.00126 0.00100 0.00158 0.00063 0.00063
## [33] 0.00063 0.00200 0.00126 0.00158 0.00063 0.00079 0.00100 0.00794
## [41] 0.00126 0.01259 0.00050 0.00158 0.00200 0.00126 0.00126 0.05012
## [49] 0.00079 0.00126 0.00100 0.00200 0.00079 0.00251 0.00100 0.63096
## [57] 0.63096 0.63096 0.63096 0.63096 0.63096 0.63096 0.63096 0.63096
## [65] 0.63096 0.63096 0.63096 0.63096
It is possible to interrogate the fastq
files in R. However, in practice we tend to use other tools such as fastqc described in the next section.
library(ShortRead)
fq <- readFastq("/data/test/sample.fq1")
fq
(Acknowledgement to Ines De Santiago for her session at the previous summer school)
FastQC from the Babraham Institute Bioinformatics Core has emerged as the standard tool for performing quality assessment on sequencing reads
The manual for fastqc
is available online and is very comprehensive; especially the parts which describe particular sections of the report. The authors also run a āQCfailā blog which discusses some sequencing QC errors they have encountered and how they were diagnosed.
A ātraffic lightā system is used to draw your attention to sections of the report that require further investigation. However, it is worth bearing in mind that fastqc
is designed to be run on fastq files from any type of sequencing experiment and has no knowledge of the particular library preparation, or conditions that you are studying. It could be that you expect high levels of duplication or GC content. Always consider the nature of your study before taking any drastic action!
Also, fastqc
will not actually do anything to your data. If you decide to trim or remove contamination for your samples, you will need to use another tool.
fastqc
reportSome simple statistics about the composition of your file, which can be useful to see if it has guessed the encoding correctly and identified the correct number of reads. This section of the report is designed never to give a warning message
This section of the report is probably the one that receives most attention. Itās generally accepted that there is a degradation of quality over the duration of a sequencing run, but the extent to which the quality ādrops-offā should be monitored. A boxplot is produced for every base-position in the read and the central line and yellow box represent the median and inter-quartile range in the usual manner.
Ideally, the plot should look something like following:-
However, a warning will be triggered if the lower quartile (25% of the data) of any base in less than 10, or if the median for any base is less than 25. A failure (red cross in the traffic light system) occurs if the lower quartile for any base is less than 5, or if the median for any base is less than 20.
With this section of the report, we are checking to see if there is a population of sequences that have low quality values. A warning occurs when the mean quality is below 27, whereas a failure indicates a mean below 20.
fastqc
fastqc
can be run from the CRUK Docker as follows;
fastqc /data/test/sample.fq1
As a result, you should get two files; sample.fq1_fastqc.zip
and sample.fq1_fastqc.html
. The .zip
file contains all the metrics that fastqc
computes, should you wish to perform extra manipulation and visualisation beyond what fastqc
offers.
fastqc
stored?
fastqc
/home/participant/Course_Materials/Day1/
?fastqc -h
For your reference, here is how the example files were created
First, we download an example bam file from the 1000 genomes project
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/other_exome_alignments/NA06984/exome_alignment/NA06984.mapped.illumina.mosaik.CEU.exome.20111114.bam
Then the file is downsampled to give a much reduced set of reads. The original file can be removed
java -jar $PICARD DownsampleSam I=NA06984.mapped.illumina.mosaik.CEU.exome.20111114.bam O=random.bam P=0.1 VALIDATION_STRINGENCY=SILENT
rm NA06984.mapped.illumina.mosaik.CEU.exome.20111114.bam
For convenience, we filter the file to keep only reads that are properly paired. Then picard is used again to extract the fastq data from the file. As the original alignments were a mix of 68 and 76 bases, we trim all the reads to 68 bases to processing easier.
samtools view -f 0x02 -b random.bam > paired.bam
java -jar $PICARD SamToFastq I=paired.bam F=sample.fq1 VALIDATION_STRINGENCY=SILENT F2=sample.fq2 R1_MAX_BASES=68 R2_MAX_BASES=68
If you are interested this section covers how to trim our data
Based on these plots we may want to trim our data; fastqc will not do this for us.
Popular choices are fastx_toolit, trimmomatic and cutadapt; all of which should be installed on your computers. As implied by the name, cutadapt
is useful for removing adaptor sequences.
fastx_toolkit
allows us to remove reads that do not have a high enough proportion of high-quality reads.
You can run these commands after having run the CRUK Docker shortcut
## Check we're in the correct directory
cd /home/participant/Course_Materials/Day1
fastq_quality_filter -v Q 33 -q 20 -p 75 -i /data/test/sample.fq1 -o sample_filtered.fq
the options were used:-
-i: input file
-o: output file
-v: report the number of sequences
-Q: 33, determines the input quality ASCII offset
-q: 20, the quality value required
-p 75: the percentage of bases required to have that quality value
the output should be something likeā¦
Quality cut-off: 20
Minimum percentage: 75
Input: 6290535 reads.
Output: 5399767 reads.
discarded 890768 (14%) low-quality reads.
It can also do straightforward trimming to a particular read length
fastx_trimmer -v -f 1 -l 60 -i /data/test/sample.fq1 -o sample.fastx.trimmed.fq1
the options specified were:-
-f: first base to keep
-l: last base to keep
-i: input file
-o: output file
-v: report number of sequences
Trimmomatic is interesting as it allows for differently-sized reads.
To use trimmomatic
we can run the following in a Terminal. Here $TRIMMOMATIC
is used to refer to the particular location that the tool has been installed to.
java -jar $TRIMMOMATIC SE -phred33 /data/test/sample.fq1 sample.trimmed.fq1 TRAILING:3 MINLEN:30
SE: run single-end mode
-phred33: use the offset of 33 to interpret quality scores
TRAILING:3 cut bases from the end of the read if below 3
We can interrogate the files we have just created using the ShortRead
package.
How many were removed?
length(trimmed.fq)/length(fq)
summary(width(sread(trimmed.fq)))
hist(width(sread(trimmed.fq)))
We can verify why some of the reads were removed
old.ids <- as.character(id(fq))
ids.left <- as.character(id(trimmed.fq))
missing.ids <- setdiff(old.ids,ids.left)
myquals <- quality(fq)
myquals[old.ids %in% missing.ids]
fastqc
and observe what effect this has on the resultsfastqc sample.trimmed.fq1
fastqc sample.fastx.trimmed.fq1
fastqc sample.filtered.fq1