Introduction to NGS data

Mark Dunning

Last modified: 18 Jul 2015

Why do sequencing?

Microarrays vs sequencing

The cost of sequencing

costs

Reports of the death of microarrays

microarray-dead

Reports of the death of microarrays. Greatly exaggerated?

http://core-genomics.blogspot.co.uk/2014/08/seqc-kills-microarrays-not-quite.html

hadfield-blog

What are NGS data?

Different terminologies for the same thing

Illumina sequencing *

http://www.illumina.com/content/dam/illumina-marketing/documents/products/illumina_sequencing_introduction.pdf

* Other sequencing technologies are available

Illumina sequencing

seq1

http://www.illumina.com/content/dam/illumina-marketing/documents/products/illumina_sequencing_introduction.pdf

Illumina sequencing

seq2

http://www.illumina.com/content/dam/illumina-marketing/documents/products/illumina_sequencing_introduction.pdf

Illumina sequencing

seq3

http://www.illumina.com/content/dam/illumina-marketing/documents/products/illumina_sequencing_introduction.pdf

Paired-end

seq4

Multiplexing

seq5

Image processing

cluster

Image processing

firecrest

Base-calling

bustard

Alignment

Post-processing of aligned files

Data formats

Raw reads - fastq

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

~250 million reads (sequences) per HiSeq lane
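Each fastq record occupies four lines: the sequence name (starting with @), the bases, a separator line (starting with +) and one quality character per base. A minimal Python sketch of a reader, assuming unwrapped four-line records; the names FastqRecord and read_fastq are illustrative and not part of any library:

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    name: str      # header line without the leading "@"
    sequence: str  # base calls
    quality: str   # one encoded quality character per base


def read_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield one record per four-line block (assumes unwrapped fastq)."""
    stripped = (line.rstrip("\r\n") for line in lines)
    for header in stripped:
        if not header:
            continue  # tolerate blank lines between records
        sequence = next(stripped, None)
        separator = next(stripped, None)
        quality = next(stripped, None)
        if sequence is None or separator is None or quality is None:
            raise ValueError("truncated fastq record")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed fastq record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:], sequence, quality)


# The record shown on the slide above.
example = [
    "@SEQ_ID",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT",
    "+",
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65",
]
records = list(read_fastq(example))
print(records[0].name, len(records[0].sequence))
```

With ~250 million reads per lane, a generator like this avoids holding the whole file in memory; in practice an established parser would usually be preferred.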

Fastq sequence names

@HWUSI-EAS100R:6:73:941:1973#0/1
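The name above follows the older Illumina naming convention: instrument name, flowcell lane, tile number, x and y coordinates of the cluster, index (after #) and member of the pair (after /). A rough Python sketch of splitting it into fields; the function name parse_read_name is hypothetical, and newer Illumina software writes a different layout that this pattern will not match:

```python
import re
from typing import Dict, Union

# Pattern for the older Illumina naming scheme only (an assumption of this sketch).
OLD_ILLUMINA_NAME = re.compile(
    r"^@?(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+)"
    r":(?P<x>\d+):(?P<y>\d+)#(?P<index>[^/]+)/(?P<member>[12])$"
)


def parse_read_name(name: str) -> Dict[str, Union[str, int]]:
    """Split an old-style Illumina read name into named fields."""
    match = OLD_ILLUMINA_NAME.match(name)
    if match is None:
        raise ValueError(f"not an old-style Illumina read name: {name!r}")
    fields: Dict[str, Union[str, int]] = dict(match.groupdict())
    for key in ("lane", "tile", "x", "y", "member"):
        fields[key] = int(fields[key])  # numeric fields become integers
    return fields


fields = parse_read_name("@HWUSI-EAS100R:6:73:941:1973#0/1")
print(fields)
```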

Fastq quality scores

!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

Fastq quality scores

phred
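Each quality character encodes a Phred score Q, where the probability that the base call is wrong is 10^(-Q/10). A small Python sketch of the conversion, assuming the Phred+33 (Sanger) encoding in which "!" means Q = 0; older Illumina pipelines used an offset of 64, so the offset is left as a parameter. The function names are illustrative:

```python
from typing import List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Convert a quality string to integer Phred scores (offset 33 assumed)."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q: int) -> float:
    """Probability that a base with Phred score q was called incorrectly."""
    return 10 ** (-q / 10)


# The quality string shown on the slide above.
quality = "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65"
scores = phred_scores(quality)
print(min(scores), max(scores))
print(error_probability(20), error_probability(30))
```

Q = 20 corresponds to a 1 in 100 chance of an incorrect call and Q = 30 to 1 in 1000.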

Useful for quality control

fastqc

Aligned reads - sam

HWI-ST1001:137:C12FPACXX:7:1115:14131:66670     0       chr1    12805   1       42M4I5M *
0       0       TTGGATGCCCCTCCACACCCTCTTGATCTTCCCTGTGATGTCACCAATATG     
CCCFFFFFHHGHHJJJJJHJJJJJJJJJJJJJJJJIJJJJJJJJJJJJIJJ     
AS:i:-28        XN:i:0  XM:i:2  XO:i:1  XG:i:4  NM:i:6  MD:Z:2C41C2     YT:Z:UU NH:i:3
CC:Z:chr15      CP:i:102518319  XS:A:+  HI:i:0

Sam format - key columns

HWI-ST1001:137:C12FPACXX:7:1115:14131:66670     0       chr1    12805   1       42M4I5M *
0       0       TTGGATGCCCCTCCACACCCTCTTGATCTTCCCTGTGATGTCACCAATATG     
CCCFFFFFHHGHHJJJJJHJJJJJJJJJJJJJJJJIJJJJJJJJJJJJIJJ     
AS:i:-28        XN:i:0  XM:i:2  XO:i:1  XG:i:4  NM:i:6  MD:Z:2C41C2     YT:Z:UU NH:i:3
CC:Z:chr15      CP:i:102518319  XS:A:+  HI:i:0

sam
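A sam record is a single tab-separated line: eleven mandatory columns (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL) followed by optional TAG:TYPE:VALUE fields. A minimal Python sketch of splitting the record above, with only a few of its optional tags included for brevity; parse_sam_line is a hypothetical helper, not a library function:

```python
from typing import Any, Dict

MANDATORY = ("qname", "flag", "rname", "pos", "mapq", "cigar",
             "rnext", "pnext", "tlen", "seq", "qual")
INTEGER_FIELDS = {"flag", "pos", "mapq", "pnext", "tlen"}


def parse_sam_line(line: str) -> Dict[str, Any]:
    """Split one tab-separated alignment line into named columns plus tags."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(MANDATORY):
        raise ValueError("expected at least 11 tab-separated columns")
    record: Dict[str, Any] = {}
    for name, value in zip(MANDATORY, columns):
        record[name] = int(value) if name in INTEGER_FIELDS else value
    tags: Dict[str, Any] = {}
    for item in columns[len(MANDATORY):]:
        tag, kind, value = item.split(":", 2)  # e.g. "NM:i:6"
        tags[tag] = int(value) if kind == "i" else value
    record["tags"] = tags
    return record


line = "\t".join([
    "HWI-ST1001:137:C12FPACXX:7:1115:14131:66670", "0", "chr1", "12805", "1",
    "42M4I5M", "*", "0", "0",
    "TTGGATGCCCCTCCACACCCTCTTGATCTTCCCTGTGATGTCACCAATATG",
    "CCCFFFFFHHGHHJJJJJHJJJJJJJJJJJJJJJJIJJJJJJJJJJJJIJJ",
    "AS:i:-28", "NM:i:6", "MD:Z:2C41C2", "NH:i:3",
])
record = parse_sam_line(line)
print(record["rname"], record["pos"], record["cigar"], record["tags"]["NM"])
```

The CIGAR string 42M4I5M accounts for 42 + 4 + 5 = 51 read bases, which matches the length of the sequence column.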

Sam file flags

sam-flags
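The FLAG column is a sum of powers of two, where each bit answers one yes/no question about the read. A short Python sketch that decodes a flag using the bit values from the SAM specification; the short labels and the function name decode_flag are this sketch's own:

```python
from typing import List

# Bit values follow the SAM specification; the labels are informal.
FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse_strand"),
    (0x20, "mate_reverse_strand"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]


def decode_flag(flag: int) -> List[str]:
    """Return the label of every bit that is set in a sam FLAG value."""
    return [label for bit, label in FLAG_BITS if flag & bit]


print(decode_flag(99))   # 99 = 1 + 2 + 32 + 64
print(decode_flag(0))    # the record shown earlier: no bits set
```

A flag of 0, as in the example record, means an unpaired read mapped to the forward strand with no other properties set.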

Aligned reads - bam

samtools view mysequences.bam | head

samtools flagstat

$ samtools flagstat NA19914.chr22.bam
2109857 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplimentary
40096 + 0 duplicates
2064356 + 0 mapped (97.84%:-nan%)
2011540 + 0 paired in sequencing
1005911 + 0 read1
1005629 + 0 read2
1903650 + 0 properly paired (94.64%:-nan%)
1920538 + 0 with itself and mate mapped
45501 + 0 singletons (2.26%:-nan%)
5134 + 0 with mate mapped to a different chr
4794 + 0 with mate mapped to a different chr (mapQ>=5)
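The percentages in this output use different denominators: mapped is relative to the total number of reads, whereas properly paired and singletons are relative to the reads paired in sequencing. A Python sketch that parses the text above and reproduces the three figures; parse_flagstat and percent are hypothetical helpers, and the layout of flagstat output differs between samtools versions:

```python
import re
from typing import Dict

FLAGSTAT = """\
2109857 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplimentary
40096 + 0 duplicates
2064356 + 0 mapped (97.84%:-nan%)
2011540 + 0 paired in sequencing
1005911 + 0 read1
1005629 + 0 read2
1903650 + 0 properly paired (94.64%:-nan%)
1920538 + 0 with itself and mate mapped
45501 + 0 singletons (2.26%:-nan%)
5134 + 0 with mate mapped to a different chr
4794 + 0 with mate mapped to a different chr (mapQ>=5)
"""

LINE = re.compile(r"^(\d+) \+ (\d+) (.*)$")


def parse_flagstat(text: str) -> Dict[str, int]:
    """Map each flagstat label to its QC-passed read count."""
    counts: Dict[str, int] = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match is None:
            continue  # skip the command line and anything unexpected
        # Drop a trailing percentage annotation such as " (97.84%:-nan%)".
        label = re.sub(r" \([^)]*%[^)]*\)$", "", match.group(3))
        counts[label] = int(match.group(1))
    return counts


def percent(part: int, whole: int) -> float:
    return round(100 * part / whole, 2)


counts = parse_flagstat(FLAGSTAT)
total = next(v for k, v in counts.items() if k.startswith("in total"))
paired = counts["paired in sequencing"]
print(percent(counts["mapped"], total))
print(percent(counts["properly paired"], paired))
print(percent(counts["singletons"], paired))
```

The second value in each bracket refers to the QC-failed reads; there are none in this file, so the division is 0/0 and is printed as -nan%.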

Aligned files in IGV

igv

Other misc. formats

It is often said that bioinformaticians love coming up with new file formats

What happens next?