Introduction to NGS analysis with R

Mark Dunning

1 June 2015

Historical overview

Probe and target

probe-and-target

Two-colour arrays

two-colour

Single-Channel

one-colour

Look at a ‘modern’ microarray

illumina-chips

Use in cancer research

heatmap km

Microarrays vs sequencing

The cost of sequencing

costs

Reports of the death of microarrays

microarray-dead

Reports of the death of microarrays. Greatly exagerated?

http://core-genomics.blogspot.co.uk/2014/08/seqc-kills-microarrays-not-quite.html

hadfield-blog

What are NGS data?

Different terminologies for same thing

Illumina sequencing *

http://www.illumina.com/content/dam/illumina-marketing/documents/products/illumina_sequencing_introduction.pdf

* Other sequencing technologies are available

Illumina sequencing

seq1

http://www.illumina.com/content/dam/illumina-marketing/documents/products/illumina_sequencing_introduction.pdf

Illumina sequencing

seq2

http://www.illumina.com/content/dam/illumina-marketing/documents/products/illumina_sequencing_introduction.pdf

Illumina sequencing

seq3

http://www.illumina.com/content/dam/illumina-marketing/documents/products/illumina_sequencing_introduction.pdf

Paired-end

seq4

Multiplexing

seq5

Image processing

cluster

Image processing

firecrest

Base-calling

bustard

Alignment

Post-processing of aligned files

Data formats

Raw reads - fastq

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

~ 250 Million reads (sequences) per Hi-Seq lane

Fastq sequence names

@HWUSI-EAS100R:6:73:941:1973#0/1

Fastq quality scores

!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

Fastq quality scores

phred

Useful for quality control

fastqc

Aligned reads - sam

HWI-ST1001:137:C12FPACXX:7:1115:14131:66670     0       chr1    12805   1       42M4I5M *
0       0       TTGGATGCCCCTCCACACCCTCTTGATCTTCCCTGTGATGTCACCAATATG     
CCCFFFFFHHGHHJJJJJHJJJJJJJJJJJJJJJJIJJJJJJJJJJJJIJJ     
AS:i:-28        XN:i:0  XM:i:2  XO:i:1XG:i:4   NM:i:6  MD:Z:2C41C2     YT:Z:UU NH:i:3  
CC:Z:chr15      CP:i:102518319  XS:A:+  HI:i:0

Sam format - key columns

HWI-ST1001:137:C12FPACXX:7:1115:14131:66670     0       chr1    12805   1       42M4I5M *
0       0       TTGGATGCCCCTCCACACCCTCTTGATCTTCCCTGTGATGTCACCAATATG     
CCCFFFFFHHGHHJJJJJHJJJJJJJJJJJJJJJJIJJJJJJJJJJJJIJJ     
AS:i:-28        XN:i:0  XM:i:2  XO:i:1XG:i:4   NM:i:6  MD:Z:2C41C2     YT:Z:UU NH:i:3  
CC:Z:chr15      CP:i:102518319  XS:A:+  HI:i:0

sam

Sam file flags

sam-flags

Aligned reads - bam

samtools view mysequences.bam | head

samtools flagstat

$ samtools flagstat NA19914.chr22.bam
2109857 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplimentary
40096 + 0 duplicates
2064356 + 0 mapped (97.84%:-nan%)
2011540 + 0 paired in sequencing
1005911 + 0 read1
1005629 + 0 read2
1903650 + 0 properly paired (94.64%:-nan%)
1920538 + 0 with itself and mate mapped
45501 + 0 singletons (2.26%:-nan%)
5134 + 0 with mate mapped to a different chr
4794 + 0 with mate mapped to a different chr (mapQ>=5)

Aligned files in IGV

igv

Why use R and Bioconductor?

Legacy from the good old days

Why reproducibility?

Useful data types in R

nki
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 24481 features, 337 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: NKI_4 NKI_6 ... NKI_404 (337 total)
##   varLabels: samplename dataset ... e.os (21 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: Contig45645_RC Contig44916_RC ... Contig15167_RC
##     (24481 total)
##   fvarLabels: probe EntrezGene.ID ... Description (10 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation: rosetta

Data visualisation

boxplot(expressionMatrix,outline=FALSE)

plot of chunk unnamed-chunk-4

Analysis methods

geneExpression[1:10]
##  NKI_4  NKI_6  NKI_7  NKI_8  NKI_9 NKI_11 NKI_12 NKI_13 NKI_14 NKI_17 
## -0.007  0.074 -0.767 -0.820 -0.180 -0.296    NaN -0.163  0.059 -0.035
myGroup[1:10]
##  [1] 1 1 0 0 1 1 0 1 1 1
t.test(geneExpression,myGroup)
## 
##  Welch Two Sample t-test
## 
## data:  geneExpression and myGroup
## t = -25.65, df = 617.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.1125 -0.9543
## sample estimates:
## mean of x mean of y 
##   -0.2945    0.7389

Clustering example

dist <- dist(t(expressionMatrix))
clust <- hclust(dist)
plot(clust)

plot of chunk unnamed-chunk-8

Principal Components Analysis

pca <- prcomp(dist)
plot(pca$rotation[,1],pca$rotation[,2],col=sampleCols,pch=16)

plot of chunk unnamed-chunk-9

Heatmaps

heatmap(as.matrix(expressionMatrix[topDE,]),ColSideColors = sampleCols)

plot of chunk unnamed-chunk-11

Conclusion

Course overview

Practical 1: Introduction to Bioconductor