Introduction to NGS analysis with R
Mark Dunning
1 June 2015
Historical overview
Probe and target
Two-colour arrays
Single-Channel
Look at a ‘modern’ microarray
Around 48,000 genes per sample, 12 samples on a slide
Use in cancer research
Extensive use in profiling of cancer patients and defining disease subtypes
But is there a better way?
sequencing technologies have been around since the 70s, but traditionally time-consuming and low-throughput
Microarrays vs sequencing
Probe design issues with microarrays
‘Dorian Gray effect’ http://www.biomedcentral.com/1471-2105/5/111
‘…mappings are frozen, as a Dorian Gray-like syndrome: the apparent eternal youth of the mapping does not reflect that somewhere the ’picture of it’ decays’
Sequencing data are ‘future proof’
if a new genome version comes along, just re-align the data!
can grab published-data from public repositories and re-align to your own choice of genome / transcripts and aligner
Limited number of novel findings from microarays
can’t find what you’re not looking for!
Genome coverage
some areas of genome are problematic to design probes for
Maturity of analysis techniques
on the other hand, analysis methods and workflows for microarrays are well-established
until recently…
The cost of sequencing
Reports of the death of microarrays
What are NGS data?
Different terminologies for same thing
N ext G eneration S equencing
H igh-T hroughput S equencing
2nd Generation Sequencing
Massively Parallel Sequencing
Also different library preparation
RNA-seq *
ChIP-seq *
Exome-seq
DNA-seq
Methyl-seq
…..
Paired-end
Multiplexing
Image processing
Sequencing produces high-resolution TIFF images; not unlike microarray data
100 tiles per lane, 8 lanes per flow cell, 100 cycles
4 images (A,G,C,T) per tile per cycle = 320,000 images
Each TIFF image ~ 7Mb = 2,240,000 Mb of data (2.24TB )
Image processing
“Uses the raw TIF files to locate clusters on the image, and outputs the cluster intensity, X,Y positions, and an estimate of the noise for each cluster. The output from image analysis provides the input for base calling.”
You will never have to do this
In fact, the TIFF images are deleted by the instrument
Base-calling
“Uses cluster intensities and noise estimate to output the sequence of bases read from each cluster, along with a confidence level for each base.”
You will never have to do this
Alignment
Locating where each generated sequence came from in the genome
Outside the scope of this course
Usually perfomed automatically by a sequencing service
For most of what follows in the course, we will assume alignment has been performed and we are dealing with aligned data
Post-processing of aligned files
Marking of PCR duplicates
PCR amplification errors can cause some sequences to be over-represented
Chances of any two sequences aligning to the same position are unlikely
Caveat: obviously this depends on amount of the genome you are capturing
Such reads are marked but not usually removed from the data
Most downstream methods will ignore such reads
Typically, picard is used
Sorting
Reads can be sorted according to genomic position
Indexing
Data formats
Raw reads - fastq
The most basic file type you will see is fastq
Data in public-repositories (e.g. Short Read Archive, GEO) tend to be in this format
This represents all sequences created after imaging process
Each sequence is described over 4 lines
No standard file extension. .fq , .fastq , .sequence.txt
Essentially they are text files
Can be manipulated with standard unix tools; e.g. cat , head , grep , more , less
They can be compressed and appear as .fq.gz
Same format regardless of sequencing protocol (i.e. RNA-seq, ChIP-seq, DNA-seq etc)
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
~ 250 Million reads (sequences) per Hi-Seq lane
Fastq sequence names
@HWUSI-EAS100R:6:73:941:1973#0/1
The name of the sequencer (HWUSI-EAS100R)
The flow cell lane (6)
Tile number with the lane (73)
x co-ordinate within the tile (941)
y co-ordinate within the tile (1973)
#0 index number for a multiplexed sample
/1; the member of a pair, /1 or /2 (paired-end or mate-pair reads only)
Fastq quality scores
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
Quality scores Q = − 10 l o g 10 p
Q = 30, p=0.001
Q = 20, p=0.01
Q = 10, p=0.1
These numeric quanties are encoded as ASCII code
Sometimes an offset is used before encoding
Fastq quality scores
Useful for quality control
Based on these plots we may want to trim our data
Aligned reads - sam
HWI-ST1001:137:C12FPACXX:7:1115:14131:66670 0 chr1 12805 1 42M4I5M *
0 0 TTGGATGCCCCTCCACACCCTCTTGATCTTCCCTGTGATGTCACCAATATG
CCCFFFFFHHGHHJJJJJHJJJJJJJJJJJJJJJJIJJJJJJJJJJJJIJJ
AS:i:-28 XN:i:0 XM:i:2 XO:i:1XG:i:4 NM:i:6 MD:Z:2C41C2 YT:Z:UU NH:i:3
CC:Z:chr15 CP:i:102518319 XS:A:+ HI:i:0
Sam file flags
Aligned reads - bam
Exactly the same information as a sam file
..except that it is binary version of sam
compressed around x4
Attempting to read will print garbage to the screen
bam files can be indexed
Produces an index file with the same name as the bam file, but with .bai extension
samtools view mysequences.bam | head
N.B The sequences can be extracted by various tools to give fastq
Aligned files in IGV
Once our bam files have been indexed we can view them in IGV
This is highly recommended
Why use R and Bioconductor?
Legacy from the good old days
Many of the lessons learnt still applicable
Still need to do quality assessment, normalisation, statistical testing etc
Not to mention Experimental Design (see later)
Respected software / methods
Respected developers
Established community
Packages to access genome annotation
Publication-quality graphs are possible
Reproducibility
Why reproducibility?
Others should be able to repeat our analyses and come to same conclusions
If they can’t, there can be problems…..
Time for Forensic Bioinformatics
VIDEO
Useful data types in R
A data-frame-like structure with ‘genes’ as rows and samples as columns
respects usual conventions of subsetting etc
convenient functions for accessing parts of the object. e.g. exprs
pData
nki
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 24481 features, 337 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: NKI_4 NKI_6 ... NKI_404 (337 total)
## varLabels: samplename dataset ... e.os (21 total)
## varMetadata: labelDescription
## featureData
## featureNames: Contig45645_RC Contig44916_RC ... Contig15167_RC
## (24481 total)
## fvarLabels: probe EntrezGene.ID ... Description (10 total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation: rosetta
Data visualisation
We can check for biases and outliers using a boxplot
boxplot (expressionMatrix,outline= FALSE )
Analysis methods
With data in this format, we can easily perform statistical testing
geneExpression[1 :10 ]
## NKI_4 NKI_6 NKI_7 NKI_8 NKI_9 NKI_11 NKI_12 NKI_13 NKI_14 NKI_17
## -0.007 0.074 -0.767 -0.820 -0.180 -0.296 NaN -0.163 0.059 -0.035
myGroup[1 :10 ]
## [1] 1 1 0 0 1 1 0 1 1 1
t.test (geneExpression,myGroup)
##
## Welch Two Sample t-test
##
## data: geneExpression and myGroup
## t = -25.65, df = 617.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.1125 -0.9543
## sample estimates:
## mean of x mean of y
## -0.2945 0.7389
limma is highly-regarded method for analysing differential expression
It needs a gene expression matrix and experimental design
Clustering example
dist <- dist (t (expressionMatrix))
clust <- hclust (dist)
plot (clust)
Principal Components Analysis
pca <- prcomp (dist)
plot (pca$rotation[,1 ],pca$rotation[,2 ],col= sampleCols,pch= 16 )
Heatmaps
heatmap (as.matrix (expressionMatrix[topDE,]),ColSideColors = sampleCols)
Conclusion
Having data in this tabular format opens-up the possibility of analysis in Bioconductor
Re-using existing methods is a good thing
Heavy-lifting and computationally-expensive tasks are usually performed in another language / tool
As we will see, we can deal with aligned reads and sequences directly
If we have aligned data then we are only a couple of steps away from being able to perform downstream analysis in R
Count number of reads assigned to each gene, exon, whatever intervals you like
Normalise, quality assessment etc
statistical testing
Course overview
Representing and importing data; fastq and bam
Statistical foundations
RNA-seq; differential expression
Genome Annotation
ChIP-seq
Hopefully we will cover enough of the Bioconductor basics that you can try out other packages. See here for full list;
Also
Practical 1: Introduction to Bioconductor
Space, Right Arrow or swipe left to move to next slide, click help below for more details