Microarrays vs sequencing

Probe design issues with microarrays
- ‘Dorian Gray effect’ http://www.biomedcentral.com/1471-2105/5/111
- ‘…mappings are frozen, as a Dorian Gray-like syndrome: the apparent eternal youth of the mapping does not reflect that somewhere the ’picture of it’ decays’
Sequencing data are ‘future proof’
- if a new genome version comes along, just re-align the data!
- can grab published-data from public repositories and re-align to your own choice of genome / transcripts and aligner
Limited number of novel findings from microarays
- can’t find what you’re not looking for!
Genome coverage
- some areas of genome are problematic to design probes for
Maturity of analysis techniques
- on the other hand, analysis methods and workflows for microarrays are well-established
- until recently…

The cost of sequencing

costs

Reports of the death of microarrays

microarray-dead

Reports of the death of microarrays. Greatly exagerated?

http://core-genomics.blogspot.co.uk/2014/08/seqc-kills-microarrays-not-quite.html

hadfield-blog

Different terminologies for same thing

Next Generation Sequencing
High-Throughput Sequencing
2nd Generation Sequencing
Massively Parallel Sequencing
Also different library preparation
- RNA-seq *
- ChIP-seq *
- Exome-seq
- DNA-seq
- Methyl-seq
- …..

Illumina sequencing *

Employs a ‘sequencing-by-synthesis’ approach

http://www.illumina.com/content/dam/illumina-marketing/documents/products/illumina_sequencing_introduction.pdf

* Other sequencing technologies are available

Illumina sequencing

seq1

http://www.illumina.com/content/dam/illumina-marketing/documents/products/illumina_sequencing_introduction.pdf

Illumina sequencing

seq2

http://www.illumina.com/content/dam/illumina-marketing/documents/products/illumina_sequencing_introduction.pdf

Illumina sequencing

seq3

http://www.illumina.com/content/dam/illumina-marketing/documents/products/illumina_sequencing_introduction.pdf

Image processing

Sequencing produces high-resolution TIFF images; not unlike microarray data
100 tiles per lane, 8 lanes per flow cell, 100 cycles
4 images (A,G,C,T) per tile per cycle = 320,000 images
Each TIFF image ~ 7Mb = 2,240,000 Mb of data (2.24TB)

cluster

Image processing

Firecrest

firecrest

“Uses the raw TIF files to locate clusters on the image, and outputs the cluster intensity, X,Y positions, and an estimate of the noise for each cluster. The output from image analysis provides the input for base calling.”
- http://openwetware.org/wiki/BioMicroCenter:IlluminaDataPipeline
You will never have to do this
- In fact, the TIFF images are deleted by the instrument

Base-calling

Bustard

bustard

“Uses cluster intensities and noise estimate to output the sequence of bases read from each cluster, along with a confidence level for each base.”
- http://openwetware.org/wiki/BioMicroCenter:IlluminaDataPipeline
You will never have to do this

Alignment

Locating where each generated sequence came from in the genome
Outside the scope of this course
Usually perfomed automatically by a sequencing service
For most of what follows in the course, we will assume alignment has been performed and we are dealing with aligned data
- Popular aligners
- bwa http://bio-bwa.sourceforge.net/
- bowtie http://bowtie-bio.sourceforge.net/index.shtml
- novoalign http://www.novocraft.com/products/novoalign/
- stampy http://www.well.ox.ac.uk/project-stampy
- many, many more…..

Post-processing of aligned files

Marking of PCR duplicates
- PCR amplification errors can cause some sequences to be over-represented
- Chances of any two sequences aligning to the same position are unlikely
- Caveat: obviously this depends on amount of the genome you are capturing
- Such reads are marked but not usually removed from the data
- Most downstream methods will ignore such reads
- Typically, picard is used
Sorting
- Reads can be sorted according to genomic position
Indexing
- Allow efficient access

Raw reads - fastq

The most basic file type you will see is fastq
- Data in public-repositories (e.g. Short Read Archive, GEO) tend to be in this format
This represents all sequences created after imaging process
Each sequence is described over 4 lines
No standard file extension. .fq, .fastq, .sequence.txt
Essentially they are text files
- Can be manipulated with standard unix tools; e.g. cat, head, grep, more, less
They can be compressed and appear as .fq.gz
Same format regardless of sequencing protocol (i.e. RNA-seq, ChIP-seq, DNA-seq etc)

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

~ 250 Million reads (sequences) per Hi-Seq lane

Fastq sequence names

@HWUSI-EAS100R:6:73:941:1973#0/1

The name of the sequencer (HWUSI-EAS100R)
The flow cell lane (6)
Tile number with the lane (73)
x co-ordinate within the tile (941)
y co-ordinate within the tile (1973)
#0 index number for a multiplexed sample
/1; the member of a pair, /1 or /2 (paired-end or mate-pair reads only)

Fastq quality scores

!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

Quality scores \[ Q = -10log_{10}p\]
- Q = 30, p=0.001
- Q = 20, p=0.01
- Q = 10, p=0.1
These numeric quanties are encoded as ASCII code
- Sometimes an offset is used before encoding

Fastq quality scores

phred

Useful for quality control

FastQC, from Babraham Bioinformatics Core; http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

fastqc

Based on these plots we may want to trim our data
- A popular choice is trimmomatic http://www.usadellab.org/cms/index.php?page=trimmomatic

Aligned reads - sam

Sequence Alignment Matrix (sam) http://samtools.github.io/hts-specs/SAMv1.pdf

Header lines followed by tab-delimited lines

Header gives information about the alignment and references sequences used

@HD     VN:1.0  SO:coordinate
@SQ     SN:chr1 LN:249250621
@SQ     SN:chr10        LN:135534747
@SQ     SN:chr11        LN:135006516

HWI-ST1001:137:C12FPACXX:7:1115:14131:66670     0       chr1    12805   1       42M4I5M *
0       0       TTGGATGCCCCTCCACACCCTCTTGATCTTCCCTGTGATGTCACCAATATG     
CCCFFFFFHHGHHJJJJJHJJJJJJJJJJJJJJJJIJJJJJJJJJJJJIJJ     
AS:i:-28        XN:i:0  XM:i:2  XO:i:1XG:i:4   NM:i:6  MD:Z:2C41C2     YT:Z:UU NH:i:3  
CC:Z:chr15      CP:i:102518319  XS:A:+  HI:i:0

http://homer.salk.edu/homer/basicTutorial/samfiles.html
Large size on disk; ~100s of Gb
- Can be manipulated with standard unix tools; e.g. cat, head, grep, more, less

Sam format - key columns

HWI-ST1001:137:C12FPACXX:7:1115:14131:66670     0       chr1    12805   1       42M4I5M *
0       0       TTGGATGCCCCTCCACACCCTCTTGATCTTCCCTGTGATGTCACCAATATG     
CCCFFFFFHHGHHJJJJJHJJJJJJJJJJJJJJJJIJJJJJJJJJJJJIJJ     
AS:i:-28        XN:i:0  XM:i:2  XO:i:1XG:i:4   NM:i:6  MD:Z:2C41C2     YT:Z:UU NH:i:3  
CC:Z:chr15      CP:i:102518319  XS:A:+  HI:i:0

sam

http://samtools.github.io/hts-specs/SAMv1.pdf
- Read name
- Chromosome
- Position
- Mapping quality
- etc…

Sam file flags

Represent useful QC information
- Read is unmapped
- Read is paired / unpaired
- Read failed QC
- Read is a PCR duplicate
https://broadinstitute.github.io/picard/explain-flags.html

sam-flags

Aligned reads - bam

Exactly the same information as a sam file
..except that it is binary version of sam
compressed around x4
Attempting to read will print garbage to the screen
bam files can be indexed
- Produces an index file with the same name as the bam file, but with .bai extension

samtools view mysequences.bam | head

N.B The sequences can be extracted by various tools to give fastq

samtools flagstat

Useful command-line tool as part of samtools

$ samtools flagstat NA19914.chr22.bam
2109857 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplimentary
40096 + 0 duplicates
2064356 + 0 mapped (97.84%:-nan%)
2011540 + 0 paired in sequencing
1005911 + 0 read1
1005629 + 0 read2
1903650 + 0 properly paired (94.64%:-nan%)
1920538 + 0 with itself and mate mapped
45501 + 0 singletons (2.26%:-nan%)
5134 + 0 with mate mapped to a different chr
4794 + 0 with mate mapped to a different chr (mapQ>=5)

Aligned files in IGV

Once our bam files have been indexed we can view them in IGV
This is highly recommended

igv

Why use R and Bioconductor?

Legacy from the good old days

Many of the lessons learnt still applicable
- Still need to do quality assessment, normalisation, statistical testing etc
- Not to mention Experimental Design (see later)
Respected software / methods
Respected developers
Established community
Packages to access genome annotation
Publication-quality graphs are possible
Reproducibility

Why reproducibility?

Others should be able to repeat our analyses and come to same conclusions
If they can’t, there can be problems…..
Time for Forensic Bioinformatics

Useful data types in R

A data-frame-like structure with ‘genes’ as rows and samples as columns
- respects usual conventions of subsetting etc
- convenient functions for accessing parts of the object. e.g. exprs pData

nki

## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 24481 features, 337 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: NKI_4 NKI_6 ... NKI_404 (337 total)
##   varLabels: samplename dataset ... e.os (21 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: Contig45645_RC Contig44916_RC ... Contig15167_RC
##     (24481 total)
##   fvarLabels: probe EntrezGene.ID ... Description (10 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation: rosetta

Data visualisation

We can check for biases and outliers using a boxplot

boxplot(expressionMatrix,outline=FALSE)

plot of chunk unnamed-chunk-4

Analysis methods

With data in this format, we can easily perform statistical testing

geneExpression[1:10]

##  NKI_4  NKI_6  NKI_7  NKI_8  NKI_9 NKI_11 NKI_12 NKI_13 NKI_14 NKI_17 
## -0.007  0.074 -0.767 -0.820 -0.180 -0.296    NaN -0.163  0.059 -0.035

myGroup[1:10]

##  [1] 1 1 0 0 1 1 0 1 1 1

t.test(geneExpression,myGroup)

## 
##  Welch Two Sample t-test
## 
## data:  geneExpression and myGroup
## t = -25.65, df = 617.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.1125 -0.9543
## sample estimates:
## mean of x mean of y 
##   -0.2945    0.7389

limma is highly-regarded method for analysing differential expression
- It needs a gene expression matrix and experimental design

Clustering example

dist <- dist(t(expressionMatrix))
clust <- hclust(dist)
plot(clust)

plot of chunk unnamed-chunk-8

Principal Components Analysis

pca <- prcomp(dist)
plot(pca$rotation[,1],pca$rotation[,2],col=sampleCols,pch=16)

plot of chunk unnamed-chunk-9

Heatmaps

heatmap(as.matrix(expressionMatrix[topDE,]),ColSideColors = sampleCols)

plot of chunk unnamed-chunk-11

Conclusion

Having data in this tabular format opens-up the possibility of analysis in Bioconductor
Re-using existing methods is a good thing
Heavy-lifting and computationally-expensive tasks are usually performed in another language / tool
As we will see, we can deal with aligned reads and sequences directly
If we have aligned data then we are only a couple of steps away from being able to perform downstream analysis in R
- Count number of reads assigned to each gene, exon, whatever intervals you like
- Normalise, quality assessment etc
- statistical testing

Course overview

Representing and importing data; fastq and bam
Statistical foundations
RNA-seq; differential expression
Genome Annotation
ChIP-seq
Hopefully we will cover enough of the Bioconductor basics that you can try out other packages. See here for full list;
- http://bioconductor.org/packages/release/BiocViews.html#___Sequencing
- http://bioconductor.org/developers/new_packages/
  - Don’t forget that each package has a vignette
Also
- http://seqanswers.com/forums/index.php
- https://www.biostars.org/

Practical 1: Introduction to Bioconductor

Introduction to NGS analysis with R

Historical overview

Probe and target

Two-colour arrays

Single-Channel

Look at a ‘modern’ microarray

Use in cancer research

Microarrays vs sequencing

The cost of sequencing

Reports of the death of microarrays

Reports of the death of microarrays. Greatly exagerated?

What are NGS data?

Different terminologies for same thing

Illumina sequencing *

Illumina sequencing

Illumina sequencing

Illumina sequencing

Paired-end

Multiplexing

Image processing

Image processing

Base-calling

Alignment

Post-processing of aligned files

Data formats

Raw reads - fastq

Fastq sequence names

Fastq quality scores

Fastq quality scores

Useful for quality control

Aligned reads - sam

Sam format - key columns

Sam file flags

Aligned reads - bam

samtools flagstat

Aligned files in IGV

Why use R and Bioconductor?

Legacy from the good old days

Why reproducibility?

Useful data types in R

Data visualisation

Analysis methods

Clustering example

Principal Components Analysis

Heatmaps

Conclusion

Course overview