July 2020
Counting estimates the relative counts for each gene
Does this accurately represent the original population of RNAs?
The relationship between counts and RNA expression is not the same for all genes across all samples
Library Size
Differing sequencing depth
Gene properties
Length, GC content, sequence
Library composition
Highly expressed genes overrepresented at the cost of lowly expressed genes
"Composition Bias"
Library Size
Normalise each sample by total number of reads sequenced.
Can also use another summary statistic in place of the total count, e.g. median or upper quartile
Does not account for composition bias
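The scaling described above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular package: the gene names, the counts and the `scale_counts` helper are all invented for the example. It also shows why this kind of scaling does not account for composition bias.

```python
from statistics import median

# Hypothetical raw counts: gene -> [sample 1, sample 2] (made-up numbers).
# geneA, geneB and geneC have the same raw counts in both samples; geneD is
# strongly up-regulated in sample 2 and inflates that sample's total.
counts = {
    "geneA": [100, 100],
    "geneB": [50, 50],
    "geneC": [10, 10],
    "geneD": [40, 240],
}

def scale_counts(counts, stat=sum, per=1_000_000):
    """Divide each sample's counts by a per-sample statistic (default: total)."""
    n_samples = len(next(iter(counts.values())))
    # One scaling statistic per sample, computed over all genes in that sample.
    sizes = [stat([row[j] for row in counts.values()]) for j in range(n_samples)]
    return {
        gene: [row[j] / sizes[j] * per for j in range(n_samples)]
        for gene, row in counts.items()
    }

cpm = scale_counts(counts)                            # counts per million
by_median = scale_counts(counts, stat=median, per=1)  # median as the statistic

for gene, values in cpm.items():
    print(gene, values)
```

In this toy data geneA, geneB and geneC have identical raw counts in both samples, yet after total-count scaling they appear halved in sample 2, because geneD takes up a larger share of that library.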
Comparing feature abundance under different conditions
Assumes linearity of signal
When feature=gene, well-established pre- and post-analysis strategies exist
Mortazavi, A. et al. (2008) Nature Methods
Simple difference in means
Replication introduces variation
Normal (Gaussian) Distribution - t-test
Two parameters - \(mean\) and \(sd\) (\(sd^2 = variance\))
Suitable for microarray data but not for RNAseq data
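To make the comparison of means with replicates concrete, here is a minimal sketch of a two-sample t statistic. It assumes Welch's unequal-variance form, which the slides do not specify, and the expression values are invented.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(x, y):
    """Difference in means divided by its standard error (Welch's form)."""
    # variance() is the sample variance (n - 1 denominator).
    standard_error = sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / standard_error

# Hypothetical continuous expression values for one gene, three replicates each.
treated = [10.0, 12.0, 14.0]
control = [4.0, 6.0, 8.0]

t_stat = welch_t(treated, control)
print(t_stat)
```

The same difference in means gives a smaller statistic when the replicates are more variable, which is how replication enters the test.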
Count data - Poisson distribution
One parameter - \(mean\) \((\lambda)\)
\(variance\) = \(mean\)
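The single-parameter property can be checked numerically. The sketch below sums over the Poisson probabilities directly; the function names and the choice of \(\lambda = 5\) are only for illustration.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of observing count k when the mean is lam."""
    return exp(-lam) * lam**k / factorial(k)

def mean_and_variance(lam, k_max=100):
    # Sum over counts 0..k_max; the tail beyond k_max is negligible for small lam.
    probs = [poisson_pmf(k, lam) for k in range(k_max + 1)]
    mean = sum(k * p for k, p in enumerate(probs))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    return mean, var

m, v = mean_and_variance(5.0)
print(round(m, 6), round(v, 6))
```

Both values come out equal to \(\lambda\), so a Poisson model has no way to represent extra variation between replicates.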
Use the Negative Binomial distribution
In the NB distribution, \(mean\) is not equal to \(variance\)
Two parameters - \(mean\) and \(dispersion\)
\(dispersion\) describes how \(variance\) changes with \(mean\)
Anders, S. & Huber, W. (2010) Genome Biology
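For reference, in the parameterisation used by DESeq2 the relationship between the two parameters and the variance can be written as:

\[variance = mean + dispersion \times mean^2\]

With \(dispersion = 0\) this reduces to the Poisson case, \(variance = mean\). As an illustrative calculation, a gene with \(mean = 100\) and \(dispersion = 0.1\) has \(variance = 100 + 0.1 \times 100^2 = 1100\), far larger than the Poisson value of 100.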
Estimating the dispersion parameter can be difficult with a small number of samples
DESeq2 models the variance as the sum of technical and biological variance
Estimate dispersion for each gene
‘Share’ dispersion information between genes to obtain fitted estimate
Shrink gene-wise estimates towards the fitted estimates
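The three steps above can be sketched in a deliberately simplified form. This is not DESeq2's actual procedure: the gene-wise estimate here is a method-of-moments value based on \(variance = mean + dispersion \times mean^2\), the "shared" value is just the median across genes rather than a fitted trend, and the shrinkage is a fixed-weight average on the log scale. The counts are invented and assumed to be already normalised for library size.

```python
from math import exp, log
from statistics import mean, median, variance

def gene_dispersion(counts, floor=1e-8):
    """Method-of-moments dispersion: (variance - mean) / mean^2, floored above 0."""
    m = mean(counts)
    v = variance(counts)
    return max((v - m) / m**2, floor)

def shrink(gene_estimate, shared_estimate, weight=0.5):
    """Move a gene-wise estimate towards the shared one on the log scale.

    weight is a hypothetical tuning value: 0 keeps the gene-wise estimate,
    1 replaces it with the shared estimate.
    """
    return exp((1 - weight) * log(gene_estimate) + weight * log(shared_estimate))

# Hypothetical normalised counts, three replicates per gene.
normalised = {
    "geneA": [80, 100, 120],
    "geneB": [10, 20, 30],
    "geneC": [500, 1000, 1500],
}

gene_wise = {g: gene_dispersion(c) for g, c in normalised.items()}
shared = median(gene_wise.values())  # stand-in for the fitted estimate
shrunk = {g: shrink(d, shared) for g, d in gene_wise.items()}

for g in normalised:
    print(g, gene_wise[g], shrunk[g])
```

With only three replicates the gene-wise values are noisy, so pulling each one towards a value informed by all genes gives more stable estimates.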
Calculate coefficients describing change in gene expression
Linear Model \(\rightarrow\) Generalized Linear Model
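As a sketch of what the generalized linear model looks like for count data, the form described for DESeq2 can be written as follows. The symbols are introduced here for illustration and do not appear elsewhere in these slides.

\[K_{ij} \sim \mathrm{NB}(\mu_{ij}, \alpha_i), \qquad \mu_{ij} = s_j \, q_{ij}, \qquad \log_2 q_{ij} = \sum_r x_{jr} \beta_{ir}\]

Here \(K_{ij}\) is the count for gene \(i\) in sample \(j\), \(\mu_{ij}\) is its \(mean\), \(\alpha_i\) is the gene's \(dispersion\), \(s_j\) is a normalisation factor for sample \(j\), \(x_{jr}\) are the entries of the design matrix, and \(\beta_{ir}\) are the coefficients describing the change in gene expression on the \(\log_2\) scale. The ordinary linear model is replaced by a log link and a Negative Binomial error distribution.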
http://software.broadinstitute.org/gsea