Original Authors: Belinda Phipson, Anna Trigos, Matt Ritchie, Maria Doyle, Harriet Dashnow, Charity Law Based on the course RNAseq analysis in R delivered on May 11/12th 2016
This material has been created using the following resources:
http://www.statsci.org/smyth/pubs/QLedgeRPreprint.pdf (Lun, Chen, and Smyth 2016)
http://monashbioinformaticsplatform.github.io/RNAseq-DE-analysis-with-R/99-RNAseq_DE_analysis_with_R.html
Data files downloaded from:
ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE60nnn/GSE60450/suppl/GSE60450_Lactation-GenewiseCounts.txt.gz http://bioinf.wehi.edu.au/software/MSigDB/mouse_c2_v5.rdata http://bioinf.wehi.edu.au/software/MSigDB/mouse_H_v5.rdata
Data files:
sampleinfo.txt
GSE60450_Lactation-GenewiseCounts.txt
mouse_c2_v5.rdata
mouse_H_v5.rdata
Data files available from: https://figshare.com/s/1d788fd384d33e913a2a You should download these files and place them in your /data
directory.
Now that we are happy that we have normalised the data and that the quality looks good, we can continue to testing for differentially expressed genes. There are a number of packages to analyse RNA-Seq data. Most people use DESEQ2 or edgeR. We will use edgeR for the rest of this practical.
**First make sure we have all the objects and libraries loaded*
library(edgeR)
library(limma)
library(Glimma)
library(gplots)
library(org.Mm.eg.db)
load("Robjects/preprocessing.Rdata")
The previous section walked-through the pre-processing and transformation of the count data. Here, for completeness, we list the minimal steps required to process the data prior to differential expression analysis.
## Read the counts from the downloaded data
seqdata <- read.delim("data/GSE60450_Lactation-GenewiseCounts.txt", stringsAsFactors = FALSE)
#
# Remove first two columns from seqdata
countdata <- seqdata[,-(1:2)]
# Store EntrezGeneID as rownames
rownames(countdata) <- seqdata[,1]
countdata
colnames(countdata) <- substr(colnames(countdata), 1, 7)
countdata
## Calculate the Counts Per Million measure
myCPM <- cpm(countdata)
## Identify genes with at least 0.5 cpm in at least 2 samples
thresh <- myCPM > 0.5
keep <- rowSums(thresh) >= 2
# Subset the rows of countdata to keep the more highly expressed genes
counts.keep <- countdata[keep,]
## Convert to an edgeR object
dgeObj <- DGEList(counts.keep)
## Perform TMM normalisation
dgeObj <- calcNormFactors(dgeObj)
## Obtain corrected sample information
sampleinfo <- read.delim("data/SampleInfo_Corrected.txt")
group <- paste(sampleinfo$CellType,sampleinfo$Status,sep=".")
group
First we need to create a design matrix for the groups, as we have seen in the linear models lecture. We have two variables, status and cell type. We will fit two models under two assumptions; no interaction and interaction of these two factors.
Let’s start with the model with only main effects, that is no interaction. The main assumption here is that the effect of the status is the same in both type of cells.
design
(Intercept) typeluminal statuspregnant statusvirgin
1 1 0 0 1
2 1 0 0 1
3 1 0 1 0
4 1 0 1 0
5 1 0 0 0
6 1 0 0 0
7 1 1 0 1
8 1 1 0 1
9 1 1 1 0
10 1 1 1 0
11 1 1 0 0
12 1 1 0 0
attr(,"assign")
[1] 0 1 2 2
attr(,"contrasts")
attr(,"contrasts")$type
[1] "contr.treatment"
attr(,"contrasts")$status
[1] "contr.treatment"
An MDS plot shows distances, in terms of biological coefficient of variation (BCV), between samples. What do you think of the quality of the data? Can you anticipate if the interaction term will be important?
The common dispersion estimates the overall BCV of the dataset, averaged over all genes:
dgeObj <- estimateCommonDisp(dgeObj)
Then we estimate gene-wise dispersion estimates, allowing a possible trend with averge count size:
dgeObj <- estimateGLMTrendedDisp(dgeObj)
dgeObj <- estimateTagwiseDisp(dgeObj)
Plot the estimated dispersions:
r
r plotBCV(dgeObj)
First, we fit genewise glms:
# Fit the linear model
fit <- glmFit(dgeObj, design)
names(fit)
[1] "coefficients" "fitted.values" "deviance"
[4] "iter" "failed" "method"
[7] "counts" "unshrunk.coefficients" "df.residual"
[10] "design" "offset" "dispersion"
[13] "prior.count" "samples" "prior.df"
[16] "AveLogCPM"
head(coef(fit))
(Intercept) typeluminal statuspregnant statusvirgin
497097 -11.187922 -7.58804851 -0.7085514 -0.09305118
20671 -12.715063 -1.85287334 0.2269001 0.49554506
27395 -11.221391 0.56368066 -0.1415910 -0.29221577
18777 -10.146793 0.08280255 -0.1845489 -0.48795441
21399 -9.909825 -0.24195503 0.1753606 0.13494615
58175 -16.310131 3.09936215 1.1975518 0.84742701
Conduct likelihood ratio tests for luminal vs basal and show the top genes:
r
r lrt.BvsL <- glmLRT(fit, coef=2) topTags(lrt.BvsL)
Coefficient: typeluminal
logFC logCPM LR PValue FDR
110308 -8.940578 10.264297 24.89789 6.044844e-07 0.0004377961
50916 -8.636503 5.749781 24.80037 6.358512e-07 0.0004377961
12293 -8.362247 6.794788 24.68526 6.749827e-07 0.0004377961
56069 -8.419433 6.124377 24.41532 7.764861e-07 0.0004377961
24117 -9.290691 6.757163 24.32506 8.137331e-07 0.0004377961
12818 -8.216790 8.172247 24.24233 8.494462e-07 0.0004377961
22061 -8.034712 7.255370 24.16987 8.820135e-07 0.0004377961
12797 -9.001419 9.910795 24.12854 9.011487e-07 0.0004377961
50706 -7.697022 10.809629 24.06926 9.293193e-07 0.0004377961
237979 -8.167451 5.215921 24.03528 9.458678e-07 0.0004377961
Challenge
Conduct likelihood ratio tests for virgin vs lactate and show the top genes.
Suppose we want to find differentially expressed genes between pregnant and virgin. We don’t have a parameter that explicitly will allow us to test that hypothesis. We need to build a contrast:
r
r PvsV <- makeContrasts(statuspregnant-statusvirgin, levels=design)
Renaming (Intercept) to Intercept
r
r lrt.pVsV <- glmLRT(fit, contrast=PvsV) topTags(lrt.pVsV)
Coefficient: 1*statuspregnant -1*statusvirgin
logFC logCPM LR PValue FDR
232016 -4.593938 6.389781 12.40835 0.0004274195 1
236643 -6.608868 -1.481424 12.34832 0.0004413842 1
19283 -4.502691 4.318145 10.72832 0.0010550833 1
12993 7.359664 15.507688 10.65284 0.0010990179 1
15903 -3.746794 5.065876 10.53470 0.0011715416 1
544963 -2.988110 4.975414 10.37943 0.0012742723 1
17755 -2.649741 6.118883 10.36888 0.0012815732 1
16687 -5.696048 4.958824 10.12211 0.0014650148 1
381217 3.213414 4.748535 10.08264 0.0014967284 1
14663 7.581478 14.480767 10.01154 0.0015556252 1
Challenge
1.Fit a model with interaction: What is the rationale to include the interaction (What assumption are you relaxing?) 2. Is the number of replicates good enough to include the interaction? 3. Is the interaction needed in the model?
Solution
save(lrt.BvsL,dgeObj,group,file="Robjects/DE.Rdata")
Lun, Aaron T L, Yunshun Chen, and Gordon K Smyth. 2016. “It’s DE-licious: A Recipe for Differential Expression Analyses of RNA-seq Experiments Using Quasi-Likelihood Methods in edgeR.” Methods in Molecular Biology (Clifton, N.J.) 1418 (January): 391–416. doi:10.1007/978-1-4939-3578-9\_19.