The previous section walked-through the pre-processing and transformation of the count data. Here, for completeness, we list the minimal steps required to process the data prior to differential expression analysis.
# Read the sample information into a data frame
sampleinfo <- read_tsv("results/SampleInfo_Corrected.txt")
# Read the data into R
seqdata <- read_tsv("data/GSE60450_Lactation.featureCounts", comment = "#")
# Transform the data to matrix of counts
countdata <- as.data.frame(seqdata) %>%
column_to_rownames("Geneid") %>% # turn the geneid column into rownames
rename_all(str_remove, ".bam") %>% # remove the ".bam" from the column names
select(sampleinfo$Sample) %>% # keep sample columns using sampleinfo
as.matrix()
# filter the data to remove genes with few counts
keep <- rowSums(countdata) > 5
countdata <- countdata[keep,]
Alternatively we can load the `objects with the RData file we created in the pre-processing tutorial.
First load the packages we need.
library(tidyverse)
library(DESeq2)
# load the RData object we created in the previous session
load("Robjects/preprocessing.RData")
ls()
## [1] "countdata" "sampleinfo"
dim(countdata)
## [1] 25369 12
sampleinfo
## # A tibble: 12 x 4
## FileName Sample CellType Status
## <chr> <chr> <chr> <chr>
## 1 MCL1.DG.bam MCL1.DG basal virgin
## 2 MCL1.DH.bam MCL1.DH basal virgin
## 3 MCL1.DI.bam MCL1.DI basal pregnant
## 4 MCL1.DJ.bam MCL1.DJ basal pregnant
## # … with 8 more rows
Now that we are happy that the quality of the data looks good, we can proceed to testing for differentially expressed genes. There are a number of packages to analyse RNA-Seq data. Most people use DESeq2 or edgeR. They are both equally applicable. There is an informative and honest blog post here by Mike Love, one of the authors of DESeq2, about deciding which to use.
We will use DESeq2 for the rest of this practical.
First we need to create a design model formula for our analysis. DESeq2
will use this to generate the model matrix, as we have seen in the linear models lecture.
We have two variables in our experiment: “Status” and “Cell Type”.
We will fit two models under two assumptions: no interaction and interaction of these two factors, however, to demonstrate the how DESeq2 is used we will start with a simple model which considers Status but ignores Cell Type
# Use the standard R 'formula' syntax for an additive model
design <- as.formula(~ Status)
What does this look like as a model matrix?
modelMatrix <- model.matrix(design, data = sampleinfo)
modelMatrix
## (Intercept) Statuspregnant Statusvirgin
## 1 1 0 1
## 2 1 0 1
## 3 1 1 0
## 4 1 1 0
## 5 1 0 0
## 6 1 0 0
## 7 1 0 1
## 8 1 0 1
## 9 1 1 0
## 10 1 1 0
## 11 1 0 0
## 12 1 0 0
## attr(,"assign")
## [1] 0 1 1
## attr(,"contrasts")
## attr(,"contrasts")$Status
## [1] "contr.treatment"
It would be nice if virgin
were the base line/intercept. To get R to use virgin
as the intercept we need to use a factor
. Let’s set factor levels on Status to use virgin as the intercept.
sampleinfo$Status <- factor(sampleinfo$Status,
levels = c("virgin", "pregnant", "lactate"))
modelMatrix <- model.matrix(design, data = sampleinfo)
modelMatrix
## (Intercept) Statuspregnant Statuslactate
## 1 1 0 0
## 2 1 0 0
## 3 1 1 0
## 4 1 1 0
## 5 1 0 1
## 6 1 0 1
## 7 1 0 0
## 8 1 0 0
## 9 1 1 0
## 10 1 1 0
## 11 1 0 1
## 12 1 0 1
## attr(,"assign")
## [1] 0 1 1
## attr(,"contrasts")
## attr(,"contrasts")$Status
## [1] "contr.treatment"
We don’t actually need to pass DESeq2
the model matrix, instead we pass it the design formula and the sampleinfo
it will build the matrix itself.
# create the DESeqDataSet object
ddsObj.raw <- DESeqDataSetFromMatrix(countData = countdata,
colData = sampleinfo,
design = design)
## converting counts to integer mode
DESeq2
work flowThe main DESeq2
work flow is carried out in 3 steps:
First, Calculate the “median ratio” normalisation size factors…
ddsObj <- estimateSizeFactors(ddsObj.raw)
colData(ddsObj.raw)
## DataFrame with 12 rows and 4 columns
## FileName Sample CellType Status
## <character> <character> <character> <factor>
## MCL1.DG MCL1.DG.bam MCL1.DG basal virgin
## MCL1.DH MCL1.DH.bam MCL1.DH basal virgin
## MCL1.DI MCL1.DI.bam MCL1.DI basal pregnant
## MCL1.DJ MCL1.DJ.bam MCL1.DJ basal pregnant
## MCL1.DK MCL1.DK.bam MCL1.DK basal lactate
## ... ... ... ... ...
## MCL1.LB MCL1.LB.bam MCL1.LB luminal virgin
## MCL1.LC MCL1.LC.bam MCL1.LC luminal pregnant
## MCL1.LD MCL1.LD.bam MCL1.LD luminal pregnant
## [ reached getOption("max.print") -- omitted 2 rows ]
colData(ddsObj)
## DataFrame with 12 rows and 5 columns
## FileName Sample CellType Status sizeFactor
## <character> <character> <character> <factor> <numeric>
## MCL1.DG MCL1.DG.bam MCL1.DG basal virgin 1.293508
## MCL1.DH MCL1.DH.bam MCL1.DH basal virgin 1.198757
## MCL1.DI MCL1.DI.bam MCL1.DI basal pregnant 1.225709
## MCL1.DJ MCL1.DJ.bam MCL1.DJ basal pregnant 1.080972
## MCL1.DK MCL1.DK.bam MCL1.DK basal lactate 0.973883
## ... ... ... ... ... ...
## MCL1.LB MCL1.LB.bam MCL1.LB luminal virgin 1.369562
## [ reached getOption("max.print") -- omitted 4 rows ]
We can use plotMA
from limma
to look at the data in an MA plot
logcounts <- log2(countdata + 1)
limma::plotMA(logcounts)
abline(h=0, col="red")
normalizedCounts <- counts(ddsObj, normalized=TRUE)
logNormalizedCounts <- log2(normalizedCounts + 1)
limma::plotMA(logNormalizedCounts)
abline(h=0, col="red")
DESeq2 doesn’t actually normalise the counts, it uses raw counts and includes the size factors in the modelling. Please see the DESeq2 documentation if you’d like more details on exactly how they are incorporated into the algorithm. For practical purposes we can think of it as a normalisation.
… next estimate dispersion …
ddsObj <- estimateDispersions(ddsObj)
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
… finally, apply Negative Binomial GLM fitting and calculate Wald statistics
ddsObj <- nbinomWaldTest(ddsObj)
DESeq
commandIn practice the 3 steps above can be performed in a single step using the DESeq
wrapper function. Performing the three steps separately is useful if you wish to alter the default parameters of one or more steps, otherwise the DESeq
function is fine.
# Run DESeq
ddsObj <- DESeq(ddsObj.raw)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
We can generate a table of differential expression results from the DDS object using the results
function of DESeq2.
res <- results(ddsObj, alpha=0.05)
res
## log2 fold change (MLE): Status lactate vs virgin
## Wald test p-value: Status lactate vs virgin
## DataFrame with 25369 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSMUSG00000051951 194.697167 0.234655 2.732631 0.0858714 0.9315687
## ENSMUSG00000102331 0.557693 2.218766 2.611789 0.8495197 0.3955922
## ENSMUSG00000025900 3.838125 -3.864560 1.954604 -1.9771573 0.0480238
## ENSMUSG00000025902 52.523551 -0.777627 0.992003 -0.7838953 0.4331015
## ENSMUSG00000098104 0.686876 0.567819 1.554715 0.3652239 0.7149443
## padj
## <numeric>
## ENSMUSG00000051951 0.974958
## ENSMUSG00000102331 NA
## ENSMUSG00000025900 0.192165
## ENSMUSG00000025902 0.702193
## ENSMUSG00000098104 NA
## [ reached getOption("max.print") -- omitted 6 rows ]
You will notice that some of the adjusted p-values (padj
) are NA. Remember in Session 2 we said that there is no need to pre-filter the genes as DESeq2 will do this through a process it calls ‘independent filtering’. The genes with NA
are the ones DESeq2
has filtered out.
From DESeq2
manual: “The results function of the DESeq2
package performs independent filtering by default using the mean of normalized counts as a filter statistic. A threshold on the filter statistic is found which optimizes the number of adjusted p values lower than a [specified] significance level”.
The default significance level for independent filtering is 0.1
, however, you should set this to the FDR cut off you are planning to use. We will use 0.05
- this was the purpose of the alpha
argument in the previous command.
results
The results
function has returned the results for the contrast “lactate vs virgin”. Let’s have a look at the model matrix to understand why DESeq2
has given us this particular contrast.
modelMatrix
## (Intercept) Statuspregnant Statuslactate
## 1 1 0 0
## 2 1 0 0
## 3 1 1 0
## 4 1 1 0
## 5 1 0 1
## 6 1 0 1
## 7 1 0 0
## 8 1 0 0
## 9 1 1 0
## 10 1 1 0
## 11 1 0 1
## 12 1 0 1
## attr(,"assign")
## [1] 0 1 1
## attr(,"contrasts")
## attr(,"contrasts")$Status
## [1] "contr.treatment"
By default, results
has returned the contrast encoded by the final column in the model matrix. DESeq2
has the command resultsNames
that allows us to view the contrasts that are available directly from the model matrix.
resultsNames(ddsObj)
## [1] "Intercept" "Status_pregnant_vs_virgin"
## [3] "Status_lactate_vs_virgin"
Let’s just rename res
so that we know which contrast results it contains.
resLvV_status <- res
rm(res)
If we want a different contrast we can just pass the results
function the name of the design matrix column that encodes it. Let’s retrieve the results for pregant versus virgin
resPvV_status <- results(ddsObj,
name="Status_pregnant_vs_virgin",
alpha = 0.05)
resPvV_status
## log2 fold change (MLE): Status pregnant vs virgin
## Wald test p-value: Status pregnant vs virgin
## DataFrame with 25369 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSMUSG00000051951 194.697167 -1.135371 2.733012 -0.415428 0.677828
## ENSMUSG00000102331 0.557693 0.963091 2.679825 0.359386 0.719307
## ENSMUSG00000025900 3.838125 -0.212930 1.695701 -0.125571 0.900072
## ENSMUSG00000025902 52.523551 -0.428587 0.987699 -0.433925 0.664343
## ENSMUSG00000098104 0.686876 -1.466301 1.690059 -0.867603 0.385612
## padj
## <numeric>
## ENSMUSG00000051951 0.963150
## ENSMUSG00000102331 NA
## ENSMUSG00000025900 NA
## ENSMUSG00000025902 0.958286
## ENSMUSG00000098104 NA
## [ reached getOption("max.print") -- omitted 6 rows ]
How many differentially expressed genes are there at FDR < 0.05
sum(resPvV_status$padj < 0.05)
## [1] NA
We need to remove NA
’s as the genes that were filtered out by DESeq2 have NA
in the in the padj
column
sum(resPvV_status$padj < 0.05, na.rm = TRUE)
## [1] 723
Let’s get the top 100 genes by adjusted p-value
topGenesPvV <- as.data.frame(resPvV_status) %>%
rownames_to_column("GeneID") %>%
arrange(padj) %>%
head(100)
topGenesPvV
So far we have fitted a simple model considering just “Status”, but in reality we want to model the effects of both “Status” and “Cell Type”.
Let’s start with the model with only main effects - an additive model with no interaction. The main assumption here is that the effect of the status on gene expression is the same in both types of cells.
First we will recapitulate the above steps to generate a new DESeq2 object with the additive model. Then we will extract the results table for the contrast for “lactate v virgin”.
- Load the raw data.
Remember that we would like R to use ‘Virgin’ in the intercept of the mode, so we need to transform the Status into a factor in the sample sheet and set “Virgin” to be the first level.load("Robjects/preprocessing.RData") sampleinfo$Status <- factor(sampleinfo$Status, levels = c("virgin", "pregnant", "lactate"))
- Create the model
design <- as.formula(~ CellType + Status)
- Then build the DESeq from the raw data, the sample meta data and the model
ddsObj.raw <- DESeqDataSetFromMatrix(countData = countdata, colData = sampleinfo, design = design)
- Run the DESeq2 analysis
ddsObj <- DESeq(ddsObj.raw)
- Extract the default contrast - Lacate v Virgin
resLvV <- results(ddsObj, alpha=0.05)
Challenge 1
- Obtain results for luminal vs basal. Call the new results object
resLvB
.- How many significantly upregulated genes are there?
Challenge 2 - Contrasts
Suppose we want to find genes that are differentially expressed between pregnant and lactate. We don’t have a parameter that explicitly will allow us to test that hypothesis. We need to provide a contrast.
Look at the help page for
results
(?results
) and read about thecontrast
argument (also look at Example 1 at the bottom of the help page)Use the
contrast
argument to extract the results table for “pregnant v lactate”.
So far we have modelled gene expression as a function of status and cell type with an additive model. It is possible that the two factors interact such that differences in gene expression between different cell types are not the same across all three mouse statuses:
Let’s plot a PCA from vst
transformed data. Can you anticipate if the interaction term will be important?
vstcounts <- vst(ddsObj.raw, blind=TRUE)
plotPCA(vstcounts, intgroup=c("Status", "CellType"))
Maybe, but we can’t be sure so we need a way to compare the two models.
Let’s take a simple example to start with.
Suppose we thought that maybe status
were irrelevant and really the only differences might be between cell types. We could fit a simpler model and this would give us more degrees of freedom and therefore more power, but how would we know if it was a better model of not?
We can compare two models using the “likelihood ratio test” (LRT).
To do so we provide the LRT with a simpler model (one with less parameters) than the one currently be used.
Currently ddsObj
is using the model ~CellType + Status
. Here we want to compare to a model without the Status
parameter: ~CellType
# create the simpler model
design.reduced <- as.formula(~ CellType )
# Compare the two designs
ddsObjC <- DESeq(ddsObj, test="LRT", reduced=design.reduced)
## using pre-existing size factors
## estimating dispersions
## found already estimated dispersions, replacing these
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
resCvCS <- results(ddsObjC)
resCvCS
## log2 fold change (MLE): Status lactate vs virgin
## LRT p-value: '~ CellType + Status' vs '~ CellType'
## DataFrame with 25369 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSMUSG00000051951 194.697167 0.696569 0.755104 4.53253 0.1036987
## ENSMUSG00000102331 0.557693 1.842911 2.747250 0.40384 0.8171603
## ENSMUSG00000025900 3.838125 -2.862379 1.268975 5.80154 0.0549808
## ENSMUSG00000025902 52.523551 -0.698029 0.413168 2.94369 0.2295016
## ENSMUSG00000098104 0.686876 0.217999 1.693518 1.14087 0.5652807
## padj
## <numeric>
## ENSMUSG00000051951 0.190999
## ENSMUSG00000102331 NA
## ENSMUSG00000025900 0.116046
## ENSMUSG00000025902 0.352802
## ENSMUSG00000098104 NA
## [ reached getOption("max.print") -- omitted 6 rows ]
The second line of the results output shows us the test we are doing:
LRT p-value: '~ CellType + Status' vs '~ CellType'
The null hypothesis is that there is no significant difference between the two models, i.e. the simpler model is sufficient to explain the variation in gene expression between the samples. If the the adjusted p-value for a gene passes a significance threshold (e.g. padj < 0.05) then we should consider using the more complex model for this gene.
sum(resCvCS$padj < 0.05, na.rm=TRUE)
## [1] 7932
We can see that for 7932 genes the more complex model does fit the data better. This would suggest that we should be including both terms in the linear model.
Although we have a result for each gene, in practice we should choose one model or the other and apply it to all genes.
When we looked at the PCA it did seem that an interaction model might be warranted. Let’s test that.
Challenge
- Create a new DESeq2 object using a model with an interaction between CellType and Status. The model formula should be
~CellType + Status + CellType:Status
where
CellType:Status
is the parameter for the interaction beteween CellType and Status.
Look back at the code at the beginning of Exercise 1 to remind you how to do this.Note that
*
can be used as shortcut to add the interaction term, e.g.~CellType * Status
, however, writing out in long form is clearer here.
Use the LRT to compare this to the simpler additive model (
~CellType + Status
)Extract a table of results using
results
.Questions:
For how many genes is interaction model a better fit?
Do you think we need to use the interaction model for this analysis?
Do you think the experimental design is good enough to include the interaction?
If not, why and how would you change it?
save(resLvV, ddsObj, sampleinfo, file="results/DE.RData")