Recap of pre-processing

The previous section walked-through the pre-processing and transformation of the count data. Here, for completeness, we list the minimal steps required to process the data prior to differential expression analysis.

# Read the sample information into a data frame
sampleinfo <- read_tsv("data/SampleInfo.Corrected.txt")

# Read the data into R
seqdata <- read_tsv("data/GSE60450_Lactation.featureCounts", comment = "#")

# Transform the data to matrix of counts
countdata <- as.data.frame(seqdata) %>% 
    column_to_rownames("Geneid") %>% # turn the geneid column into rownames
    rename_all(str_remove, ".bam") %>% # remove the ".bam" from the column names
    select(sampleinfo$Sample) %>% # keep sample columns using sampleinfo
    as.matrix()

# filter the data to remove genes with few counts
keep <- rowSums(countdata) > 5
countdata <- countdata[keep,]

Load the data

Alternatively we can load the `objects with the RData file we created in the pre-processing tutorial.

First load the packages we need.

library(tidyverse)
library(DESeq2)
# load the RData object we created in the previous session
load("Robjects/preprocessing.RData")
ls()
## [1] "countdata"  "sampleinfo"
dim(countdata)
## [1] 25369    12
sampleinfo
## # A tibble: 12 x 5
##    FileName    Sample  CellType Status   Group           
##    <chr>       <chr>   <chr>    <chr>    <chr>           
##  1 MCL1.DG.bam MCL1.DG basal    virgin   basal.virgin    
##  2 MCL1.DH.bam MCL1.DH basal    virgin   basal.virgin    
##  3 MCL1.DI.bam MCL1.DI basal    pregnant basal.pregnant  
##  4 MCL1.DJ.bam MCL1.DJ basal    pregnant basal.pregnant  
##  5 MCL1.DK.bam MCL1.DK basal    lactate  basal.lactate   
##  6 MCL1.DL.bam MCL1.DL basal    lactate  basal.lactate   
##  7 MCL1.LA.bam MCL1.LA luminal  virgin   luminal.virgin  
##  8 MCL1.LB.bam MCL1.LB luminal  virgin   luminal.virgin  
##  9 MCL1.LC.bam MCL1.LC luminal  pregnant luminal.pregnant
## 10 MCL1.LD.bam MCL1.LD luminal  pregnant luminal.pregnant
## 11 MCL1.LE.bam MCL1.LE luminal  lactate  luminal.lactate 
## 12 MCL1.LF.bam MCL1.LF luminal  lactate  luminal.lactate

The model formula and design matrices

Now that we are happy that the quality of the data looks good, we can proceed to testing for differentially expressed genes. There are a number of packages to analyse RNA-Seq data. Most people use DESeq2 or edgeR. They are both equally applicable. There is an informative and honest blog post here by Mike Love, one of the authors of DESeq2, about deciding which to use.

We will use DESeq2 for the rest of this practical.

Create a DESeqDataSet object with the raw data

Creating the design model formula

First we need to create a design model formula for our analysis. DESeq2 will use this to generate the model matrix, as we have seen in the linear models lecture.

We have two variables: “status”" and “cell type”. We will fit two models under two assumptions: no interaction and interaction of these two factors.

Let’s start with the model with only main effects, that is no interaction. The main assumption here is that the effect of the status is the same in both type of cells.

# Use the standard R 'formula' syntax for an additive model
design <- as.formula(~ CellType + Status)

What does this look like as a model matrix?

modelMatrix <- model.matrix(design, data = sampleinfo)
modelMatrix
##    (Intercept) CellTypeluminal Statuspregnant Statusvirgin
## 1            1               0              0            1
## 2            1               0              0            1
## 3            1               0              1            0
## 4            1               0              1            0
## 5            1               0              0            0
## 6            1               0              0            0
## 7            1               1              0            1
## 8            1               1              0            1
## 9            1               1              1            0
## 10           1               1              1            0
## 11           1               1              0            0
## 12           1               1              0            0
## attr(,"assign")
## [1] 0 1 2 2
## attr(,"contrasts")
## attr(,"contrasts")$CellType
## [1] "contr.treatment"
## 
## attr(,"contrasts")$Status
## [1] "contr.treatment"

It would be nice if virgin were the base line/intercept. To get R to use virgin as the intercept we need to use a factor. Let’s set factor levels on Status to use virgin as the intercept.

sampleinfo$Status <- factor(sampleinfo$Status, 
                            levels = c("virgin", "pregnant", "lactate"))
modelMatrix <- model.matrix(design, data = sampleinfo)
modelMatrix
##    (Intercept) CellTypeluminal Statuspregnant Statuslactate
## 1            1               0              0             0
## 2            1               0              0             0
## 3            1               0              1             0
## 4            1               0              1             0
## 5            1               0              0             1
## 6            1               0              0             1
## 7            1               1              0             0
## 8            1               1              0             0
## 9            1               1              1             0
## 10           1               1              1             0
## 11           1               1              0             1
## 12           1               1              0             1
## attr(,"assign")
## [1] 0 1 2 2
## attr(,"contrasts")
## attr(,"contrasts")$CellType
## [1] "contr.treatment"
## 
## attr(,"contrasts")$Status
## [1] "contr.treatment"

Build a DESeq2DataSet

We don’t actually need to pass DESeq2 the model matrix, instead we pass it the design formula and the sampleinfo it will build the matrix itself.

# create the DESeqDataSet object
ddsObj.raw <- DESeqDataSetFromMatrix(countData = countdata,
                                     colData = sampleinfo,
                                     design = design)
## converting counts to integer mode
## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors

Data exploration

Let’s plot a PCA from vst transformed data. Can you anticipate if the interaction term will be important?

vstcounts <- vst(ddsObj.raw, blind=TRUE)
plotPCA(vstcounts, intgroup=c("Status", "CellType"))

Differential expression analysis with DESeq2

The DESeq2 work flow

The main DESeq2 work flow is carried out in 3 steps:

First, Calculate the “median ratio” normalisation size factors…

ddsObj <- estimateSizeFactors(ddsObj.raw)

… then estimate dispersion …

ddsObj <- estimateDispersions(ddsObj)
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates

… finally, apply Negative Binomial GLM fitting and calculate Wald statistics

ddsObj <- nbinomWaldTest(ddsObj)

The DESeq command

In practice the 3 steps above can be performed in a single step using the DESeq wrapper function. Performing the three steps separately is useful if you wish to alter the default parameters of one or more steps, otherwise the DESeq function is fine.

# Run DESeq
ddsObj <- DESeq(ddsObj.raw)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing

Generate a results table

We can generate a table of differential expression results from the DDS object using the results function of DESeq2.

res <- results(ddsObj, alpha=0.05)
head(res)
## log2 fold change (MLE): Status lactate vs virgin 
## Wald test p-value: Status lactate vs virgin 
## DataFrame with 6 rows and 6 columns
##                             baseMean     log2FoldChange             lfcSE
##                            <numeric>          <numeric>         <numeric>
## ENSMUSG00000051951  194.697166753892  0.696568728116145 0.755104145648301
## ENSMUSG00000102331 0.557692810163983   1.84291069909588  2.74725030827382
## ENSMUSG00000025900  3.83812469441752  -2.86237851407868  1.26897511437594
## ENSMUSG00000025902  52.5235511181304 -0.698029316520065 0.413167843539657
## ENSMUSG00000098104 0.686876046943845  0.217998777865932  1.69351777058105
## ENSMUSG00000103922  143.632008465288   1.82325340829652 0.521322420811451
##                                 stat               pvalue
##                            <numeric>            <numeric>
## ENSMUSG00000051951 0.922480338812205    0.356278081181113
## ENSMUSG00000102331  0.67082009001715    0.502335147557336
## ENSMUSG00000025900 -2.25566166085641   0.0240918283738428
## ENSMUSG00000025902 -1.68945702681014   0.0911318804646417
## ENSMUSG00000098104 0.128725415022445    0.897574924437044
## ENSMUSG00000103922  3.49736235295343 0.000469883126047928
##                                   padj
##                              <numeric>
## ENSMUSG00000051951   0.507656546833457
## ENSMUSG00000102331                  NA
## ENSMUSG00000025900  0.0654373609038499
## ENSMUSG00000025902   0.186215722971815
## ENSMUSG00000098104                  NA
## ENSMUSG00000103922 0.00259986740467536

Independent filtering

You will notice that some of the adjusted p-values (padj) are NA. Remember in Session 2 we said that there is no need to pre-filter the genes as DESeq2 will do this through a process it calls ‘independent filtering’. The genes with NA are the ones DESeq2 has filtered out.

From DESeq2 manual: “The results function of the DESeq2 package performs independent filtering by default using the mean of normalized counts as a filter statistic. A threshold on the filter statistic is found which optimizes the number of adjusted p values lower than a [specified] significance level”.

The default significance level for independent filtering is 0.1, however, you should set this to the FDR cut off you are planning to use. We will use 0.05 - this was the purpose of the alpha argument in the previous command.

The default contrast of results

The results function has returned the results for the contrast “lactate vs virgin”. Let’s have a look at the model matrix to understand why DESeq2 has given us this particular contrast.

modelMatrix
##    (Intercept) CellTypeluminal Statuspregnant Statuslactate
## 1            1               0              0             0
## 2            1               0              0             0
## 3            1               0              1             0
## 4            1               0              1             0
## 5            1               0              0             1
## 6            1               0              0             1
## 7            1               1              0             0
## 8            1               1              0             0
## 9            1               1              1             0
## 10           1               1              1             0
## 11           1               1              0             1
## 12           1               1              0             1
## attr(,"assign")
## [1] 0 1 2 2
## attr(,"contrasts")
## attr(,"contrasts")$CellType
## [1] "contr.treatment"
## 
## attr(,"contrasts")$Status
## [1] "contr.treatment"

By default, results has returned the contrast encoded by the final column in the model matrix. DESeq2 has the command resultsNames that allows us to view the contrasts that are available directly from the model matrix.

resultsNames(ddsObj)
## [1] "Intercept"                 "CellType_luminal_vs_basal"
## [3] "Status_pregnant_vs_virgin" "Status_lactate_vs_virgin"

Let’s just rename res so that we know which contrast results it contains.

resLvV <- res
rm(res)

If we want a different contrast we can just pass the results function the name of the design matrix column that encodes it. Let’s retrieve the results for pregant versus virgin

resPvV <- results(ddsObj, 
                  name="Status_pregnant_vs_virgin", 
                  alpha = 0.05)
resPvV
## log2 fold change (MLE): Status pregnant vs virgin 
## Wald test p-value: Status pregnant vs virgin 
## DataFrame with 25369 rows and 6 columns
##                             baseMean     log2FoldChange             lfcSE
##                            <numeric>          <numeric>         <numeric>
## ENSMUSG00000051951  194.697166753892 -0.964421914046116 0.756556436409462
## ENSMUSG00000102331 0.557692810163983  0.783160736655198  2.77052970967154
## ENSMUSG00000025900  3.83812469441752 -0.582055283538631  1.06442135081234
## ENSMUSG00000025902  52.5235511181304 -0.392164452505025 0.394880413726866
## ENSMUSG00000098104 0.686876046943845  -1.48086297595932  1.80557233483966
## ...                              ...                ...               ...
## ENSMUSG00000094915 0.608461286410795 -0.826494288924283  2.12032636483226
## ENSMUSG00000079808  6.43506750276924 -0.478860320185389 0.782310294318585
## ENSMUSG00000095041  1685.46131828261 -0.137358428113449  0.11918690809244
## ENSMUSG00000063897  443.698775654276  0.103037163491328 0.148063679075831
## ENSMUSG00000095742  665.855337042711 -0.192143905408863 0.356298129738732
##                                  stat            pvalue              padj
##                             <numeric>         <numeric>         <numeric>
## ENSMUSG00000051951  -1.27475211052749 0.202396995964576 0.416001414815931
## ENSMUSG00000102331  0.282675451528742 0.777425635905802                NA
## ENSMUSG00000025900 -0.546827892069642 0.584496978808013 0.769313106697388
## ENSMUSG00000025902 -0.993122066510698 0.320650471531356 0.546600952135141
## ENSMUSG00000098104 -0.820162641720375  0.41212339574703                NA
## ...                               ...               ...               ...
## ENSMUSG00000094915 -0.389795789286273 0.696687557648832                NA
## ENSMUSG00000079808 -0.612110467755624 0.540464672316713 0.737081937254622
## ENSMUSG00000095041  -1.15246238292309 0.249131118998283 0.470675028487924
## ENSMUSG00000063897  0.695897630900804 0.486492941082487 0.696723172519147
## ENSMUSG00000095742 -0.539278456358328 0.589694731815056 0.773487503029154

Let’s get the top 100 genes by adjusted p-value

topGenesPvV <- as.data.frame(resPvV) %>%
    rownames_to_column("GeneID") %>% 
    arrange(padj) %>% 
    head(100)
topGenesPvV
##               GeneID     baseMean log2FoldChange     lfcSE      stat
## 1 ENSMUSG00000061937 1154100.0043       7.410186 0.4205063  17.62206
## 2 ENSMUSG00000000381  384010.7324       7.059725 0.4218738  16.73421
## 3 ENSMUSG00000026417   11185.5610       5.439228 0.3686446  14.75467
## 4 ENSMUSG00000033831     460.4074      -9.311600 0.6313563 -14.74857
## 5 ENSMUSG00000022491  516162.0093       7.611417 0.5350239  14.22631
## 6 ENSMUSG00000016028     452.4390      -2.784445 0.2111341 -13.18804
##         pvalue         padj
## 1 1.668205e-69 3.247495e-65
## 2 7.383067e-63 7.186308e-59
## 3 2.870993e-49 1.529487e-45
## 4 3.142729e-49 1.529487e-45
## 5 6.291248e-46 2.449435e-42
## 6 1.028299e-39 3.336317e-36

Challenge 1

Obtain results for luminal vs basal and find the top 200 genes. Call the new results object resBvL.

Contrasts

Suppose we want to find differentially expressed genes between pregnant and lactate. We don’t have a parameter that explicitly will allow us to test that hypothesis. We need to provide a contrast.

resultsNames(ddsObj)
## [1] "Intercept"                 "CellType_luminal_vs_basal"
## [3] "Status_pregnant_vs_virgin" "Status_lactate_vs_virgin"
resPvL <- results(ddsObj,
                  contrast=c("Status", "pregnant", "lactate"), 
                  alpha = 0.05)
resPvL
## log2 fold change (MLE): Status pregnant vs lactate 
## Wald test p-value: Status pregnant vs lactate 
## DataFrame with 25369 rows and 6 columns
##                             baseMean     log2FoldChange             lfcSE
##                            <numeric>          <numeric>         <numeric>
## ENSMUSG00000051951  194.697166753892  -1.66099064216226 0.758524203103812
## ENSMUSG00000102331 0.557692810163983  -1.05974996244069  2.71496570152254
## ENSMUSG00000025900  3.83812469441752   2.28032323054005  1.30691554764151
## ENSMUSG00000025902  52.5235511181304   0.30586486401504 0.420505778447022
## ENSMUSG00000098104 0.686876046943845  -1.69886175382525  1.83914004967557
## ...                              ...                ...               ...
## ENSMUSG00000094915 0.608461286410795   1.41175123171982  2.27701125771381
## ENSMUSG00000079808  6.43506750276924  0.373754201238206 0.846710631015317
## ENSMUSG00000095041  1685.46131828261 -0.164565142000542 0.119998113860552
## ENSMUSG00000063897  443.698775654276 0.0197569041646694  0.15002706664928
## ENSMUSG00000095742  665.855337042711  0.275202416993547 0.357853594081112
##                                  stat             pvalue              padj
##                             <numeric>          <numeric>         <numeric>
## ENSMUSG00000051951  -2.18976617405963 0.0285411990950159 0.101341551801633
## ENSMUSG00000102331 -0.390336408981662  0.696287803823162                NA
## ENSMUSG00000025900   1.74481299473036  0.081017426084148 0.214587613057888
## ENSMUSG00000025902  0.727373747739295  0.466997031911468 0.669988558938463
## ENSMUSG00000098104 -0.923726148057588  0.355628914550976                NA
## ...                               ...                ...               ...
## ENSMUSG00000094915  0.620001867332561  0.535256557546821                NA
## ENSMUSG00000079808  0.441419048666043  0.658909654417067 0.810344850124042
## ENSMUSG00000095041  -1.37139773873263  0.170251001999973 0.354767489891225
## ENSMUSG00000063897  0.131688931910236   0.89523034032685 0.948789549681707
## ENSMUSG00000095742  0.769036336494552   0.44187173981442 0.648448631402245

Comparing two design models

Suppose we thought that maybe status were irrelevant and really the only differences might be between cell types. We could fit a simpler model, this would give us more degrees of freedom and therefore more power, but how would we know if it was a better model of not? We can compare the two models using the “log ratio test” (LRT).

designC <- as.formula(~ CellType )

# Compare the designs
ddsObjC <- DESeq(ddsObj, test="LRT", reduced=designC)
## using pre-existing size factors
## estimating dispersions
## found already estimated dispersions, replacing these
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
resCvCS <- results(ddsObjC)
resCvCS
## log2 fold change (MLE): Status lactate vs virgin 
## LRT p-value: '~ CellType + Status' vs '~ CellType' 
## DataFrame with 25369 rows and 6 columns
##                             baseMean     log2FoldChange             lfcSE
##                            <numeric>          <numeric>         <numeric>
## ENSMUSG00000051951  194.697166753892  0.696568728116145 0.755104145648301
## ENSMUSG00000102331 0.557692810163983   1.84291069909588  2.74725030827382
## ENSMUSG00000025900  3.83812469441752  -2.86237851407868  1.26897511437594
## ENSMUSG00000025902  52.5235511181304 -0.698029316520065 0.413167843539657
## ENSMUSG00000098104 0.686876046943845  0.217998777865932  1.69351777058105
## ...                              ...                ...               ...
## ENSMUSG00000094915 0.608461286410795  -2.23824552064411  2.18942541093988
## ENSMUSG00000079808  6.43506750276924 -0.852614521423596  0.82752938365476
## ENSMUSG00000095041  1685.46131828261 0.0272067138870933 0.119548703169558
## ENSMUSG00000063897  443.698775654276 0.0832802593266583 0.149701881553858
## ENSMUSG00000095742  665.855337042711  -0.46734632240241 0.357462273984729
##                                 stat             pvalue              padj
##                            <numeric>          <numeric>         <numeric>
## ENSMUSG00000051951  4.53253050179748  0.103698746569904 0.190999045793046
## ENSMUSG00000102331 0.403839945011132  0.817160320618606                NA
## ENSMUSG00000025900  5.80154416549107 0.0549807539733725 0.116045852621484
## ENSMUSG00000025902  2.94369054748157  0.229501600938087 0.352802064658696
## ENSMUSG00000098104   1.1408656660373  0.565280713583159                NA
## ...                              ...                ...               ...
## ENSMUSG00000094915  1.55291350083761  0.460033139170532                NA
## ENSMUSG00000079808  0.91537819988595   0.63274416582761 0.729917831924744
## ENSMUSG00000095041  2.14994647669812  0.341306889114605 0.470116342530785
## ENSMUSG00000063897 0.538020455776859  0.764135440116204  0.83376879002311
## ENSMUSG00000095742  1.59526291784212  0.450394479587144 0.573627755976036

The null hypothesis is that there is no significant difference between the two models, i.e. the simpler model is sufficient to explain the variation in gene expression between the samples. If the the adjusted p-value for a gene passes a significance threshold (e.g. padj < 0.05) then we should consider using the more complex model for this gene. In practice we would usually choose one model or the other and apply it to all genes.

Challenge 2

When we looked at the PCA it did seem that an interaction model might be warranted. Let’s test that.

  1. Fit a model with interaction.
  2. Use the LRT to compare the two models.
  3. Is the number of replicates good enough to include the interaction?
  4. Is the interaction needed in the model?

Finally save the results in a new RData object

save(resLvV, ddsObj, sampleinfo, file="results/DE.RData")