Load the data

First load the packages we need. DESeq2 for the actual differential gene expression analysis and the tidyverse for data handling.

library(DESeq2)
library(tidyverse)

We have already read in the results from Salmon into R and created a txi object for you, which we then saved into an “rds” file. We can now load the txi from that file to start the differential expression analysis. We will also need the sample meta data sheet

txi <- readRDS("RObjects/txi.rds")
sampleinfo <- read_tsv("data/samplesheet_corrected.tsv", col_types = "cccc")

First check the metadata.

sampleinfo
## # A tibble: 12 × 4
##   SampleName Replicate Status   TimePoint
##   <chr>      <chr>     <chr>    <chr>    
## 1 SRR7657878 1         Infected d11      
## 2 SRR7657881 2         Infected d11      
## 3 SRR7657880 3         Infected d11      
## 4 SRR7657874 1         Infected d33      
## # ℹ 8 more rows

It is important to be sure that the order of the samples in rows in the sample meta data table matches the order of the columns in the data matrix - DESeq2 will not check this. If the order does not match you will not be running the analyses that you think you are.

all(colnames(txi$counts) == sampleinfo$SampleName)
## [1] TRUE

DESeq2 and other options

Now that we are happy that the quality of the data looks good, we can proceed to testing for differentially expressed genes. There are a number of packages to analyse RNA-Seq data. Most people use DESeq2 (Love, Huber, and Anders 2014) or edgeR (Robinson, McCarthy, and Smyth 2010; McCarthy, Chen, and Smyth 2012). There is also the option to use the limma package and transform the counts using its voom function .They are all equally valid approaches (Ritchie et al. 2015). There is an informative and honest blog post here by Mike Love, one of the authors of DESeq2, about deciding which to use.

We will use DESeq2 for the rest of this practical.

The model formula and design matrices

Creating the design model formula

First we need to create a design model formula for our analysis. DESeq2 will use this to generate the model matrix, as we have seen in the linear models lecture.

We have two variables in our experiment: “Status” and “Time Point”.

We will fit two models under two assumptions: no interaction and interaction of these two factors, however, to demonstrate the how DESeq2 is used we will start with a simple model which considers Status but ignores Time Point.

First, create a variable containing the model using standard R ‘formula’ syntax.

simple.model <- as.formula(~ Status)

What does this look like as a model matrix?

model.matrix(simple.model, data = sampleinfo)
##    (Intercept) StatusUninfected
## 1            1                0
## 2            1                0
## 3            1                0
## 4            1                0
## 5            1                1
## 6            1                0
## 7            1                1
## 8            1                1
## 9            1                1
## 10           1                1
## 11           1                0
## 12           1                1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$Status
## [1] "contr.treatment"

The intercept has been set automatically to the group in the factor that is alphabetically first: Infected.

It would be nice if Uninfected were the base line/intercept. To get R to use Uninfected as the intercept we need to use a factor. Let’s set factor levels on Status to use Uninfected as the intercept.

sampleinfo <- mutate(sampleinfo, Status = fct_relevel(Status, "Uninfected"))
model.matrix(simple.model, data = sampleinfo)
##    (Intercept) StatusInfected
## 1            1              1
## 2            1              1
## 3            1              1
## 4            1              1
## 5            1              0
## 6            1              1
## 7            1              0
## 8            1              0
## 9            1              0
## 10           1              0
## 11           1              1
## 12           1              0
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$Status
## [1] "contr.treatment"

Build a DESeq2DataSet

We don’t actually need to pass DESeq2 the model matrix, instead we pass it the design formula and the sampleinfo it will build the matrix itself.

# create the DESeqDataSet object
ddsObj.raw <- DESeqDataSetFromTximport(txi = txi,
                                       colData = sampleinfo,
                                       design = simple.model)
## using counts and average transcript lengths from tximport

When we summarised the counts to gene level, tximport also calculated an average transcript length for each gene for each sample. For a given gene the average transcript length may vary between samples if different samples are using alternative transcripts. DESeq2 will incorporate this into its “normalisation”.

Filter out the unexpressed genes

Whilst DESeq2 will perform it’s own filtering of the genes, we should still filter out genes that uninformative.

keep <- rowSums(counts(ddsObj.raw)) > 5
ddsObj.filt <- ddsObj.raw[keep, ]

Differential expression analysis with DESeq2

The DESeq2 work flow

The main DESeq2 work flow is carried out in 3 steps:

estimateSizeFactors

First, Calculate the “median ratio” normalisation size factors for each sample and adjust for average transcript length on a per gene per sample basis.

ddsObj <- estimateSizeFactors(ddsObj.filt)
## using 'avgTxLength' from assays(dds), correcting for library size

Let’s have a look at what that did

DESeq2 has calculated a normalizsation factor for each gene for each sample.

normalizationFactors(ddsObj.filt)
## NULL
normalizationFactors(ddsObj)
##                    SRR7657878 SRR7657881 SRR7657880 SRR7657874 SRR7657882
## ENSMUSG00000000001  0.9650391 0.96894491 0.94927826  0.9018189 1.20992201
## ENSMUSG00000000028  1.1565743 1.00859533 0.94221210  0.8798304 1.22726399
## ENSMUSG00000000037  0.9372735 1.11991330 0.52624731  1.0355843 0.77648936
##                    SRR7657872 SRR7657877 SRR7657876 SRR7657879 SRR7657883
## ENSMUSG00000000001  0.9615574  1.1078271 1.01950314 0.95990882  0.9028895
## ENSMUSG00000000028  1.0086842  1.2535886 0.95400723 0.77315414  0.7550835
## ENSMUSG00000000037  0.8718128  2.2493164 0.77345594 0.67601042  0.5549084
##                    SRR7657873 SRR7657875
## ENSMUSG00000000001  1.0472683 1.04743598
## ENSMUSG00000000028  1.0609078 1.12777706
## ENSMUSG00000000037  1.8562184 2.13169578
##  [ reached getOption("max.print") -- omitted 20088 rows ]

We can use plotMA from limma to look at the of these normalisation factors on data in an MA plot. Let’s look at SRR7657882, the fifth column, which has the largest normalisation factors.

logcounts <- log2(counts(ddsObj, normalized = FALSE)  + 1)

limma::plotMA(logcounts, array = 5, ylim = c(-5, 5))
abline(h = 0, col = "red")

logNormalizedCounts <- log2(counts(ddsObj, normalized = TRUE)  + 1)

limma::plotMA(logNormalizedCounts, array = 5, ylim = c(-5, 5))
abline(h = 0, col = "red")

DESeq2 doesn’t actually normalise the counts, it uses raw counts and includes the normalisation factors in the modeling as an “offset”. Please see the DESeq2 documentation if you’d like more details on exactly how they are incorporated into the algorithm. For practical purposes we can think of it as a normalisation.

estimateDispersions

Next we need to estimate the dispersion parameters for each gene.

ddsObj <- estimateDispersions(ddsObj)
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates

We can plot all three sets of dispersion estimates. It is particularly important to do this if you change any of the default parameters for this step.

plotDispEsts(ddsObj)

nbinomWaldTest

Finally, apply Negative Binomial GLM fitting and calculate Wald statistics.

ddsObj <- nbinomWaldTest(ddsObj)

The DESeq command

In practice the 3 steps above can be performed in a single step using the DESeq wrapper function. Performing the three steps separately is useful if you wish to alter the default parameters of one or more steps, otherwise the DESeq function is fine.

ddsObj <- DESeq(ddsObj.filt)
## estimating size factors
## using 'avgTxLength' from assays(dds), correcting for library size
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing

Generate a results table

We can generate a table of differential expression results from the DDS object using the results function of DESeq2.

results.simple <- results(ddsObj, alpha = 0.05)
results.simple
## log2 fold change (MLE): Status Infected vs Uninfected 
## Wald test p-value: Status Infected vs Uninfected 
## DataFrame with 20091 rows and 6 columns
##                      baseMean log2FoldChange     lfcSE       stat      pvalue
##                     <numeric>      <numeric> <numeric>  <numeric>   <numeric>
## ENSMUSG00000000001 1102.56094    -0.00802952  0.102877  -0.078050    0.937788
## ENSMUSG00000000028   58.60055     0.30498077  0.254312   1.199239    0.230435
## ENSMUSG00000000037   49.23586    -0.05272685  0.416862  -0.126485    0.899348
## ENSMUSG00000000049    7.98789     0.38165132  0.644869   0.591827    0.553966
## ENSMUSG00000000056 1981.00402    -0.16921845  0.128542  -1.316449    0.188024
##                           padj
##                      <numeric>
## ENSMUSG00000000001    0.975584
## ENSMUSG00000000028    0.480598
## ENSMUSG00000000037    0.961314
## ENSMUSG00000000049    0.772123
## ENSMUSG00000000056    0.426492
##  [ reached getOption("max.print") -- omitted 6 rows ]

Assess the number of differentially expressed genes

Now we have made our results table using our simple model, let have a look at which genes are changing and how many pass our 0.05 threshold.

sum(results.simple$padj < 0.05)
## [1] NA

Some of the padj values are NA. This is due to “Independent Filtering”

sum(is.na(results.simple$padj))
## [1] 1584

To count the number of genes that are signficant we need to exclude the NAs:

sum(results.simple$padj < 0.05, na.rm = TRUE)
## [1] 2884

We could also check for the number of upregulated genes by also testing the fold change:

sum(results.simple$padj < 0.05 & results.simple$log2FoldChange > 0,
    na.rm = TRUE)
## [1] 1879

Independent filtering

From DESeq2 manual: “The results function of the DESeq2 package performs independent filtering by default using the mean of normalized counts as a filter statistic. A threshold on the filter statistic is found which optimizes the number of adjusted p values lower than a [specified] significance level”.

The default significance level for independent filtering is 0.1, however, you should set this to the FDR cut off you are planning to use. We will use 0.05 - this was the purpose of the alpha argument in the previous command.

Earlier we said that there it is not necessary to carefully to pre-filter the genes as DESeq2 will do this through a process it calls ‘independent filtering’. The genes with NA are the ones DESeq2 has filtered out.

Fitting an additive model

So far we have fit a simple model considering just “Status”, but in reality we want to model the effects of both “Status” and “Time Point”.

Let’s start with the model with only main effects - an additive model with no interaction. The main assumption here is that the effects of Status and the effects of Time Point are indepedent.

additive.model <- as.formula(~ TimePoint + Status)
ddsObj.raw <- DESeqDataSetFromTximport(txi = txi,
                                       colData = sampleinfo,
                                       design = additive.model)
## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors
## using counts and average transcript lengths from tximport
keep <- rowSums(counts(ddsObj.raw)) > 5
ddsObj.filt <- ddsObj.raw[keep, ]
ddsObj <- DESeq(ddsObj.filt)
## estimating size factors
## using 'avgTxLength' from assays(dds), correcting for library size
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
results.additive <- results(ddsObj, alpha = 0.05)

The default contrast of results

Let’s look at the results.

results.additive
## log2 fold change (MLE): Status Infected vs Uninfected 
## Wald test p-value: Status Infected vs Uninfected 
## DataFrame with 20091 rows and 6 columns
##                      baseMean log2FoldChange     lfcSE       stat      pvalue
##                     <numeric>      <numeric> <numeric>  <numeric>   <numeric>
## ENSMUSG00000000001 1102.56094     -0.0110965  0.106195  -0.104492    0.916779
## ENSMUSG00000000028   58.60055      0.3007930  0.265626   1.132391    0.257470
## ENSMUSG00000000037   49.23586     -0.0481414  0.429685  -0.112039    0.910793
## ENSMUSG00000000049    7.98789      0.4110498  0.656171   0.626437    0.531028
## ENSMUSG00000000056 1981.00402     -0.1907691  0.119694  -1.593809    0.110979
##                           padj
##                      <numeric>
## ENSMUSG00000000001    0.967428
## ENSMUSG00000000028    0.514578
## ENSMUSG00000000037    0.965220
## ENSMUSG00000000049    0.757304
## ENSMUSG00000000056    0.314608
##  [ reached getOption("max.print") -- omitted 6 rows ]

The results function has returned the results for the contrast “Infected vs Uninfected”. Let’s have a look at the model matrix to understand why DESeq2 has given us this particular contrast.

model.matrix(additive.model, data = sampleinfo)
##    (Intercept) TimePointd33 StatusInfected
## 1            1            0              1
## 2            1            0              1
## 3            1            0              1
## 4            1            1              1
## 5            1            1              0
## 6            1            1              1
## 7            1            0              0
## 8            1            0              0
## 9            1            0              0
## 10           1            1              0
## 11           1            1              1
## 12           1            1              0
## attr(,"assign")
## [1] 0 1 2
## attr(,"contrasts")
## attr(,"contrasts")$TimePoint
## [1] "contr.treatment"
## 
## attr(,"contrasts")$Status
## [1] "contr.treatment"

By default, results has returned the contrast encoded by the final column in the model matrix. DESeq2 has the command resultsNames that allows us to view the contrasts that are available directly from the DESeq2 object.

resultsNames(ddsObj)
## [1] "Intercept"                     "TimePoint_d33_vs_d11"         
## [3] "Status_Infected_vs_Uninfected"

Let’s just rename results.additive so that we know which contrast results it contains.

results.InfectedvUninfected <- results.additive
rm(results.additive)

Let’s get the top 100 genes by adjusted p-value

topGenesIvU <- as.data.frame(results.InfectedvUninfected) %>%
    rownames_to_column("GeneID") %>%
    top_n(100, wt = -padj)
topGenesIvU

Extracting other contrasts

If we want a different contrast we can just pass the results function the name of the contrast, as given by resultsNames(ddsObj).

resultsNames(ddsObj)
## [1] "Intercept"                     "TimePoint_d33_vs_d11"         
## [3] "Status_Infected_vs_Uninfected"
results.d33vd11 <- results(ddsObj, name = "TimePoint_d33_vs_d11", alpha = 0.05)
results.d33vd11
## log2 fold change (MLE): TimePoint d33 vs d11 
## Wald test p-value: TimePoint d33 vs d11 
## DataFrame with 20091 rows and 6 columns
##                      baseMean log2FoldChange     lfcSE      stat    pvalue
##                     <numeric>      <numeric> <numeric> <numeric> <numeric>
## ENSMUSG00000000001 1102.56094      0.0604013  0.106194  0.568780 0.5695052
## ENSMUSG00000000028   58.60055     -0.0630899  0.265577 -0.237558 0.8122240
## ENSMUSG00000000037   49.23586     -0.0771281  0.429688 -0.179498 0.8575466
## ENSMUSG00000000049    7.98789      0.4439955  0.656139  0.676679 0.4986096
## ENSMUSG00000000056 1981.00402     -0.2009862  0.119694 -1.679174 0.0931182
##                         padj
##                    <numeric>
## ENSMUSG00000000001  0.938969
## ENSMUSG00000000028  0.978807
## ENSMUSG00000000037  0.982054
## ENSMUSG00000000049  0.923210
## ENSMUSG00000000056  0.666649
##  [ reached getOption("max.print") -- omitted 6 rows ]
sum(results.d33vd11$padj < 0.05, na.rm = TRUE)
## [1] 109

Should we be using the interaction model?

So far we have modeled gene expression as a function of Status and Time Point with an additive model. Realistically, we expect the two factors interact such that differences in gene expression between infected and uninfected mice are not the same at different time point: pdf version

Let’s plot a PCA from vst transformed data. Can you anticipate if the interaction term will be important?

vstcounts <- vst(ddsObj.raw, blind = TRUE)
## using 'avgTxLength' from assays(dds), correcting for library size
plotPCA(vstcounts,  intgroup = c("Status", "TimePoint"))
## using ntop=500 top features by variance

In this case we can, from both the PCA and our understanding of the biology, be fairly certain that the interaction model is the appropriate model to use. This is not always the case and so there are methods for comparing models if you are not sure.

Using an interaction model

The model formula should be:

~TimePoint + Status + TimePoint:Status where TimePoint:Status is the parameter for the interaction beteween TimePoint and Status.

Note that * can be used as shortcut to add the interaction term, e.g. ~TimePoint * Status, however, writing out in long form is clearer here.

interaction.model <- as.formula(~ TimePoint + Status + TimePoint:Status)
ddsObj.raw <- DESeqDataSetFromTximport(txi = txi,
                                       colData = sampleinfo,
                                       design = interaction.model)
## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors
## using counts and average transcript lengths from tximport
keep <- rowSums(counts(ddsObj.raw)) > 5
ddsObj.filt <- ddsObj.raw[keep, ]

Run the statistical analysis using the DESeq command and create a new analysis object called ddsObj.interaction.

ddsObj.interaction <- DESeq(ddsObj.filt)
## estimating size factors
## using 'avgTxLength' from assays(dds), correcting for library size
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing

Extracting specific contrasts from an interactive model

If we are settled on using the interaction model, then we need to extract our contrasts with reference to this. That is, we can no longer ask the general question “What is the difference in gene expression between Infected and Uninfected?”, but must rather ask two quesions:

  • “What is the difference in gene expression between Infected and Uninfected at 11 days post infection?”
  • “What is the difference in gene expression between Infected and Uninfected at 33 days post infection?”

If we view the resultsNames for the interaction model, we can see the intercept is Uninfected and 11 days post infection:

resultsNames(ddsObj.interaction)
## [1] "Intercept"                     "TimePoint_d33_vs_d11"         
## [3] "Status_Infected_vs_Uninfected" "TimePointd33.StatusInfected"

The main effect Status_Infected_vs_Uninfected is therefore the difference between Infected and Uninfected at 11 days post infection.

results.interaction.11 <- results(ddsObj.interaction,
                                  name = "Status_Infected_vs_Uninfected",
                                  alpha = 0.05)

To get the results for Infected versus Uninfected at 33 days post infection, we would need to add the interaction term TimePointd33.StatusInfected.

In the help page for results it shows us how to do this with a contrast in example 3.

results.interaction.33 <- results(ddsObj.interaction,
          contrast = list(c("Status_Infected_vs_Uninfected", "TimePointd33.StatusInfected")),
                                  alpha = 0.05)

Number of genes with padj < 0.05 for Test v Control at day 11:

sum(results.interaction.11$padj < 0.05, na.rm = TRUE)
## [1] 1072

Number of genes with padj < 0.05 for Test v Control at day 33:

sum(results.interaction.33$padj < 0.05, na.rm = TRUE)
## [1] 2782

We can see that there is a strong difference in the effects of infection on gene expression between days 11 and 33.


References

Love, Michael I., Wolfgang Huber, and Simon Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for RNA-Seq Data with DESeq2.” Genome Biology 15: 550. https://doi.org/10.1186/s13059-014-0550-8.
McCarthy, Davis J, Yunshun Chen, and Gordon K Smyth. 2012. “Differential Expression Analysis of Multifactor RNA-Seq Experiments with Respect to Biological Variation.” Nucleic Acids Research 40 (10): 4288–97. https://doi.org/10.1093/nar/gks042.
Ritchie, Matthew E, Belinda Phipson, Di Wu, Yifang Hu, Charity W Law, Wei Shi, and Gordon K Smyth. 2015. limma Powers Differential Expression Analyses for RNA-Sequencing and Microarray Studies.” Nucleic Acids Research 43 (7): e47. https://doi.org/10.1093/nar/gkv007.
Robinson, Mark D, Davis J McCarthy, and Gordon K Smyth. 2010. “edgeR: A Bioconductor Package for Differential Expression Analysis of Digital Gene Expression Data.” Bioinformatics 26 (1): 139–40. https://doi.org/10.1093/bioinformatics/btp616.