So far we have fitted a simple model considering just “Status”, but in reality we want to model the effects of both “Status” and “Time Point”.
Let’s start with the model with only main effects - an additive model with no interaction. The main assumption here is that the effects of Status and the effects of Time Point are indepedent.
Recapitulate the above steps to generate a new DESeq2 object with the additive model. Then we will extract the results table as above.
Load the raw data, remembering to set the factor on the Status so that
“Uninfected” will be set as the intercept:
txi <- readRDS("RObjects/txi.rds")
sampleinfo <- read_tsv("data/samplesheet_corrected.tsv", col_types="cccc") %>%
mutate(Status = fct_relevel(Status, "Uninfected"))
Create the model:
additive.model <- as.formula(~ TimePoint + Status)
Then build the DESeq from the raw data, the sample meta data and the model:
ddsObj.raw <- DESeqDataSetFromTximport(txi = txi,
colData = sampleinfo,
design = additive.model)
## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors
## using counts and average transcript lengths from tximport
Filter the data set:
keep <- rowSums(counts(ddsObj.raw)) > 5
ddsObj.filt <- ddsObj.raw[keep,]
You are now ready to run the differential gene expression analysis Run the DESeq2 analysis
- Run the size factor estimation, dispersion estimation and modelling steps using the
DESeq
command as above.
ddsObj <- DESeq(ddsObj.filt)
## estimating size factors
## using 'avgTxLength' from assays(dds), correcting for library size
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
- Extract the default contrast using the
results
command into a new object calledresults.additive
results.additive <- results(ddsObj, alpha=0.05)
results.additive
## log2 fold change (MLE): Status Infected vs Uninfected
## Wald test p-value: Status Infected vs Uninfected
## DataFrame with 20091 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSMUSG00000000001 1102.56094 -0.0110965 0.106195 -0.104492 0.916779
## ENSMUSG00000000028 58.60055 0.3007930 0.265626 1.132391 0.257470
## ENSMUSG00000000037 49.23586 -0.0481414 0.429685 -0.112039 0.910793
## ENSMUSG00000000049 7.98789 0.4110498 0.656171 0.626437 0.531028
## ENSMUSG00000000056 1981.00402 -0.1907691 0.119694 -1.593809 0.110979
## padj
## <numeric>
## ENSMUSG00000000001 0.967428
## ENSMUSG00000000028 0.514578
## ENSMUSG00000000037 0.965220
## ENSMUSG00000000049 0.757304
## ENSMUSG00000000056 0.314608
## [ reached getOption("max.print") -- omitted 6 rows ]
- What contrast are these results for? If you have constructed the model correctly, then it should be the same as previous
results.simple
Again this results table is for the contrast Infected v Uninfected.- How many genes have an adjusted p-value of less than 0.05
sum(results.additive$padj < 0.05, na.rm = TRUE)
## [1] 2766
If we want a different contrast we can just pass the
results
function the name of the contrast, as given byresultsNames(ddsObj)
. Look at the help page for theresults
command to see how to do this.
- Retrieve the results for the contrast of d33 versus d11.
results.d33vd11 <- results(ddsObj, name= "TimePoint_d33_vs_d11", alpha=0.05)
- How many differentially expressed genes are there at FDR < 0.05?
sum(results.d33vd11$padj < 0.05, na.rm = TRUE)
## [1] 109
When we looked at the PCA it did seem that an interaction model might be warranted. Let’s test that.
Create a new DESeq2 object using a model with an interaction between TimePoint and Status. The model formula should be
~TimePoint + Status + TimePoint:Status
where
TimePoint:Status
is the parameter for the interaction beteween TimePoint and Status.Note that
*
can be used as shortcut to add the interaction term, e.g.~TimePoint * Status
, however, writing out in long form is clearer here.Remember to filter to remove uninformative genes.
interaction.model <- as.formula(~ TimePoint * Status)
ddsObj.raw <- DESeqDataSetFromTximport(txi = txi,
colData = sampleinfo,
design = interaction.model)
## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors
## using counts and average transcript lengths from tximport
keep <- rowSums(counts(ddsObj.raw)) > 5
ddsObj.filt <- ddsObj.raw[keep,]
- Run the statistical analysis using the
DESeq
command and create a new analysis object calledddsObj.interaction
.
ddsObj.interaction <- DESeq(ddsObj.filt)
## estimating size factors
## using 'avgTxLength' from assays(dds), correcting for library size
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
- Use the LRT to compare this to the simpler additive model (
~TimePoint + Status
)
ddsObj.LRT <- DESeq(ddsObj.interaction, test="LRT", reduced=additive.model)
## using pre-existing normalization factors
## estimating dispersions
## found already estimated dispersions, replacing these
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
results.Interaction_v_Additive <- results(ddsObj.LRT)
- Extract a table of results using
results
. For how many genes is interaction model a better fit?
table(results.Interaction_v_Additive$padj < 0.05)
##
## FALSE TRUE
## 16474 455
Let’s investigate the uninfected mice
- Extract the results for d33 v d11 for Uninfected mice. The the intercept is Uninfected mice at 11 days post infection, so the main effect
TimePoint_d33_vs_d11
is the result that we want.
results.d33_v_d11_uninfected <- results(ddsObj.interaction,
name="TimePoint_d33_vs_d11",
alpha = 0.05)
How many genes have an adjusted p-value less than 0.05?
table(results.d33_v_d11_uninfected$padj < 0.05)
##
## FALSE TRUE
## 20043 1
Is this remarkable?
Maybe not. Do we really expect vast gene expression differences between the brains of mice that are slightly older than one another? It is possible that there could have been confounding factors, such as changes in enviromental conditions such as temperature or feeding regime, that may have effected gene expression. In which case it was important to set the experiment up with control for both time points.
- Extract the results for d33 v d11 for Infected mice. The the intercept is Uninfected mice at 11 days post infection, so the main effect
TimePoint_d33_vs_d11
is the result that we want.
results.d33_v_d11_infected <- results(ddsObj.interaction,
contrast = list(c("TimePoint_d33_vs_d11", "TimePointd33.StatusInfected")),
alpha = 0.05)
How many genes have an adjusted p-value less than 0.05?
table(results.d33_v_d11_infected$padj < 0.05)
##
## FALSE TRUE
## 16573 1134
Do these results suggest another approach to analysing this data set?
Could we possibly treat the six uninfected samples as a single group with six replicates and then just have 1 factor with 3 levels: Control, d11.Infected, d33.Infected? This is really a biological question and not a statistical one.