library(scater)
library(scran)
library(batchelor)
library(edgeR)
library(tidyverse)
library(patchwork)
library(DT)
library(bluster)
Source: Multi-sample comparisons of the OSCA book.
A powerful use of scRNA-seq technology lies in the design of replicated multi-condition experiments to detect changes in composition or expression between conditions. This can provide more biological insight than conventional scRNA-seq experiments involving only one biological condition, especially if we can relate population changes to specific biological states.
Differential analyses of multi-condition scRNA-seq experiments can be broadly split into two categories - differential expression (DE) and differential abundance (DA) analyses. The former tests for changes in expression between conditions for cells of the same ‘type’ that are present in both conditions, while the latter tests for changes in the composition of cell types (or states, etc.) between conditions.
Due to the nature of differencial expression we cannot us the downsampled version of the dataset for the this section so we have the full Caron dataset. This is 7 samples with all the cells. It has been QCed and the normalisation carried out as shown last week but it has not been batch corrected or clustered.
Load the SCE object:
# Read object in:
sce <- readRDS("../Robjects/07_semiprocessedCaronSamples.rds")
sce
## class: SingleCellExperiment
## dim: 28377 30773
## metadata(1): Samples
## assays(2): counts logcounts
## rownames(28377): ENSG00000238009 ENSG00000239945 ... ENSG00000278633
## ENSG00000278817
## rowData names(4): ID Symbol Type Chromosome
## colnames(30773): 1_AAACCTGAGACTTTCG-1 1_AAACCTGGTCTTCAAG-1 ...
## 12_TTTGTCATCAGTTGAC-1 12_TTTGTCATCTCGTTTA-1
## colData names(8): Sample Dataset ... total sizeFactor
## reducedDimNames(2): PCA TSNE
## mainExpName: NULL
## altExpNames(0):
As we discussed last week there appears to be a batch effect in this
data so we can correct for this in a slightly different way for
demonstartion with the correctExperiments
function. The
benefit of this function is that it does preserve the count and meta
data from the uncorrected object. We can also specify levels of the
correction so although we are saying sampleName is the batch we can tell
it that PBMMC and ETV6-RUNX1 are also different. The downside is that it
doesn’t have all the functionality of quickCorrect
eg. you
have to precompute the HVGs or default to using all genes.
set.seed(01001001)
# order based on levels of batch (sample)
# get indices of samples in the levels of sample metadata column
# that will be used to inform on batches,
sce$Sample <- factor(sce$Sample)
pbmmc_index <- grep("PBMMC", levels(sce$Sample)) %>% as.list
etv6runx1_index <- grep("ETV6-RUNX1", levels(sce$Sample)) %>% as.list
# set correcting order
mergeOrderList <- list(pbmmc_index, etv6runx1_index)
# correct samples, with fastMNN
corrected <- correctExperiments(sce,
batch = sce$Sample,
PARAM = FastMnnParam(merge.order=mergeOrderList))
corrected
## class: SingleCellExperiment
## dim: 28377 30773
## metadata(3): merge.info pca.info Samples
## assays(3): reconstructed counts logcounts
## rownames(28377): ENSG00000238009 ENSG00000239945 ... ENSG00000278633
## ENSG00000278817
## rowData names(5): rotation ID Symbol Type Chromosome
## colnames(30773): 1_AAACCTGAGACTTTCG-1 1_AAACCTGGTCTTCAAG-1 ...
## 12_TTTGTCATCAGTTGAC-1 12_TTTGTCATCTCGTTTA-1
## colData names(9): batch Sample ... total sizeFactor
## reducedDimNames(3): corrected PCA TSNE
## mainExpName: NULL
## altExpNames(0):
Using what you discovered this morning in the clustering session lets cluster this data.
clust <- clusterCells(corrected,
use.dimred = "corrected",
BLUSPARAM = SNNGraphParam(k = 57, cluster.fun = "leiden"))
corrected$Cluster <- factor(clust)
colLabels(corrected) <- factor(clust)
corrected <- runTSNE(corrected, dimred = "corrected")
plotTSNE(corrected, colour_by = "label")
The most obvious differential analysis is to look for changes in expression between conditions. We perform the DE analysis separately for each label. The actual DE testing is performed on “pseudo-bulk” expression profiles (Tung et al. 2017), generated by summing counts together for all cells with the same combination of label and sample. This leverages the resolution offered by single-cell technologies to define the labels, and combines it with the statistical rigor of existing methods for DE analyses involving a small number of samples.
# Using 'label' and 'Sample' as our two factors; each column of the output
# corresponds to one unique combination of these two factors.
summed <- aggregateAcrossCells(corrected,
id=colData(corrected)[,c("label", "Sample")])
summed
## class: SingleCellExperiment
## dim: 28377 110
## metadata(3): merge.info pca.info Samples
## assays(1): counts
## rownames(28377): ENSG00000238009 ENSG00000239945 ... ENSG00000278633
## ENSG00000278817
## rowData names(5): rotation ID Symbol Type Chromosome
## colnames: NULL
## colData names(14): batch Sample ... Sample ncells
## reducedDimNames(3): corrected PCA TSNE
## mainExpName: NULL
## altExpNames(0):
At this point, it is worth reflecting on the motivations behind the use of pseudo-bulking:
Larger counts are more amenable to standard DE analysis pipelines designed for bulk RNA-seq data. Normalization is more straightforward and certain statistical approximations are more accurate e.g., the saddlepoint approximation for quasi-likelihood methods or normality for linear models. Collapsing cells into samples reflects the fact that our biological replication occurs at the sample level (Lun and Marioni 2017). Each sample is represented no more than once for each condition, avoiding problems from unmodelled correlations between samples. Supplying the per-cell counts directly to a DE analysis pipeline would imply that each cell is an independent biological replicate, which is not true from an experimental perspective. (A mixed effects model can handle this variance structure but involves extra statistical and computational complexity for little benefit, see Crowell et al. (2019).) Variance between cells within each sample is masked, provided it does not affect variance across (replicate) samples. This avoids penalizing DEGs that are not uniformly up- or down-regulated for all cells in all samples of one condition. Masking is generally desirable as DEGs - unlike marker genes - do not need to have low within-sample variance to be interesting, e.g., if the treatment effect is consistent across replicate populations but heterogeneous on a per-cell basis. (Of course, high per-cell variability will still result in weaker DE if it affects the variability across populations, while homogeneous per-cell responses will result in stronger DE due to a larger population-level log-fold change. These effects are also largely desirable.)
The DE analysis will be performed using quasi-likelihood (QL) methods
from the edgeR
package (Robinson, McCarthy, and Smyth 2010;
Chen, Lun, and Smyth 2016). This uses a negative binomial generalized
linear model (NB GLM) to handle overdispersed count data in experiments
with limited replication. In our case, we have biological variation with
few replicates per sample group, so edgeR
(or its
contemporaries) is a natural choice for the analysis. It is possible to
use DESeq2
here instead, the OSCA book uses
edgeR
so we have stuck to that but it is up to you.
We do not use all labels for GLM fitting as the strong DE between labels makes it difficult to compute a sensible average abundance to model the mean-dispersion trend. Moreover, label-specific batch effects would not be easily handled with a single additive term in the design matrix for the batch. Instead, we picked one of the labels to use for this demonstration.
labelToGet <- "4"
current <- summed[,summed$label==labelToGet]
colData(current)
## DataFrame with 7 rows and 14 columns
## batch Sample Dataset Barcode SampleId sum
## <character> <factor> <character> <character> <character> <numeric>
## 1 ETV6-RUNX1_1 ETV6-RUNX1_1 ETV6-RUNX1 NA SRR9264343 NA
## 2 ETV6-RUNX1_2 ETV6-RUNX1_2 ETV6-RUNX1 NA SRR9264344 NA
## 3 ETV6-RUNX1_3 ETV6-RUNX1_3 ETV6-RUNX1 NA SRR9264345 NA
## 4 ETV6-RUNX1_4 ETV6-RUNX1_4 ETV6-RUNX1 NA SRR9264346 NA
## 5 PBMMC_1 PBMMC_1 PBMMC NA NA NA
## 6 PBMMC_2 PBMMC_2 PBMMC NA SRR9264353 NA
## 7 PBMMC_3 PBMMC_3 PBMMC NA SRR9264354 NA
## detected total sizeFactor Cluster label label Sample
## <integer> <numeric> <numeric> <factor> <factor> <factor> <factor>
## 1 NA NA NA 4 4 4 ETV6-RUNX1_1
## 2 NA NA NA 4 4 4 ETV6-RUNX1_2
## 3 NA NA NA 4 4 4 ETV6-RUNX1_3
## 4 NA NA NA 4 4 4 ETV6-RUNX1_4
## 5 NA NA NA 4 4 4 PBMMC_1
## 6 NA NA NA 4 4 4 PBMMC_2
## 7 NA NA NA 4 4 4 PBMMC_3
## ncells
## <integer>
## 1 15
## 2 141
## 3 1415
## 4 234
## 5 614
## 6 1110
## 7 929
# Creating up a DGEList object for use in edgeR:
countsToUse <- counts(current)
colnames(countsToUse) <- colData(current)$Sample
y <- DGEList(countsToUse, samples=colData(current))
y
## An object of class "DGEList"
## $counts
## ETV6-RUNX1_1 ETV6-RUNX1_2 ETV6-RUNX1_3 ETV6-RUNX1_4 PBMMC_1
## ENSG00000238009 0 0 2 0 0
## ENSG00000239945 0 0 0 0 0
## ENSG00000241860 0 0 3 0 4
## ENSG00000241599 0 0 0 0 0
## ENSG00000286448 0 0 0 0 0
## PBMMC_2 PBMMC_3
## ENSG00000238009 1 0
## ENSG00000239945 0 0
## ENSG00000241860 3 1
## ENSG00000241599 0 0
## ENSG00000286448 0 0
## 28372 more rows ...
##
## $samples
## group lib.size norm.factors batch Sample Dataset
## ETV6-RUNX1_1 1 49091 1 ETV6-RUNX1_1 ETV6-RUNX1_1 ETV6-RUNX1
## ETV6-RUNX1_2 1 314754 1 ETV6-RUNX1_2 ETV6-RUNX1_2 ETV6-RUNX1
## ETV6-RUNX1_3 1 3301851 1 ETV6-RUNX1_3 ETV6-RUNX1_3 ETV6-RUNX1
## ETV6-RUNX1_4 1 361779 1 ETV6-RUNX1_4 ETV6-RUNX1_4 ETV6-RUNX1
## PBMMC_1 1 1090056 1 PBMMC_1 PBMMC_1 PBMMC
## PBMMC_2 1 3024653 1 PBMMC_2 PBMMC_2 PBMMC
## PBMMC_3 1 1779177 1 PBMMC_3 PBMMC_3 PBMMC
## Barcode SampleId sum detected total sizeFactor Cluster label
## ETV6-RUNX1_1 <NA> SRR9264343 NA NA NA NA 4 4
## ETV6-RUNX1_2 <NA> SRR9264344 NA NA NA NA 4 4
## ETV6-RUNX1_3 <NA> SRR9264345 NA NA NA NA 4 4
## ETV6-RUNX1_4 <NA> SRR9264346 NA NA NA NA 4 4
## PBMMC_1 <NA> <NA> NA NA NA NA 4 4
## PBMMC_2 <NA> SRR9264353 NA NA NA NA 4 4
## PBMMC_3 <NA> SRR9264354 NA NA NA NA 4 4
## label.1 Sample.1 ncells
## ETV6-RUNX1_1 4 ETV6-RUNX1_1 15
## ETV6-RUNX1_2 4 ETV6-RUNX1_2 141
## ETV6-RUNX1_3 4 ETV6-RUNX1_3 1415
## ETV6-RUNX1_4 4 ETV6-RUNX1_4 234
## PBMMC_1 4 PBMMC_1 614
## PBMMC_2 4 PBMMC_2 1110
## PBMMC_3 4 PBMMC_3 929
A typical step in bulk RNA-seq data analyses is to remove samples with very low library sizes due to failed library preparation or sequencing. The very low counts in these samples can be troublesome in downstream steps such as normalization or for some statistical approximations used in the DE analysis. In our situation, this is equivalent to removing label-sample combinations that have very few or lowly-sequenced cells. The exact definition of “very low” will vary, but in this case, we remove combinations containing fewer than 20 cells (Crowell et al. 2019).
discarded <- current$ncells < 20
y <- y[,!discarded]
summary(discarded)
## Mode FALSE TRUE
## logical 6 1
Another typical step in bulk RNA-seq analyses is to remove genes that are lowly expressed. This reduces computational work, improves the accuracy of mean-variance trend modelling and decreases the severity of the multiple testing correction. Genes are discarded if they are not expressed above a log-CPM threshold in a minimum number of samples (determined from the size of the smallest treatment group in the experimental design).
keep <- filterByExpr(y, group=current$Dataset)
y <- y[keep,]
summary(keep)
## Mode FALSE TRUE
## logical 20731 7646
Finally, we correct for composition biases by computing normalization factors with the trimmed mean of M-values method (Robinson and Oshlack 2010). Counts for our pseudo-bulk samples are large enough to apply bulk normalization methods.
y <- calcNormFactors(y)
y$samples
As part of the usual diagnostics for a bulk RNA-seq DE analysis, we generate a mean-difference (MD) plot for each normalized pseudo-bulk profile. This should exhibit a trumpet shape centered at zero indicating that the normalization successfully removed systematic bias between profiles. Lack of zero-centering or dominant discrete patterns at low abundances may be symptomatic of deeper problems with normalization, possibly due to insufficient cells/reads/UMIs composing a particular pseudo-bulk profile.
par(mfrow=c(2,4))
for (i in seq_len(ncol(y))) {
plotMD(y, column=i)
}
We also generate a multi-dimensional scaling (MDS) plot for the pseudo-bulk profiles. This is closely related to PCA and allows us to visualize the structure of the data. Here, the aim is to check whether samples separate by our known factors of interest. Strong separation foreshadows a large number of DEGs in the subsequent analysis.
y$samples$Dataset <- factor(y$samples$Dataset)
limma::plotMDS(cpm(y, log=TRUE), col = as.numeric(y$samples$Dataset))
Our aim is to test whether the log-fold change between sample groups is significantly different from zero.
design <- model.matrix(~factor(Dataset), y$samples)
design
## (Intercept) factor(Dataset)PBMMC
## ETV6-RUNX1_2 1 0
## ETV6-RUNX1_3 1 0
## ETV6-RUNX1_4 1 0
## PBMMC_1 1 1
## PBMMC_2 1 1
## PBMMC_3 1 1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$`factor(Dataset)`
## [1] "contr.treatment"
We estimate the negative binomial (NB) dispersions with estimateDisp(). The role of the NB dispersion is to model the mean-variance trend, which is not easily accommodated by QL dispersions alone due to the quadratic nature of the NB mean-variance trend.
y <- estimateDisp(y, design)
summary(y$trended.dispersion)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.04653 0.05260 0.05953 0.05953 0.06798 0.07092
Biological coefficient of variation (BCV) for each gene as a function of the average abundance. The BCV is computed as the square root of the NB dispersion after empirical Bayes shrinkage towards the trend. Trended and common BCV estimates are shown in blue and red, respectively.
plotBCV(y)
We also estimate the quasi-likelihood dispersions with glmQLFit() (Chen, Lun, and Smyth 2016). This fits a GLM to the counts for each gene and estimates the QL dispersion from the GLM deviance. We set robust=TRUE to avoid distortions from highly variable clusters (Phipson et al. 2016). The QL dispersion models the uncertainty and variability of the per-gene variance - which is not well handled by the NB dispersions, so the two dispersion types complement each other in the final analysis.
fit <- glmQLFit(y, design, robust=TRUE)
summary(fit$var.prior)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.5810 0.7528 0.8065 0.8130 0.8978 1.0965
summary(fit$df.prior)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2599 15.1615 15.1615 14.8325 15.1615 15.1615
QL dispersion estimates for each gene as a function of abundance. Raw estimates (black) are shrunk towards the trend (blue) to yield squeezed estimates (red).
plotQLDisp(fit)
We test for differences in expression due to sample group using glmQLFTest(). DEGs are defined as those with non-zero log-fold changes at a false discovery rate of 5%. If very few genes are significantly DE that sample group has little effect on the transcriptome.
res <- glmQLFTest(fit, coef=ncol(design))
summary(decideTests(res))
## factor(Dataset)PBMMC
## Down 29
## NotSig 7573
## Up 44
topTags(res)$table
Now that we have laid out the theory underlying the DE analysis, we
repeat this process for each of the labels. This is conveniently done
using the pseudoBulkDGE
function from scran
,
which will loop over all labels and apply the exact analysis described
above to each label. To prepare for this, we filter out all sample-label
combinations with insufficient cells.
summed.filt <- summed[,summed$ncells >= 20]
We construct a common design matrix that will be used in the analysis for each label. Recall that this matrix should have one row per unique sample (and named as such), reflecting the fact that we are modelling counts on the sample level instead of the cell level.
# Pulling out a sample-level 'targets' data.frame:
targets <- colData(corrected)[!duplicated(corrected$Sample),] %>%
data.frame()
# Constructing the design matrix:
design <- model.matrix(~factor(Dataset), data=targets)
rownames(design) <- targets$Sample
We then apply the pseudoBulkDGE
function to obtain a
list of DE genes for each label. This function puts some additional
effort into automatically dealing with labels that are not represented
in all sample groups, for which a DE analysis between conditions is
meaningless; or are not represented in a sufficient number of replicate
samples to enable modelling of biological variability.
summed.filt$Dataset <- factor(summed.filt$Dataset)
de.results <- pseudoBulkDGE(summed.filt,
label = summed.filt$label,
design = ~Dataset,
coef = "DatasetPBMMC",
condition = summed.filt$Sample
)
We examine the numbers of DEGs at a FDR of 5% for each label using
the decideTestsPerLabel
function. Note that genes listed as
NA were either filtered out as low-abundance genes for a given label’s
analysis, or the comparison of interest was not possible for a
particular label, e.g., due to lack of residual degrees of freedom or an
absence of samples from both conditions.
is.de <- decideTestsPerLabel(de.results, threshold=0.05)
summarizeTestsPerLabel(is.de)
## -1 0 1 NA
## 1 690 10420 415 16852
## 10 10 5917 20 22430
## 11 107 7414 52 20804
## 12 275 8070 495 19537
## 13 11 3485 48 24833
## 14 124 2060 106 26087
## 2 812 7895 947 18723
## 3 319 13212 117 14729
## 4 53 9186 109 19029
## 5 38 4872 48 23419
## 7 30 6215 54 22078
## 8 25 2817 22 25513
For each gene, we compute the percentage of cell types in which that gene is upregulated or downregulated. (Here, we consider a gene to be non-DE if it is not retained after filtering.).
# Upregulated across most cell types.
up.de <- is.de > 0 & !is.na(is.de)
head(sort(rowMeans(up.de), decreasing=TRUE), 10)
## ENSG00000163220 ENSG00000090382 ENSG00000143546 ENSG00000145592 ENSG00000085265
## 0.9166667 0.8333333 0.7500000 0.5000000 0.5000000
## ENSG00000197956 ENSG00000158869 ENSG00000211592 ENSG00000152601 ENSG00000124406
## 0.4166667 0.4166667 0.4166667 0.4166667 0.4166667
# Downregulated across cell types.
down.de <- is.de < 0 & !is.na(is.de)
head(sort(rowMeans(down.de), decreasing=TRUE), 10)
## ENSG00000132604 ENSG00000110492 ENSG00000227706 ENSG00000107447 ENSG00000120833
## 0.9166667 0.8333333 0.7500000 0.7500000 0.7500000
## ENSG00000118523 ENSG00000198520 ENSG00000168394 ENSG00000184489 ENSG00000015413
## 0.6666667 0.5833333 0.5833333 0.5833333 0.5833333
We further identify label-specific DE genes that are significant in our label of interest yet not DE in any other label. As hypothesis tests are not typically geared towards identifying genes that are not DE, we use an ad hoc approach where we consider a gene to be consistent with the null hypothesis for a label if it fails to be detected even at a generous FDR threshold of 50%.
remotely.de <- decideTestsPerLabel(de.results, threshold=0.5)
not.de <- remotely.de==0 | is.na(remotely.de)
# pick cluster 1
cx <- "1"
other.labels <- setdiff(colnames(not.de), cx)
unique.degs <- is.de[,cx]!=0 & rowMeans(not.de[,other.labels])==1
unique.degs <- names(which(unique.degs))
head(unique.degs)
## [1] "ENSG00000150938" "ENSG00000153563" "ENSG00000133256" "ENSG00000286848"
## [5] "ENSG00000164362" "ENSG00000113356"
# Choosing the top-ranked gene for inspection:
de.inspec <- list()
de.inspec[[cx]] <- de.results[[cx]]
de.inspec[[cx]] <- de.inspec[[cx]][order(de.inspec[[cx]]$PValue),]
de.inspec[[cx]] <- de.inspec[[cx]][rownames(de.inspec[[cx]]) %in% unique.degs,]
sizeFactors(summed.filt) <- NULL
plotExpression(logNormCounts(summed.filt),
features=rownames(de.inspec[[cx]])[1],
x="Sample", colour_by="Sample",
other_fields="label") +
facet_wrap(~label) +
ggtitle(glue::glue("{cx}: {rownames(de.inspec[[cx]])[1]}"))
## Warning: Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Warning: Computation failed in `stat_ydensity()`:
## replacement has 1 row, data has 0
## Warning: Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Warning: Computation failed in `stat_ydensity()`:
## replacement has 1 row, data has 0
## Warning: Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Warning: Computation failed in `stat_ydensity()`:
## replacement has 1 row, data has 0
## Warning: Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Warning: Computation failed in `stat_ydensity()`:
## replacement has 1 row, data has 0
## Warning: Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Warning: Computation failed in `stat_ydensity()`:
## replacement has 1 row, data has 0
## Warning: Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Warning: Computation failed in `stat_ydensity()`:
## replacement has 1 row, data has 0
## Warning: Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Warning: Computation failed in `stat_ydensity()`:
## replacement has 1 row, data has 0
## Warning: Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Warning: Computation failed in `stat_ydensity()`:
## replacement has 1 row, data has 0
## Warning: Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Warning: Computation failed in `stat_ydensity()`:
## replacement has 1 row, data has 0
## Warning: Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Warning: Computation failed in `stat_ydensity()`:
## replacement has 1 row, data has 0
## Warning: Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Warning: Computation failed in `stat_ydensity()`:
## replacement has 1 row, data has 0
## Warning: Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Warning: Computation failed in `stat_ydensity()`:
## replacement has 1 row, data has 0
## Warning: Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Warning: Computation failed in `stat_ydensity()`:
## replacement has 1 row, data has 0
## Warning: Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Warning: Computation failed in `stat_ydensity()`:
## replacement has 1 row, data has 0
## Warning: Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Warning: Computation failed in `stat_ydensity()`:
## replacement has 1 row, data has 0
## Warning: Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Warning: Computation failed in `stat_ydensity()`:
## replacement has 1 row, data has 0
## Warning: Groups with fewer than two data points have been dropped.
## Warning: Computation failed in `stat_ydensity()`:
## replacement has 1 row, data has 0
Identify label-specific DE genes that are significant in ‘c10’ yet not DE in any other label.
Plot the top-ranked gene for inspection.
The above example of differential expression focused on testing for differences in expression between conditions for the same cell type or label. However, the same methodology can be applied to test for differences between cell types across samples. This kind of DE analysis can overcome a lack of suitable replication.
summed.sub <- summed[,summed$Cluster %in% c("3", "4")]
# Using a dummy value for the label to allow us to include multiple cell types
# in the fitted model; otherwise, each cell type will be processed separately.
between.res <- pseudoBulkDGE(summed.sub,
label=rep("dummy", ncol(summed.sub)),
design=~factor(Sample) + factor(Cluster),
coef="factor(Cluster)4")[[1]]
table(Sig=between.res$FDR <= 0.05, Sign=sign(between.res$logFC))
## Sign
## Sig -1 1
## FALSE 4034 2676
## TRUE 3008 1496
between.res[order(between.res$PValue),]
## DataFrame with 28377 rows and 5 columns
## logFC logCPM F PValue FDR
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSG00000019582 -4.20137 11.81529 705.477 1.93446e-29 2.16930e-25
## ENSG00000081189 -6.00394 8.11585 618.845 3.20053e-28 1.79454e-24
## ENSG00000272398 -5.15958 9.02102 599.184 6.36715e-28 2.38004e-24
## ENSG00000223865 -4.62725 9.63222 563.071 2.38328e-27 6.68152e-24
## ENSG00000196126 -4.60135 9.04103 555.379 3.18904e-27 7.15238e-24
## ... ... ... ... ... ...
## ENSG00000274792 NA NA NA NA NA
## ENSG00000275869 NA NA NA NA NA
## ENSG00000277836 NA NA NA NA NA
## ENSG00000278633 NA NA NA NA NA
## ENSG00000278817 NA NA NA NA NA
summed.sub <- logNormCounts(summed.sub, size.factors=NULL)
plotExpression(summed.sub,
features=head(rownames(between.res)[order(between.res$PValue)]),
x="Cluster",
colour_by=I(factor(summed.sub$Sample)))
Whether or not this is a scientifically meaningful comparison depends on the nature of the labels. These particular labels were defined by clustering, which means that the presence of DEGs is a foregone conclusion. Nonetheless, it may have some utility for applications where the labels are defined using independent information, e.g., from FACS.
In a DA analysis, we test for significant changes in per-label cell abundance across conditions. This will reveal which cell types are depleted or enriched upon treatment, which is arguably just as interesting as changes in expression within each cell type. The DA analysis has a long history in flow cytometry (Finak et al. 2014; Lun, Richard, and Marioni 2017) where it is routinely used to examine the effects of different conditions on the composition of complex cell populations. By performing it here, we effectively treat scRNA-seq as a “super-FACS” technology for defining relevant subpopulations using the entire transcriptome.
We prepare for the DA analysis by quantifying the number of cells assigned to each label (or cluster).
abundances <- table(corrected$Cluster, corrected$Sample)
abundances <- unclass(abundances)
head(abundances)
##
## ETV6-RUNX1_1 ETV6-RUNX1_2 ETV6-RUNX1_3 ETV6-RUNX1_4 PBMMC_1 PBMMC_2 PBMMC_3
## 1 768 1301 333 1491 39 57 117
## 2 504 15 8 29 176 116 159
## 3 1345 4543 1729 2074 45 46 174
## 4 15 141 1415 234 614 1110 929
## 5 2 7 116 411 8 504 44
## 6 24 210 75 19 10 16 4
Performing the DA analysis
Our DA analysis will again be performed with the edgeR package. This allows us to take advantage of the NB GLM methods to model overdispersed count data in the presence of limited replication - except that the counts are not of reads per gene, but of cells per label (Lun, Richard, and Marioni 2017). The aim is to share information across labels to improve our estimates of the biological variability in cell abundance between replicates.
# Attaching some column metadata.
extra.info <- colData(corrected)[match(colnames(abundances), corrected$Sample),]
y.ab <- DGEList(abundances, samples=extra.info)
y.ab
## An object of class "DGEList"
## $counts
##
## ETV6-RUNX1_1 ETV6-RUNX1_2 ETV6-RUNX1_3 ETV6-RUNX1_4 PBMMC_1 PBMMC_2 PBMMC_3
## 1 768 1301 333 1491 39 57 117
## 2 504 15 8 29 176 116 159
## 3 1345 4543 1729 2074 45 46 174
## 4 15 141 1415 234 614 1110 929
## 5 2 7 116 411 8 504 44
## 12 more rows ...
##
## $samples
## group lib.size norm.factors batch Sample Dataset
## ETV6-RUNX1_1 1 2695 1 ETV6-RUNX1_1 ETV6-RUNX1_1 ETV6-RUNX1
## ETV6-RUNX1_2 1 6473 1 ETV6-RUNX1_2 ETV6-RUNX1_2 ETV6-RUNX1
## ETV6-RUNX1_3 1 4851 1 ETV6-RUNX1_3 ETV6-RUNX1_3 ETV6-RUNX1
## ETV6-RUNX1_4 1 5661 1 ETV6-RUNX1_4 ETV6-RUNX1_4 ETV6-RUNX1
## PBMMC_1 1 2089 1 PBMMC_1 PBMMC_1 PBMMC
## PBMMC_2 1 4836 1 PBMMC_2 PBMMC_2 PBMMC
## PBMMC_3 1 4168 1 PBMMC_3 PBMMC_3 PBMMC
## Barcode SampleId sum detected total sizeFactor
## ETV6-RUNX1_1 AAACCTGAGACTTTCG-1 SRR9264343 6677 2056 6677 1.8314606
## ETV6-RUNX1_2 AAACCTGAGCAGACTG-2 SRR9264344 6868 2761 6868 1.8838507
## ETV6-RUNX1_3 AAACCTGAGGTGATTA-3 SRR9264345 656 399 656 0.1799368
## ETV6-RUNX1_4 AAACCTGAGAGTGACC-4 SRR9264346 2702 1297 2702 0.7411422
## PBMMC_1 AAACCTGCACTTCGAA-9 SRR9264351 2225 883 2225 0.6103040
## PBMMC_2 AAACCTGAGAAACGAG-11 SRR9264353 5793 352 5793 1.5889847
## PBMMC_3 AAACCTGAGGACAGCT-12 SRR9264354 1950 615 1950 0.5348732
## Cluster label
## ETV6-RUNX1_1 1 1
## ETV6-RUNX1_2 1 1
## ETV6-RUNX1_3 7 7
## ETV6-RUNX1_4 3 3
## PBMMC_1 15 15
## PBMMC_2 8 8
## PBMMC_3 4 4
We filter out low-abundance labels as previously described. This avoids cluttering the result table with very rare subpopulations that contain only a handful of cells. For a DA analysis of cluster abundances, filtering is generally not required as most clusters will not be of low-abundance (otherwise there would not have been enough evidence to define the cluster in the first place).
keep <- filterByExpr(y.ab, group=y.ab$samples$Dataset)
y.ab <- y.ab[keep,]
summary(keep)
## Mode FALSE TRUE
## logical 2 15
Unlike DE analyses, we do not perform an additional normalization step with calcNormFactors(). This means that we are only normalizing based on the “library size”, i.e., the total number of cells in each sample. Any changes we detect between conditions will subsequently represent differences in the proportion of cells in each cluster.
Here, the log-fold change in our model refers to the change in cell abundance between sample groups, rather than the change in gene expression.
design.ab <- model.matrix(~factor(Dataset), y.ab$samples)
We use the estimateDisp() function to estimate the NB dispersion for each cluster. We turn off the trend as we do not have enough points for its stable estimation.
y.ab <- estimateDisp(y.ab, design.ab, trend="none")
summary(y.ab$common.dispersion)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.279 1.279 1.279 1.279 1.279 1.279
plotBCV(y.ab, cex=1)
We repeat this process with the QL dispersion, again disabling the trend.
fit.ab <- glmQLFit(y.ab, design.ab, robust=TRUE, abundance.trend=FALSE)
summary(fit.ab$var.prior)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.08 1.08 1.08 1.08 1.08 1.08
summary(fit.ab$df.prior)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.431 10.084 11.192 10.631 11.192 11.192
plotQLDisp(fit.ab, cex=1)
We test for differences in abundance between sample groups using
glmQLFTest
.
res.ab <- glmQLFTest(fit.ab, coef=ncol(design))
summary(decideTests(res.ab))
## factor(Dataset)PBMMC
## Down 1
## NotSig 11
## Up 3
topTags(res.ab)
## Coefficient: factor(Dataset)PBMMC
## logFC logCPM F PValue FDR
## 15 7.391339 15.84681 19.154316 0.0004870632 0.007305948
## 9 5.318875 15.05862 14.999060 0.0013224288 0.008937726
## 12 5.124446 15.75690 13.919037 0.0017875451 0.008937726
## 3 -4.307065 18.12281 10.206317 0.0055718399 0.020894400
## 1 -3.387948 16.93813 6.915820 0.0180734043 0.051742641
## 13 3.335043 13.05429 6.570395 0.0206970565 0.051742641
## 6 -2.323505 13.33929 3.141462 0.0951370277 0.203865059
## 11 2.200692 15.00314 2.477686 0.1353598374 0.253799695
## 10 1.552649 15.45982 1.647648 0.2173452364 0.362242061
## 4 1.464691 17.27317 1.245638 0.2811404606 0.421710691
As mentioned above, we do not use calcNormFactors
in our
default DA analysis. This normalization step assumes that most of the
input features are not different between conditions. While this
assumption is reasonable for most types of gene expression data, it is
generally too strong for cell type abundance - most experiments consist
of only a few cell types that may all change in abundance upon
perturbation. Thus, our default approach is to only normalize based on
the total number of cells in each sample, which means that we are
effectively testing for differential proportions between conditions.
Unfortunately, the use of the total number of cells leaves us susceptible to composition effects. For example, a large increase in abundance for one cell subpopulation will introduce decreases in proportion for all other subpopulations - which is technically correct, but may be misleading if one concludes that those other subpopulations are decreasing in abundance of their own volition. If composition biases are proving problematic for interpretation of DA results, we have several avenues for removing them or mitigating their impact by leveraging a priori biological knowledge.
If it is possible to assume that most labels (i.e., cell types) do
not change in abundance, we can use calcNormFactors
to
compute normalization factors.
y.ab2 <- calcNormFactors(y.ab)
y.ab2$samples$norm.factors
## [1] 1.676738 0.170929 1.266416 1.337315 1.152452 1.598872 1.118079
We then proceed with the remainder of the edgeR analysis, shown below in condensed format. A shift of positive log-fold changes towards zero is consistent with the removal of composition biases.
y.ab2 <- estimateDisp(y.ab2, design.ab, trend="none")
fit.ab2 <- glmQLFit(y.ab2, design.ab, robust=TRUE, abundance.trend=FALSE)
res.ab2 <- glmQLFTest(fit.ab2, coef=ncol(design))
topTags(res.ab2, n=10)
## Coefficient: factor(Dataset)PBMMC
## logFC logCPM F PValue FDR
## 15 5.257970 15.66379 7.9470060 0.006156997 0.04983305
## 3 -5.905669 19.45123 7.7945158 0.006644406 0.04983305
## 12 4.673103 15.50242 6.7270443 0.011413534 0.05706767
## 9 4.242152 14.89462 5.6571106 0.019931354 0.06706544
## 1 -4.618755 17.84262 5.4354614 0.022420361 0.06706544
## 6 -4.470452 14.92189 5.1005226 0.026826177 0.06706544
## 11 2.005537 14.56177 1.5048297 0.223766885 0.47950047
## 13 1.398529 13.15358 0.7178347 0.399550679 0.68159982
## 14 -1.421365 11.82888 0.6895298 0.408959895 0.68159982
## 4 1.024351 17.11251 0.4036993 0.527118821 0.79067823
sessionInfo()
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] bluster_1.4.0 DT_0.20
## [3] patchwork_1.1.1 forcats_0.5.1
## [5] stringr_1.4.0 dplyr_1.0.8
## [7] purrr_0.3.4 readr_2.1.2
## [9] tidyr_1.2.0 tibble_3.1.6
## [11] tidyverse_1.3.1 edgeR_3.36.0
## [13] limma_3.50.0 batchelor_1.10.0
## [15] scran_1.22.1 scater_1.22.0
## [17] ggplot2_3.3.5 scuttle_1.4.0
## [19] SingleCellExperiment_1.16.0 SummarizedExperiment_1.24.0
## [21] Biobase_2.54.0 GenomicRanges_1.46.1
## [23] GenomeInfoDb_1.30.1 IRanges_2.28.0
## [25] S4Vectors_0.32.3 BiocGenerics_0.40.0
## [27] MatrixGenerics_1.6.0 matrixStats_0.61.0
##
## loaded via a namespace (and not attached):
## [1] Rtsne_0.15 ggbeeswarm_0.6.0
## [3] colorspace_2.0-2 ellipsis_0.3.2
## [5] XVector_0.34.0 BiocNeighbors_1.12.0
## [7] fs_1.5.2 rstudioapi_0.13
## [9] farver_2.1.0 ggrepel_0.9.1
## [11] fansi_1.0.2 lubridate_1.8.0
## [13] xml2_1.3.3 splines_4.1.2
## [15] sparseMatrixStats_1.6.0 knitr_1.37
## [17] jsonlite_1.7.3 ResidualMatrix_1.4.0
## [19] broom_0.7.12 cluster_2.1.2
## [21] dbplyr_2.1.1 compiler_4.1.2
## [23] httr_1.4.2 dqrng_0.3.0
## [25] backports_1.4.1 assertthat_0.2.1
## [27] Matrix_1.4-0 fastmap_1.1.0
## [29] cli_3.1.1 BiocSingular_1.10.0
## [31] htmltools_0.5.2 tools_4.1.2
## [33] rsvd_1.0.5 igraph_1.2.11
## [35] gtable_0.3.0 glue_1.6.1
## [37] GenomeInfoDbData_1.2.7 Rcpp_1.0.8
## [39] cellranger_1.1.0 jquerylib_0.1.4
## [41] vctrs_0.3.8 DelayedMatrixStats_1.16.0
## [43] xfun_0.29 beachmat_2.10.0
## [45] rvest_1.0.2 lifecycle_1.0.1
## [47] irlba_2.3.5 statmod_1.4.36
## [49] zlibbioc_1.40.0 scales_1.1.1
## [51] hms_1.1.1 parallel_4.1.2
## [53] yaml_2.2.2 gridExtra_2.3
## [55] sass_0.4.0 stringi_1.7.6
## [57] highr_0.9 ScaledMatrix_1.2.0
## [59] BiocParallel_1.28.3 rlang_1.0.1
## [61] pkgconfig_2.0.3 bitops_1.0-7
## [63] evaluate_0.14 lattice_0.20-45
## [65] labeling_0.4.2 htmlwidgets_1.5.4
## [67] cowplot_1.1.1 tidyselect_1.1.1
## [69] magrittr_2.0.2 R6_2.5.1
## [71] generics_0.1.2 metapod_1.2.0
## [73] DelayedArray_0.20.0 DBI_1.1.2
## [75] pillar_1.7.0 haven_2.4.3
## [77] withr_2.4.3 RCurl_1.98-1.6
## [79] modelr_0.1.8 crayon_1.4.2
## [81] utf8_1.2.2 tzdb_0.2.0
## [83] rmarkdown_2.11 viridis_0.6.2
## [85] locfit_1.5-9.4 grid_4.1.2
## [87] readxl_1.3.1 reprex_2.0.1
## [89] digest_0.6.29 munsell_0.5.0
## [91] beeswarm_0.4.0 viridisLite_0.4.0
## [93] vipor_0.4.5 bslib_0.3.1