splSetToGet <- "PBMMC,ETV6-RUNX1"
splSetVec <- unlist(strsplit(splSetToGet, ","))
splSetToGet2 <- gsub(",", "_", splSetToGet)
nbPcToComp <- 50
figSize <- 7
library(scater)
library(scran)
library(batchelor)
library(edgeR)
library(tidyverse)
library(patchwork)
library(DT)
fontsize <- theme(axis.text=element_text(size=12), axis.title=element_text(size=16))
Source: Multi-sample comparisons of the OSCA book.
A powerful use of scRNA-seq technology lies in the design of replicated multi-condition experiments to detect changes in composition or expression between conditions. For example, a researcher could use this strategy to detect changes in cell type abundance after drug treatment (Richard et al. 2018) or genetic modifications (Scialdone et al. 2016). This provides more biological insight than conventional scRNA-seq experiments involving only one biological condition, especially if we can relate population changes to specific experimental perturbations.
Differential analyses of multi-condition scRNA-seq experiments can be broadly split into two categories - differential expression (DE) and differential abundance (DA) analyses. The former tests for changes in expression between conditions for cells of the same type that are present in both conditions, while the latter tests for changes in the composition of cell types (or states, etc.) between conditions.
We have used the data set comprising the 7 samples (500 cells per sample) analysed with fastMNN.
The differential analyses will be predicated on many of the pre-processing steps covered previously. For brevity, we will not explicitly repeat them here, only noting that we have already merged cells from all samples into the same coordinate system and clustered the merged dataset to obtain a common partitioning across all samples.
Load the SCE object:
# Read object in:
sce <- readRDS("../Robjects/caron_postDeconv_5hCellPerSpl_dsi_PBMMC_ETV6-RUNX1_clust.Rds")
sce
## class: SingleCellExperiment
## dim: 9479 3500
## metadata(2): merge.info pca.info
## assays(1): reconstructed
## rownames(9479): TSPAN6 SCYL3 ... AL132709.8 AL109627.1
## rowData names(8): ID Symbol ... gene_sparsity rotation
## colnames(3500): 1_AAACGGGCAGTTCATG-1 1_AAACGGGGTTCACCTC-1 ...
## 12_TTTGTCACAATGAAAC-1 12_TTTGTCATCAGTTGAC-1
## colData names(28): SampleName Barcode ... cluster.out.10
## cluster.out.10.col
## reducedDimNames(3): corrected TSNE UMAP
## mainExpName: NULL
## altExpNames(0):
colData(sce) %>% head
## DataFrame with 6 rows and 28 columns
## SampleName Barcode SampleId SampleGroup
## <character> <character> <character> <factor>
## 1_AAACGGGCAGTTCATG-1 ETV6-RUNX1_1 AAACGGGCAGTTCATG-1 SRR9264343 ETV6-RUNX1
## 1_AAACGGGGTTCACCTC-1 ETV6-RUNX1_1 AAACGGGGTTCACCTC-1 SRR9264343 ETV6-RUNX1
## 1_AAAGATGAGCGATGAC-1 ETV6-RUNX1_1 AAAGATGAGCGATGAC-1 SRR9264343 ETV6-RUNX1
## 1_AAAGATGCAGCCAATT-1 ETV6-RUNX1_1 AAAGATGCAGCCAATT-1 SRR9264343 ETV6-RUNX1
## 1_AAAGTAGCAATGCCAT-1 ETV6-RUNX1_1 AAAGTAGCAATGCCAT-1 SRR9264343 ETV6-RUNX1
## 1_AAATGCCAGAACAATC-1 ETV6-RUNX1_1 AAATGCCAGAACAATC-1 SRR9264343 ETV6-RUNX1
## DatasetName sum detected subsets_Mito_sum
## <character> <numeric> <integer> <numeric>
## 1_AAACGGGCAGTTCATG-1 Caron 8103 2299 430
## 1_AAACGGGGTTCACCTC-1 Caron 9552 2843 752
## 1_AAAGATGAGCGATGAC-1 Caron 1693 894 75
## 1_AAAGATGCAGCCAATT-1 Caron 6164 2063 402
## 1_AAAGTAGCAATGCCAT-1 Caron 5816 1943 434
## 1_AAATGCCAGAACAATC-1 Caron 2533 1080 271
## subsets_Mito_detected subsets_Mito_percent total
## <integer> <numeric> <numeric>
## 1_AAACGGGCAGTTCATG-1 11 5.30668 8103
## 1_AAACGGGGTTCACCTC-1 12 7.87270 9552
## 1_AAAGATGAGCGATGAC-1 10 4.43001 1693
## 1_AAAGATGCAGCCAATT-1 11 6.52174 6164
## 1_AAAGTAGCAATGCCAT-1 12 7.46217 5816
## 1_AAATGCCAGAACAATC-1 12 10.69878 2533
## low_lib_size low_n_features high_Mito_percent discard
## <logical> <logical> <logical> <logical>
## 1_AAACGGGCAGTTCATG-1 FALSE FALSE FALSE FALSE
## 1_AAACGGGGTTCACCTC-1 FALSE FALSE FALSE FALSE
## 1_AAAGATGAGCGATGAC-1 FALSE FALSE FALSE FALSE
## 1_AAAGATGCAGCCAATT-1 FALSE FALSE FALSE FALSE
## 1_AAAGTAGCAATGCCAT-1 FALSE FALSE FALSE FALSE
## 1_AAATGCCAGAACAATC-1 FALSE FALSE FALSE FALSE
## sum.1 detected.1 total.1 sizeFactor cell_sparsity
## <numeric> <integer> <numeric> <numeric> <numeric>
## 1_AAACGGGCAGTTCATG-1 8101 2298 8101 3.636419 0.870113
## 1_AAACGGGGTTCACCTC-1 9550 2841 9550 5.029599 0.839379
## 1_AAAGATGAGCGATGAC-1 1692 893 1692 0.953496 0.949492
## 1_AAAGATGCAGCCAATT-1 6160 2059 6160 3.072866 0.883446
## 1_AAAGTAGCAATGCCAT-1 5815 1942 5815 2.777673 0.890226
## 1_AAATGCCAGAACAATC-1 2531 1078 2531 1.233575 0.938983
## label cluster kmeans10 kmeans.best louvain leiden
## <factor> <factor> <numeric> <numeric> <factor> <factor>
## 1_AAACGGGCAGTTCATG-1 2 4 5 6 4 4
## 1_AAACGGGGTTCACCTC-1 2 4 5 6 4 4
## 1_AAAGATGAGCGATGAC-1 5 3 10 7 3 2
## 1_AAAGATGCAGCCAATT-1 4 5 6 15 1 6
## 1_AAAGTAGCAATGCCAT-1 5 5 6 15 1 6
## 1_AAATGCCAGAACAATC-1 5 2 6 9 3 2
## cluster.out.10 cluster.out.10.col
## <factor> <character>
## 1_AAACGGGCAGTTCATG-1 5 #CA8385
## 1_AAACGGGGTTCACCTC-1 5 #CA8385
## 1_AAAGATGAGCGATGAC-1 2 #AA38D5
## 1_AAAGATGCAGCCAATT-1 2 #AA38D5
## 1_AAAGTAGCAATGCCAT-1 2 #AA38D5
## 1_AAATGCCAGAACAATC-1 2 #AA38D5
A brief inspection of the results shows clusters contain varying contributions from batches:
colLabels(sce) <- sce$louvain # clusters.mnn
tab <- table(colLabels(sce), sce$SampleGroup)
tab
##
## ETV6-RUNX1 HHD PBMMC PRE-T
## 1 135 0 116 0
## 2 438 0 53 0
## 3 584 0 10 0
## 4 366 0 149 0
## 5 2 0 168 0
## 6 202 0 398 0
## 7 23 0 85 0
## 8 2 0 168 0
## 9 140 0 136 0
## 10 40 0 111 0
## 11 68 0 106 0
tab <- table(colLabels(sce), sce$SampleName)
pheatmap::pheatmap(tab,
border_color = NA,
drop_levels = TRUE,
cluster_rows = !FALSE,
cluster_cols = !FALSE
)
On the t-SNE plots below, cells are coloured by type or sample (‘batch of origin’). Cluster numbers are superimposed based on the median coordinate of cells assigned to that cluster.
p1 <- plotTSNE(sce, colour_by="SampleGroup", text_by="label", point_size=0.3)
p2 <- plotTSNE(sce, colour_by="SampleName", point_size=0.3)
gridExtra::grid.arrange(p1, p2+facet_wrap(~colData(sce)$SampleName), ncol=2)
The data set used so far comprises 500 cells per sample, by design, to keep it small and computations short for the course. The contingency table above shows some samples contribute few cells to some clusters, which would prevent their analysis.
For the differential expression and abundance analyses we will use a larger data, with 1200 cells per sample. We will first load ‘uncorrected’ (i.e. non merged) SCE object keeping normalised counts. We then merge samples as done previously, also showing here how to use a convenient function correctExperiments
from the batchelor
package, and how to set the order to merge samples in with fastMNN
, here PBMMCs sample first, then ETV6-RUNX1s. We finally perform clustering and dimensionality reduction for visualisation.
# read in 'uncorrected' SCE object with 1200 cells per sample
# (see code in multiSplCompPrep2kcpsUncorr.R used to make that file
# from the uncorrected SCE object keeping all cells, too large for simple GitHub)
uncorrected <- readRDS("../Robjects/caron_sce_nz_postDeconv_1p2kcps_dsi_PBMMC_ETV6-RUNX1_uncorr.Rds")
#--- merging ---#
# see batchelor
set.seed(01001001)
# order based on levels of batch here sampleName
# get indices of samples in the levels of sample metadata column
# that will be used to inform on batches,
# here SampleName
uncorrected$SampleName <- factor(uncorrected$SampleName)
pbmmc_index <- grep("PBMMC", levels(uncorrected$SampleName)) %>% as.list
etv6runx1_index <- grep("ETV6-RUNX1", levels(uncorrected$SampleName)) %>% as.list
# set merging order
mergeOrderList <- list(pbmmc_index, etv6runx1_index)
# merge samples, with fastMNN
merged <- correctExperiments(uncorrected,
batch = uncorrected$SampleName,
subset.row = uncorrected$chosenHvg,
PARAM = FastMnnParam(
merge.order=mergeOrderList
)
)
# glimpse at outcome
merged
## class: SingleCellExperiment
## dim: 14971 8400
## metadata(2): merge.info pca.info
## assays(3): reconstructed counts logcounts
## rownames(14971): DPM1 SCYL3 ... AC139491.7 AC003043.2
## rowData names(12): rotation ensembl_gene_id ... detected gene_sparsity
## colnames(8400): GGACGTCCATCGACGC-1 ATTGGACAGTTGAGAT-1 ...
## GTCACAAAGTTACCCA-12 CTTGGCTCAAGAAAGG-12
## colData names(22): batch Sample ... SampleName SampleGroup
## reducedDimNames(4): corrected PCA TSNE UMAP
## mainExpName: NULL
## altExpNames(0):
#--- clustering ---#
g <- buildSNNGraph(merged, use.dimred = "corrected")
clusters <- igraph::cluster_louvain(g)
merged$clusters.mnn <- factor(paste0("c", clusters$membership),
levels = paste0("c", sort(unique(clusters$membership))))
#colLabels(merged) <- merged$clusters.mnn
#--- dimensionality-reduction ---#
merged <- runTSNE(merged, dimred = "corrected", external_neighbors=TRUE)
merged <- runUMAP(merged, dimred = "corrected", external_neighbors=TRUE)
# write merged to file:
merged_counts <- counts(merged)
saveRDS(merged_counts, file = "../Robjects/caron_sce_nz_postDeconv_1p2kcps_dsi_PBMMC_ETV6-RUNX1_merged_counts.Rds")
counts(merged) <- NULL # remove raw counts to save space
saveRDS(merged, file = "../Robjects/caron_sce_nz_postDeconv_1p2kcps_dsi_PBMMC_ETV6-RUNX1_merged.Rds")
counts(merged) <- merged_counts
Cluster x sample contingency table:
tab <- table(merged$clusters.mnn, merged$SampleName)
pheatmap::pheatmap(tab,
border_color = NA,
drop_levels = TRUE,
cluster_rows = !FALSE,
cluster_cols = FALSE
)
TSNE plots:
p1 <- plotTSNE(merged, colour_by="SampleGroup", text_by="clusters.mnn", point_size = 0.3)
p2 <- plotTSNE(merged, colour_by="SampleName", point_size = 0.3) +
facet_wrap(~colData(merged)$SampleGroup)
#p1/p2 + plot_layout(heights = c(2, 1))
p1 + p2 + plot_layout(widths = c(1, 2))
The most obvious differential analysis is to look for changes in expression between conditions. We perform the DE analysis separately for each label. The actual DE testing is performed on “pseudo-bulk” expression profiles (Tung et al. 2017), generated by summing counts together for all cells with the same combination of label and sample. This leverages the resolution offered by single-cell technologies to define the labels, and combines it with the statistical rigor of existing methods for DE analyses involving a small number of samples.
# Using 'label' and 'sample' as our two factors; each column of the output
# corresponds to one unique combination of these two factors.
columnsToUse <- c("batch", "SampleName", "SampleGroup", "clusters.mnn")
colData(merged) <- colData(merged) %>% data.frame() %>% select(all_of(columnsToUse)) %>% DataFrame
summed <- aggregateAcrossCells(merged,
id = DataFrame(
label = merged$clusters.mnn,
sample = merged$SampleName
)
)
colData(summed) %>% head(3)
## DataFrame with 3 rows and 7 columns
## batch SampleName SampleGroup clusters.mnn label sample
## <character> <factor> <factor> <factor> <factor> <factor>
## 1 ETV6-RUNX1_1 ETV6-RUNX1_1 ETV6-RUNX1 c1 c1 ETV6-RUNX1_1
## 2 ETV6-RUNX1_2 ETV6-RUNX1_2 ETV6-RUNX1 c1 c1 ETV6-RUNX1_2
## 3 ETV6-RUNX1_3 ETV6-RUNX1_3 ETV6-RUNX1 c1 c1 ETV6-RUNX1_3
## ncells
## <integer>
## 1 159
## 2 6
## 3 1
At this point, it is worth reflecting on the motivations behind the use of pseudo-bulking:
Larger counts are more amenable to standard DE analysis pipelines designed for bulk RNA-seq data. Normalization is more straightforward and certain statistical approximations are more accurate e.g., the saddlepoint approximation for quasi-likelihood methods or normality for linear models. Collapsing cells into samples reflects the fact that our biological replication occurs at the sample level (Lun and Marioni 2017). Each sample is represented no more than once for each condition, avoiding problems from unmodelled correlations between samples. Supplying the per-cell counts directly to a DE analysis pipeline would imply that each cell is an independent biological replicate, which is not true from an experimental perspective. (A mixed effects model can handle this variance structure but involves extra statistical and computational complexity for little benefit, see Crowell et al. (2019).) Variance between cells within each sample is masked, provided it does not affect variance across (replicate) samples. This avoids penalizing DEGs that are not uniformly up- or down-regulated for all cells in all samples of one condition. Masking is generally desirable as DEGs - unlike marker genes - do not need to have low within-sample variance to be interesting, e.g., if the treatment effect is consistent across replicate populations but heterogeneous on a per-cell basis. (Of course, high per-cell variability will still result in weaker DE if it affects the variability across populations, while homogeneous per-cell responses will result in stronger DE due to a larger population-level log-fold change. These effects are also largely desirable.)
The DE analysis will be performed using quasi-likelihood (QL) methods from the edgeR package (Robinson, McCarthy, and Smyth 2010; Chen, Lun, and Smyth 2016). This uses a negative binomial generalized linear model (NB GLM) to handle overdispersed count data in experiments with limited replication. In our case, we have biological variation with few replicates per sample group, so edgeR (or its contemporaries) is a natural choice for the analysis.
We do not use all labels for GLM fitting as the strong DE between labels makes it difficult to compute a sensible average abundance to model the mean-dispersion trend. Moreover, label-specific batch effects would not be easily handled with a single additive term in the design matrix for the batch. Instead, we arbitrarily pick one of the labels to use for this demonstration.
labelToGet <- "c7"
current <- summed[,summed$label==labelToGet]
colData(current)
## DataFrame with 7 rows and 7 columns
## batch SampleName SampleGroup clusters.mnn label sample
## <character> <factor> <factor> <factor> <factor> <factor>
## 1 ETV6-RUNX1_1 ETV6-RUNX1_1 ETV6-RUNX1 c7 c7 ETV6-RUNX1_1
## 2 ETV6-RUNX1_2 ETV6-RUNX1_2 ETV6-RUNX1 c7 c7 ETV6-RUNX1_2
## 3 ETV6-RUNX1_3 ETV6-RUNX1_3 ETV6-RUNX1 c7 c7 ETV6-RUNX1_3
## 4 ETV6-RUNX1_4 ETV6-RUNX1_4 ETV6-RUNX1 c7 c7 ETV6-RUNX1_4
## 5 PBMMC_1 PBMMC_1 PBMMC c7 c7 PBMMC_1
## 6 PBMMC_3 PBMMC_3 PBMMC c7 c7 PBMMC_3
## 7 PBMMC_4 PBMMC_4 PBMMC c7 c7 PBMMC_4
## ncells
## <integer>
## 1 375
## 2 189
## 3 73
## 4 329
## 5 214
## 6 47
## 7 106
# Creating up a DGEList object for use in edgeR:
countsToUse <- counts(current)
colnames(countsToUse) <- colData(current)$SampleName
y <- DGEList(countsToUse, samples=colData(current))
y
## An object of class "DGEList"
## $counts
## ETV6-RUNX1_1 ETV6-RUNX1_2 ETV6-RUNX1_3 ETV6-RUNX1_4 PBMMC_1 PBMMC_3
## DPM1 236 72 34 109 72 32
## SCYL3 26 9 7 25 7 4
## C1orf112 38 17 6 22 23 15
## FGR 1 0 0 1 1 0
## CFH 0 0 0 0 0 2
## PBMMC_4
## DPM1 64
## SCYL3 2
## C1orf112 19
## FGR 5
## CFH 4
## 14966 more rows ...
##
## $samples
## group lib.size norm.factors batch SampleName SampleGroup
## ETV6-RUNX1_1 1 2884815 1 ETV6-RUNX1_1 ETV6-RUNX1_1 ETV6-RUNX1
## ETV6-RUNX1_2 1 766397 1 ETV6-RUNX1_2 ETV6-RUNX1_2 ETV6-RUNX1
## ETV6-RUNX1_3 1 355256 1 ETV6-RUNX1_3 ETV6-RUNX1_3 ETV6-RUNX1
## ETV6-RUNX1_4 1 1479948 1 ETV6-RUNX1_4 ETV6-RUNX1_4 ETV6-RUNX1
## PBMMC_1 1 1247801 1 PBMMC_1 PBMMC_1 PBMMC
## PBMMC_3 1 439670 1 PBMMC_3 PBMMC_3 PBMMC
## PBMMC_4 1 582237 1 PBMMC_4 PBMMC_4 PBMMC
## clusters.mnn label sample ncells
## ETV6-RUNX1_1 c7 c7 ETV6-RUNX1_1 375
## ETV6-RUNX1_2 c7 c7 ETV6-RUNX1_2 189
## ETV6-RUNX1_3 c7 c7 ETV6-RUNX1_3 73
## ETV6-RUNX1_4 c7 c7 ETV6-RUNX1_4 329
## PBMMC_1 c7 c7 PBMMC_1 214
## PBMMC_3 c7 c7 PBMMC_3 47
## PBMMC_4 c7 c7 PBMMC_4 106
A typical step in bulk RNA-seq data analyses is to remove samples with very low library sizes due to failed library preparation or sequencing. The very low counts in these samples can be troublesome in downstream steps such as normalization or for some statistical approximations used in the DE analysis. In our situation, this is equivalent to removing label-sample combinations that have very few or lowly-sequenced cells. The exact definition of “very low” will vary, but in this case, we remove combinations containing fewer than 20 cells (Crowell et al. 2019).
discarded <- current$ncells < 20
y <- y[,!discarded]
summary(discarded)
## Mode FALSE
## logical 7
Another typical step in bulk RNA-seq analyses is to remove genes that are lowly expressed. This reduces computational work, improves the accuracy of mean-variance trend modelling and decreases the severity of the multiple testing correction. Genes are discarded if they are not expressed above a log-CPM threshold in a minimum number of samples (determined from the size of the smallest treatment group in the experimental design).
keep <- filterByExpr(y, group=current$SampleGroup)
y <- y[keep,]
summary(keep)
## Mode FALSE TRUE
## logical 7618 7353
Finally, we correct for composition biases by computing normalization factors with the trimmed mean of M-values method (Robinson and Oshlack 2010). Counts for our pseudo-bulk samples are large enough to apply bulk normalization methods.
y <- calcNormFactors(y)
y$samples
As part of the usual diagnostics for a bulk RNA-seq DE analysis, we generate a mean-difference (MD) plot for each normalized pseudo-bulk profile. This should exhibit a trumpet shape centered at zero indicating that the normalization successfully removed systematic bias between profiles. Lack of zero-centering or dominant discrete patterns at low abundances may be symptomatic of deeper problems with normalization, possibly due to insufficient cells/reads/UMIs composing a particular pseudo-bulk profile.
par(mfrow=c(2,4))
for (i in seq_len(ncol(y))) {
plotMD(y, column=i)
}
We also generate a multi-dimensional scaling (MDS) plot for the pseudo-bulk profiles. This is closely related to PCA and allows us to visualize the structure of the data. Here, the aim is to check whether samples separate by our known factors of interest. Strong separation foreshadows a large number of DEGs in the subsequent analysis.
plotMDS(cpm(y, log=TRUE),
col = as.numeric(y$samples$SampleGroup)
)
Our aim is to test whether the log-fold change between sample groups is significantly different from zero.
design <- model.matrix(~factor(SampleGroup), y$samples)
design
## (Intercept) factor(SampleGroup)PBMMC
## ETV6-RUNX1_1 1 0
## ETV6-RUNX1_2 1 0
## ETV6-RUNX1_3 1 0
## ETV6-RUNX1_4 1 0
## PBMMC_1 1 1
## PBMMC_3 1 1
## PBMMC_4 1 1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$`factor(SampleGroup)`
## [1] "contr.treatment"
We estimate the negative binomial (NB) dispersions with estimateDisp(). The role of the NB dispersion is to model the mean-variance trend, which is not easily accommodated by QL dispersions alone due to the quadratic nature of the NB mean-variance trend.
y <- estimateDisp(y, design)
summary(y$trended.dispersion)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.07910 0.08661 0.09343 0.09533 0.10294 0.12957
Biological coefficient of variation (BCV) for each gene as a function of the average abundance. The BCV is computed as the square root of the NB dispersion after empirical Bayes shrinkage towards the trend. Trended and common BCV estimates are shown in blue and red, respectively.
plotBCV(y)
We also estimate the quasi-likelihood dispersions with glmQLFit() (Chen, Lun, and Smyth 2016). This fits a GLM to the counts for each gene and estimates the QL dispersion from the GLM deviance. We set robust=TRUE to avoid distortions from highly variable clusters (Phipson et al. 2016). The QL dispersion models the uncertainty and variability of the per-gene variance - which is not well handled by the NB dispersions, so the two dispersion types complement each other in the final analysis.
fit <- glmQLFit(y, design, robust=TRUE)
summary(fit$var.prior)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.5429 0.6308 0.7135 0.7143 0.7744 1.3459
summary(fit$df.prior)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2655 8.7182 8.7182 8.3484 8.7182 8.7182
QL dispersion estimates for each gene as a function of abundance. Raw estimates (black) are shrunk towards the trend (blue) to yield squeezed estimates (red).
plotQLDisp(fit)
We test for differences in expression due to sample group using glmQLFTest(). DEGs are defined as those with non-zero log-fold changes at a false discovery rate of 5%. If very few genes are significantly DE that sample group has little effect on the transcriptome.
res <- glmQLFTest(fit, coef=ncol(design))
summary(decideTests(res))
## factor(SampleGroup)PBMMC
## Down 685
## NotSig 6148
## Up 520
topTags(res)$table
Now that we have laid out the theory underlying the DE analysis, we repeat this process for each of the labels. This is conveniently done using the pseudoBulkDGE
function from scran
, which will loop over all labels and apply the exact analysis described above to each label. To prepare for this, we filter out all sample-label combinations with insufficient cells.
summed.filt <- summed[,summed$ncells >= 20]
We construct a common design matrix that will be used in the analysis for each label. Recall that this matrix should have one row per unique sample (and named as such), reflecting the fact that we are modelling counts on the sample level instead of the cell level.
# Pulling out a sample-level 'targets' data.frame:
targets <- colData(merged)[!duplicated(merged$SampleName),] %>%
data.frame() %>%
select(-clusters.mnn)
# Constructing the design matrix:
design <- model.matrix(~factor(SampleGroup), data=targets)
rownames(design) <- targets$SampleName
We then apply the pseudoBulkDGE
function to obtain a list of DE genes for each label. This function puts some additional effort into automatically dealing with labels that are not represented in all sample groups, for which a DE analysis between conditions is meaningless; or are not represented in a sufficient number of replicate samples to enable modelling of biological variability.
summed.filt$SampleGroup <- factor(summed.filt$SampleGroup)
de.results <- pseudoBulkDGE(summed.filt,
label = summed.filt$label,
design = ~SampleGroup,
coef = "SampleGroupPBMMC",
condition = summed.filt$SampleGroup
)
We examine the numbers of DEGs at a FDR of 5% for each label using the decideTestsPerLabel
function. Note that genes listed as NA were either filtered out as low-abundance genes for a given label’s analysis, or the comparison of interest was not possible for a particular label, e.g., due to lack of residual degrees of freedom or an absence of samples from both conditions.
is.de <- decideTestsPerLabel(de.results, threshold=0.05)
summarizeTestsPerLabel(is.de)
## -1 0 1 NA
## c1 428 3903 378 10262
## c10 138 682 135 14016
## c11 0 2549 3 12419
## c12 2 4529 6 10434
## c13 50 3862 21 11038
## c3 0 8494 0 6477
## c4 0 2629 2 12340
## c6 0 2346 2 12623
## c7 685 6148 520 7618
## c8 67 1909 48 12947
For each gene, we compute the percentage of cell types in which that gene is upregulated or downregulated. (Here, we consider a gene to be non-DE if it is not retained after filtering.).
# Upregulated across most cell types.
up.de <- is.de > 0 & !is.na(is.de)
head(sort(rowMeans(up.de), decreasing=TRUE), 10)
## LYZ S100A8 S100A9 IGKC IGLC2 RGS2 IGLL1 RSL24D1 HINT1 LRRC75A
## 0.7 0.7 0.6 0.5 0.4 0.3 0.3 0.3 0.3 0.3
# Downregulated across cell types.
down.de <- is.de < 0 & !is.na(is.de)
head(sort(rowMeans(down.de), decreasing=TRUE), 10)
## DNTT MDK SOCS2 CD99 CD74 ID3 TERF2 HHEX SLC35E3 MME
## 0.6 0.5 0.5 0.4 0.4 0.4 0.4 0.4 0.4 0.4
We further identify label-specific DE genes that are significant in our label of interest yet not DE in any other label. As hypothesis tests are not typically geared towards identifying genes that are not DE, we use an ad hoc approach where we consider a gene to be consistent with the null hypothesis for a label if it fails to be detected even at a generous FDR threshold of 50%.
remotely.de <- decideTestsPerLabel(de.results, threshold=0.5)
not.de <- remotely.de==0 | is.na(remotely.de)
# first cluster in is.de
cx <- colnames(is.de)[1]
cx
## [1] "c1"
other.labels <- setdiff(colnames(not.de), cx)
unique.degs <- is.de[,cx]!=0 & rowMeans(not.de[,other.labels])==1
unique.degs <- names(which(unique.degs))
head(unique.degs)
## [1] "TYMP" "SEC63" "ADAM28" "TSPAN32" "NDE1" "IGF2BP2"
# Choosing the top-ranked gene for inspection:
de.inspec <- list()
de.inspec[[cx]] <- de.results[[cx]]
de.inspec[[cx]] <- de.inspec[[cx]][order(de.inspec[[cx]]$PValue),]
de.inspec[[cx]] <- de.inspec[[cx]][rownames(de.inspec[[cx]]) %in% unique.degs,]
sizeFactors(summed.filt) <- NULL
plotExpression(logNormCounts(summed.filt),
features=rownames(de.inspec[[cx]])[1],
x="SampleGroup", colour_by="SampleGroup",
other_fields="label") +
facet_wrap(~label) +
ggtitle(glue::glue("{cx}: {rownames(de.inspec[[cx]])[1]}"))
We also list the labels that were skipped due to the absence of replicates or contrasts. If it is necessary to extract statistics in the absence of replicates, several strategies can be applied such as reducing the complexity of the model or using a predefined value for the NB dispersion. We refer readers to the edgeR user’s guide for more details.
print(metadata(de.results)$failed)
## [1] "c14" "c2" "c9"
Identify label-specific DE genes that are significant in ‘c10’ yet not DE in any other label.
Plot the top-ranked gene for inspection.
In a DA analysis, we test for significant changes in per-label cell abundance across conditions. This will reveal which cell types are depleted or enriched upon treatment, which is arguably just as interesting as changes in expression within each cell type. The DA analysis has a long history in flow cytometry (Finak et al. 2014; Lun, Richard, and Marioni 2017) where it is routinely used to examine the effects of different conditions on the composition of complex cell populations. By performing it here, we effectively treat scRNA-seq as a “super-FACS” technology for defining relevant subpopulations using the entire transcriptome.
We prepare for the DA analysis by quantifying the number of cells assigned to each label (or cluster).
abundances <- table(merged$clusters.mnn, merged$SampleName)
abundances <- unclass(abundances)
head(abundances)
##
## ETV6-RUNX1_1 ETV6-RUNX1_2 ETV6-RUNX1_3 ETV6-RUNX1_4 PBMMC_1 PBMMC_3
## c1 159 6 1 6 82 26
## c2 310 94 67 140 7 2
## c3 275 465 191 275 17 9
## c4 35 324 213 81 6 1
## c5 0 4 1 2 10 16
## c6 3 13 108 25 55 112
##
## PBMMC_4
## c1 40
## c2 9
## c3 39
## c4 30
## c5 10
## c6 108
Performing the DA analysis
Our DA analysis will again be performed with the edgeR package. This allows us to take advantage of the NB GLM methods to model overdispersed count data in the presence of limited replication - except that the counts are not of reads per gene, but of cells per label (Lun, Richard, and Marioni 2017). The aim is to share information across labels to improve our estimates of the biological variability in cell abundance between replicates.
# Attaching some column metadata.
extra.info <- colData(merged)[match(colnames(abundances), merged$SampleName),]
y.ab <- DGEList(abundances, samples=extra.info)
y.ab
## An object of class "DGEList"
## $counts
##
## ETV6-RUNX1_1 ETV6-RUNX1_2 ETV6-RUNX1_3 ETV6-RUNX1_4 PBMMC_1 PBMMC_3
## c1 159 6 1 6 82 26
## c2 310 94 67 140 7 2
## c3 275 465 191 275 17 9
## c4 35 324 213 81 6 1
## c5 0 4 1 2 10 16
##
## PBMMC_4
## c1 40
## c2 9
## c3 39
## c4 30
## c5 10
## 9 more rows ...
##
## $samples
## group lib.size norm.factors batch SampleName SampleGroup
## ETV6-RUNX1_1 1 1200 1 ETV6-RUNX1_1 ETV6-RUNX1_1 ETV6-RUNX1
## ETV6-RUNX1_2 1 1200 1 ETV6-RUNX1_2 ETV6-RUNX1_2 ETV6-RUNX1
## ETV6-RUNX1_3 1 1200 1 ETV6-RUNX1_3 ETV6-RUNX1_3 ETV6-RUNX1
## ETV6-RUNX1_4 1 1200 1 ETV6-RUNX1_4 ETV6-RUNX1_4 ETV6-RUNX1
## PBMMC_1 1 1200 1 PBMMC_1 PBMMC_1 PBMMC
## PBMMC_3 1 1200 1 PBMMC_3 PBMMC_3 PBMMC
## PBMMC_4 1 1200 1 PBMMC_4 PBMMC_4 PBMMC
## clusters.mnn
## ETV6-RUNX1_1 c3
## ETV6-RUNX1_2 c2
## ETV6-RUNX1_3 c12
## ETV6-RUNX1_4 c3
## PBMMC_1 c5
## PBMMC_3 c12
## PBMMC_4 c14
We filter out low-abundance labels as previously described. This avoids cluttering the result table with very rare subpopulations that contain only a handful of cells. For a DA analysis of cluster abundances, filtering is generally not required as most clusters will not be of low-abundance (otherwise there would not have been enough evidence to define the cluster in the first place).
keep <- filterByExpr(y.ab, group=y.ab$samples$SampleGroup)
y.ab <- y.ab[keep,]
summary(keep)
## Mode FALSE TRUE
## logical 1 13
Unlike DE analyses, we do not perform an additional normalization step with calcNormFactors(). This means that we are only normalizing based on the “library size”, i.e., the total number of cells in each sample. Any changes we detect between conditions will subsequently represent differences in the proportion of cells in each cluster.
Here, the log-fold change in our model refers to the change in cell abundance between sample groups, rather than the change in gene expression.
design <- model.matrix(~factor(SampleGroup), y.ab$samples)
We use the estimateDisp() function to estimate the NB dispersion for each cluster. We turn off the trend as we do not have enough points for its stable estimation.
y.ab <- estimateDisp(y.ab, design, trend="none")
summary(y.ab$common.dispersion)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.076 1.076 1.076 1.076 1.076 1.076
plotBCV(y.ab, cex=1)
We repeat this process with the QL dispersion, again disabling the trend.
fit.ab <- glmQLFit(y.ab, design, robust=TRUE, abundance.trend=FALSE)
summary(fit.ab$var.prior)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.092 1.092 1.092 1.092 1.092 1.092
summary(fit.ab$df.prior)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 17.83 17.83 18.02 18.73 19.72 19.72
plotQLDisp(fit.ab, cex=1)
We test for differences in abundance between sample groups using glmQLFTest
.
res <- glmQLFTest(fit.ab, coef=ncol(design))
summary(decideTests(res))
## factor(SampleGroup)PBMMC
## Down 3
## NotSig 9
## Up 1
topTags(res)
## Coefficient: factor(SampleGroup)PBMMC
## logFC logCPM F PValue FDR
## c14 5.0094055 16.00434 14.6653030 0.0007783701 0.01011881
## c2 -4.6414988 16.21927 11.2802351 0.0025392625 0.01650521
## c3 -3.7909076 17.21815 8.6393052 0.0070335698 0.03047880
## c4 -3.7130043 16.35003 6.8559492 0.0153567008 0.04990928
## c5 2.6930222 12.72348 4.6963493 0.0400716306 0.10418624
## c10 2.2511488 16.13139 3.9266891 0.0587529878 0.12729814
## c13 2.1058646 15.15269 2.5430823 0.1245321723 0.23127403
## c6 1.2962896 15.66538 1.1786364 0.2888716225 0.46941639
## c11 1.1437557 15.60827 0.8637170 0.3624095505 0.52348046
## c7 -0.9804788 17.28613 0.7242925 0.4029100976 0.52378313
As mentioned above, we do not use calcNormFactors
in our default DA analysis. This normalization step assumes that most of the input features are not different between conditions. While this assumption is reasonable for most types of gene expression data, it is generally too strong for cell type abundance - most experiments consist of only a few cell types that may all change in abundance upon perturbation. Thus, our default approach is to only normalize based on the total number of cells in each sample, which means that we are effectively testing for differential proportions between conditions.
Unfortunately, the use of the total number of cells leaves us susceptible to composition effects. For example, a large increase in abundance for one cell subpopulation will introduce decreases in proportion for all other subpopulations - which is technically correct, but may be misleading if one concludes that those other subpopulations are decreasing in abundance of their own volition. If composition biases are proving problematic for interpretation of DA results, we have several avenues for removing them or mitigating their impact by leveraging a priori biological knowledge.
If it is possible to assume that most labels (i.e., cell types) do not change in abundance, we can use calcNormFactors
to compute normalization factors.
y.ab2 <- calcNormFactors(y.ab)
y.ab2$samples$norm.factors
## [1] 2.5263724 0.6368131 0.8356093 0.7037243 0.8092654 0.9521008 1.3718637
We then proceed with the remainder of the edgeR analysis, shown below in condensed format. A shift of positive log-fold changes towards zero is consistent with the removal of composition biases.
y.ab2 <- estimateDisp(y.ab2, design, trend="none")
fit.ab2 <- glmQLFit(y.ab2, design, robust=TRUE, abundance.trend=FALSE)
res2 <- glmQLFTest(fit.ab2, coef=ncol(design))
topTags(res2, n=10)
## Coefficient: factor(SampleGroup)PBMMC
## logFC logCPM F PValue FDR
## c14 4.6081006 15.98170 9.9083169 0.002497777 0.02655523
## c2 -4.5487946 16.06766 8.3915267 0.005152783 0.02655523
## c4 -4.4325139 16.75786 7.5645135 0.007729133 0.02655523
## c3 -4.2041857 17.46871 7.4524498 0.008170841 0.02655523
## c5 2.3099847 12.77380 2.6239687 0.110173217 0.28246003
## c10 2.0052191 16.17464 2.3480134 0.130366170 0.28246003
## c13 1.7271054 15.27023 1.5707757 0.214661457 0.39865699
## c1 1.3238033 14.87464 1.0004340 0.320968329 0.52157353
## c7 -0.9392892 17.34518 0.5036139 0.480491376 0.63425461
## c6 0.9208494 15.74608 0.4867877 0.487888162 0.63425461
Imagine ETV6-RUNX1_4 failed, leaving you with three ETV6-RUNX1 replicates … but all else remains as above, including the clusters identified.
Identify clusters whose abundance differ between conditions.
sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=C
## [4] LC_COLLATE=C LC_MONETARY=C LC_MESSAGES=C
## [7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=C LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] DT_0.18 patchwork_1.1.1
## [3] forcats_0.5.1 stringr_1.4.0
## [5] dplyr_1.0.7 purrr_0.3.4
## [7] readr_1.4.0 tidyr_1.1.3
## [9] tibble_3.1.2 tidyverse_1.3.1
## [11] edgeR_3.34.0 limma_3.48.1
## [13] batchelor_1.8.0 scran_1.20.1
## [15] scater_1.20.1 ggplot2_3.3.5
## [17] scuttle_1.2.0 SingleCellExperiment_1.14.1
## [19] SummarizedExperiment_1.22.0 Biobase_2.52.0
## [21] GenomicRanges_1.44.0 GenomeInfoDb_1.28.1
## [23] IRanges_2.26.0 S4Vectors_0.30.0
## [25] BiocGenerics_0.38.0 MatrixGenerics_1.4.0
## [27] matrixStats_0.59.0 knitr_1.33
##
## loaded via a namespace (and not attached):
## [1] Rtsne_0.15 ggbeeswarm_0.6.0
## [3] colorspace_2.0-2 ellipsis_0.3.2
## [5] bluster_1.2.1 XVector_0.32.0
## [7] BiocNeighbors_1.10.0 fs_1.5.0
## [9] rstudioapi_0.13 farver_2.1.0
## [11] RSpectra_0.16-0 fansi_0.5.0
## [13] lubridate_1.7.10 xml2_1.3.2
## [15] splines_4.1.0 sparseMatrixStats_1.4.0
## [17] jsonlite_1.7.2 ResidualMatrix_1.2.0
## [19] broom_0.7.8 cluster_2.1.2
## [21] dbplyr_2.1.1 uwot_0.1.10
## [23] pheatmap_1.0.12 compiler_4.1.0
## [25] httr_1.4.2 dqrng_0.3.0
## [27] backports_1.2.1 assertthat_0.2.1
## [29] Matrix_1.3-4 cli_3.0.1
## [31] BiocSingular_1.8.1 htmltools_0.5.1.1
## [33] tools_4.1.0 rsvd_1.0.5
## [35] igraph_1.2.6 gtable_0.3.0
## [37] glue_1.4.2 GenomeInfoDbData_1.2.6
## [39] Rcpp_1.0.7 cellranger_1.1.0
## [41] jquerylib_0.1.4 vctrs_0.3.8
## [43] DelayedMatrixStats_1.14.0 xfun_0.24
## [45] beachmat_2.8.0 rvest_1.0.0
## [47] lifecycle_1.0.0 irlba_2.3.3
## [49] statmod_1.4.36 zlibbioc_1.38.0
## [51] scales_1.1.1 hms_1.1.0
## [53] RColorBrewer_1.1-2 yaml_2.2.1
## [55] gridExtra_2.3 sass_0.4.0
## [57] stringi_1.7.3 highr_0.9
## [59] ScaledMatrix_1.0.0 BiocParallel_1.26.1
## [61] rlang_0.4.11 pkgconfig_2.0.3
## [63] bitops_1.0-7 evaluate_0.14
## [65] lattice_0.20-44 labeling_0.4.2
## [67] htmlwidgets_1.5.3 cowplot_1.1.1
## [69] tidyselect_1.1.1 magrittr_2.0.1
## [71] R6_2.5.0 generics_0.1.0
## [73] metapod_1.0.0 DelayedArray_0.18.0
## [75] DBI_1.1.1 pillar_1.6.1
## [77] haven_2.4.1 withr_2.4.2
## [79] RCurl_1.98-1.3 modelr_0.1.8
## [81] crayon_1.4.1 utf8_1.2.1
## [83] rmarkdown_2.9 viridis_0.6.1
## [85] locfit_1.5-9.4 grid_4.1.0
## [87] readxl_1.3.1 reprex_2.0.0
## [89] digest_0.6.27 munsell_0.5.0
## [91] beeswarm_0.4.0 viridisLite_0.4.0
## [93] vipor_0.4.5 bslib_0.2.5.1