projDir <- "/mnt/scratcha/bioinformatics/baller01/20200511_FernandesM_ME_crukBiSs2020"
outDirBit <- "AnaWiSce/Attempt1"
nbPcToComp <- 50
Source: Multi-sample comparisons of the OSCA book.
A powerful use of scRNA-seq technology lies in the design of replicated multi-condition experiments to detect changes in composition or expression between conditions. For example, a researcher could use this strategy to detect changes in cell type abundance after drug treatment (Richard et al. 2018) or genetic modifications (Scialdone et al. 2016). This provides more biological insight than conventional scRNA-seq experiments involving only one biological condition, especially if we can relate population changes to specific experimental perturbations.
Differential analyses of multi-condition scRNA-seq experiments can be broadly split into two categories - differential expression (DE) and differential abundance (DA) analyses. The former tests for changes in expression between conditions for cells of the same type that are present in both conditions, while the latter tests for changes in the composition of cell types (or states, etc.) between conditions.
We will use the data set comprising 11 samples (1000 cells per sample), integrated with fastMNN using a nested list of samples to define the merge order.
The differential analyses in this chapter will be predicated on many of the pre-processing steps covered previously. For brevity, we will not explicitly repeat them here, only noting that we have already merged cells from all samples into the same coordinate system and clustered the merged dataset to obtain a common partitioning across all samples.
Load the SCE object:
setName <- "caron"
# Read object in:
##setSuf <- "_1kCellPerSpl"
##tmpFn <- sprintf("%s/%s/Robjects/%s_sce_nz_postDeconv%s_clustered.Rds", projDir, outDirBit, setName, setSuf)
setSuf <- "_1kCps"
tmpFn <- sprintf("%s/%s/Robjects/%s_sce_nz_postDeconv%s_Fmwbl.Rds", projDir, outDirBit, setName, setSuf)
print(tmpFn)
[1] "/mnt/scratcha/bioinformatics/baller01/20200511_FernandesM_ME_crukBiSs2020/AnaWiSce/Attempt1/Robjects/caron_sce_nz_postDeconv_1kCps_Fmwbl.Rds"
if(!file.exists(tmpFn))
{
knitr::knit_exit()
}
sce <- readRDS(tmpFn)
sce
class: SingleCellExperiment
dim: 12317 11000
metadata(2): merge.info pca.info
assays(1): reconstructed
rownames(12317): ENSG00000000003 ENSG00000000457 ... ENSG00000285458
ENSG00000285476
rowData names(1): rotation
colnames: NULL
colData names(25): Sample Barcode ... type clusters.mnn
reducedDimNames(2): corrected TSNE
altExpNames(0):
A brief inspection of the results shows clusters contain varying contributions from batches:
library(scater)
colLabels(sce) <- sce$clusters.mnn
table(colLabels(sce), sce$type)
ETV6-RUNX1 HHD PBMMC PRE-T
c1 27 1 64 0
c10 103 48 225 80
c11 72 1 64 14
c12 24 0 3 8
c13 57 1 61 0
c14 1 1 43 3
c15 19 54 39 207
c16 272 278 71 16
c17 8 6 74 49
c18 80 2 109 8
c19 1 0 1 319
c2 8 5 19 8
c20 4 0 50 7
c21 389 141 22 0
c22 1154 362 35 1
c23 42 11 10 103
c24 144 61 238 45
c25 2 0 243 3
c26 345 75 545 97
c27 0 0 27 0
c28 10 12 28 11
c3 867 752 398 106
c4 23 1 57 0
c5 167 115 156 151
c6 12 26 86 689
c7 165 44 41 17
c8 2 3 212 50
c9 2 0 79 8
table(colLabels(sce), sce$Sample.Name2)
ETV6-RUNX1_1 ETV6-RUNX1_2 ETV6-RUNX1_3 ETV6-RUNX1_4 HHD_1 HHD_2 PBMMC_1
c1 1 0 10 16 0 1 15
c10 4 6 76 17 39 9 42
c11 1 3 14 54 0 1 1
c12 0 0 5 19 0 0 0
c13 1 2 12 42 1 0 2
c14 0 0 0 1 1 0 15
c15 15 0 1 3 34 20 21
c16 227 20 9 16 160 118 27
c17 4 1 2 1 6 0 59
c18 3 0 15 62 0 2 0
c19 0 0 1 0 0 0 1
c2 0 3 5 0 2 3 6
c20 0 0 3 1 0 0 11
c21 106 82 34 167 59 82 6
c22 426 309 96 323 133 229 8
c23 6 21 10 5 8 3 7
c24 1 19 104 20 54 7 40
c25 0 0 1 1 0 0 204
c26 3 16 277 49 57 18 109
c27 0 0 0 0 0 0 11
c28 0 0 1 9 11 1 3
c3 65 420 290 92 349 403 175
c4 0 0 9 14 0 1 1
c5 65 54 13 35 43 72 81
c6 8 1 2 1 18 8 60
c7 64 43 8 50 22 22 15
c8 0 0 1 1 3 0 45
c9 0 0 1 1 0 0 35
PBMMC_2 PBMMC_3 PRE-T_1 PRE-T_2
c1 43 6 0 0
c10 94 89 7 73
c11 60 3 1 13
c12 2 1 0 8
c13 56 3 0 0
c14 14 14 0 3
c15 7 11 162 45
c16 11 33 2 14
c17 1 14 40 9
c18 105 4 1 7
c19 0 0 18 301
c2 7 6 2 6
c20 30 9 0 7
c21 4 12 0 0
c22 9 18 0 1
c23 3 0 73 30
c24 86 112 8 37
c25 28 11 1 2
c26 217 219 7 90
c27 10 6 0 0
c28 6 19 1 10
c3 44 179 63 43
c4 47 9 0 0
c5 21 54 134 17
c6 13 13 465 224
c7 8 18 14 3
c8 54 113 1 49
c9 20 24 0 8
On the t-SNE plots below, cells are colored by type or by sample (‘batch of origin’). Cluster labels are superimposed at the median coordinates of the cells assigned to each cluster.
plotTSNE(sce, colour_by="type", text_by="label")
plotTSNE(sce, colour_by="Sample.Name2")
tmpFn <- sprintf("%s/%s/Robjects/%s_sce_nz_postDeconv%s_Fmwbl2.Rds", projDir, outDirBit, setName, setSuf)
tmpList <- readRDS(tmpFn)
chosen.hvgs <- tmpList$chosen.hvgs
rescaled.mbn <- tmpList$rescaled.mbn
uncorrected <- tmpList$uncorrected
colToKeep <- c("Run", "Sample.Name", "source_name", "block", "setName", "Sample.Name2")
colData(uncorrected) <- colData(uncorrected)[,colToKeep]
colData(uncorrected)[1:3,]
DataFrame with 3 rows and 6 columns
Run Sample.Name source_name block setName Sample.Name2
<character> <character> <factor> <factor> <character> <character>
1 SRR9264343 GSM3872434 ETV6-RUNX1 ETV6-RUNX1 Caron ETV6-RUNX1_1
2 SRR9264343 GSM3872434 ETV6-RUNX1 ETV6-RUNX1 Caron ETV6-RUNX1_1
3 SRR9264343 GSM3872434 ETV6-RUNX1 ETV6-RUNX1 Caron ETV6-RUNX1_1
#--- merging ---#
library(batchelor)
set.seed(01001001)
merged <- correctExperiments(uncorrected,
batch=uncorrected$Sample.Name2,
subset.row=chosen.hvgs,
PARAM=FastMnnParam(
merge.order=list( list(1,2,3,4), list(9,10,11), list(5,6), list(7,8) )
)
)
merged
class: SingleCellExperiment
dim: 12317 11000
metadata(2): merge.info pca.info
assays(3): reconstructed counts logcounts
rownames(12317): ENSG00000000003 ENSG00000000457 ... ENSG00000285458
ENSG00000285476
rowData names(12): rotation ensembl_gene_id ... detected gene_sparsity
colnames: NULL
colData names(7): batch Run ... setName Sample.Name2
reducedDimNames(1): corrected
altExpNames(0):
#--- clustering ---#
g <- buildSNNGraph(merged, use.dimred="corrected")
clusters <- igraph::cluster_louvain(g)
merged$clusters.mnn <- factor(paste0("c", clusters$membership))
#colLabels(merged) <- merged$clusters.mnn
#--- dimensionality-reduction ---#
merged <- runTSNE(merged, dimred="corrected", external_neighbors=TRUE)
merged <- runUMAP(merged, dimred="corrected", external_neighbors=TRUE)
library(scater)
table(merged$clusters.mnn, merged$block)
ABMMC ETV6-RUNX1 HHD PBMMC PRE-T
c1 0 1185 283 31 1
c10 0 208 4 232 11
c11 0 88 7 200 34
c12 0 890 737 420 124
c13 0 85 42 214 66
c2 0 142 61 230 46
c3 0 16 52 0 0
c4 0 257 352 157 565
c5 0 371 83 587 713
c6 0 339 152 239 210
c7 0 15 16 394 74
c8 0 403 211 77 155
c9 0 1 0 219 1
table(merged$clusters.mnn, merged$Sample.Name2)
ETV6-RUNX1_1 ETV6-RUNX1_2 ETV6-RUNX1_3 ETV6-RUNX1_4 HHD_1 HHD_2 PBMMC_1
c1 443 312 93 337 114 169 10
c10 5 5 41 157 1 3 3
c11 1 3 32 52 2 5 35
c12 55 447 294 94 292 445 186
c13 5 4 63 13 33 9 42
c2 1 19 102 20 54 7 35
c3 9 4 3 0 51 1 0
c4 221 14 13 9 203 149 91
c5 3 19 296 53 65 18 124
c6 127 93 25 94 59 93 133
c7 0 0 3 12 15 1 111
c8 130 80 35 158 111 100 38
c9 0 0 0 1 0 0 192
PBMMC_2 PBMMC_3 PRE-T_1 PRE-T_2
c1 9 12 1 0
c10 219 10 2 9
c11 133 32 1 33
c12 49 185 89 35
c13 87 85 7 59
c2 81 114 9 37
c3 0 0 0 0
c4 25 41 379 186
c5 231 232 203 510
c6 30 76 183 27
c7 105 178 2 72
c8 8 31 124 31
c9 23 4 0 1
plotTSNE(merged, colour_by="block", text_by="clusters.mnn")
plotTSNE(merged, colour_by="Sample.Name2")
The most obvious differential analysis is to look for changes in expression between conditions. We perform the DE analysis separately for each label. The actual DE testing is performed on “pseudo-bulk” expression profiles (Tung et al. 2017), generated by summing counts together for all cells with the same combination of label and sample. This leverages the resolution offered by single-cell technologies to define the labels, and combines it with the statistical rigor of existing methods for DE analyses involving a small number of samples.
# Using 'label' and 'sample' as our two factors; each column of the output
# corresponds to one unique combination of these two factors.
summed <- aggregateAcrossCells(merged,
id = DataFrame(
label=merged$clusters.mnn,
sample=merged$Sample.Name2
)
)
summed
class: SingleCellExperiment
dim: 12317 128
metadata(2): merge.info pca.info
assays(1): counts
rownames(12317): ENSG00000000003 ENSG00000000457 ... ENSG00000285458
ENSG00000285476
rowData names(12): rotation ensembl_gene_id ... detected gene_sparsity
colnames: NULL
colData names(11): batch Run ... sample ncells
reducedDimNames(3): corrected TSNE UMAP
altExpNames(0):
colData(summed) %>% head(3)
DataFrame with 3 rows and 11 columns
batch Run Sample.Name source_name block setName
<character> <character> <character> <integer> <integer> <character>
1 ETV6-RUNX1_1 SRR9264343 GSM3872434 2 2 Caron
2 ETV6-RUNX1_2 SRR9264344 GSM3872435 2 2 Caron
3 ETV6-RUNX1_3 SRR9264345 GSM3872436 2 2 Caron
Sample.Name2 clusters.mnn label sample ncells
<character> <integer> <factor> <character> <integer>
1 ETV6-RUNX1_1 1 c1 ETV6-RUNX1_1 443
2 ETV6-RUNX1_2 1 c1 ETV6-RUNX1_2 312
3 ETV6-RUNX1_3 1 c1 ETV6-RUNX1_3 93
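To make the pseudo-bulk idea concrete, here is a minimal base-R sketch on a toy matrix. The counts, labels and sample names are made up for illustration, and rowsum() stands in for what aggregateAcrossCells() does on the real object; it is not that function's implementation.

```r
# Toy counts: 3 genes x 6 cells (made-up values, for illustration only).
counts <- matrix(
  c(1, 0, 2,   0, 1, 1,   3, 0, 0,   2, 2, 1,   0, 0, 4,   1, 1, 1),
  nrow = 3,
  dimnames = list(paste0("gene", 1:3), paste0("cell", 1:6))
)
label  <- c("c1", "c1", "c2", "c1", "c2", "c2")  # hypothetical cluster labels
sample <- c("A",  "A",  "A",  "B",  "B",  "B")   # hypothetical sample names

# One pseudo-bulk profile per label-sample combination.
combo <- paste(label, sample, sep = "_")

# rowsum() sums rows by group, so we transpose to sum over cells.
pseudo <- t(rowsum(t(counts), group = combo))

# Number of cells contributing to each pseudo-bulk profile.
ncells <- table(combo)

pseudo
ncells
```

Each column of pseudo is the sum of the counts of all cells sharing one label and one sample, and ncells plays the role of the ncells column shown above.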
At this point, it is worth reflecting on the motivations behind the use of pseudo-bulking:
Larger counts are more amenable to standard DE analysis pipelines designed for bulk RNA-seq data. Normalization is more straightforward and certain statistical approximations are more accurate e.g., the saddlepoint approximation for quasi-likelihood methods or normality for linear models.
Collapsing cells into samples reflects the fact that our biological replication occurs at the sample level (Lun and Marioni 2017). Each sample is represented no more than once for each condition, avoiding problems from unmodelled correlations between samples. Supplying the per-cell counts directly to a DE analysis pipeline would imply that each cell is an independent biological replicate, which is not true from an experimental perspective. (A mixed effects model can handle this variance structure but involves extra statistical and computational complexity for little benefit, see Crowell et al. (2019).)
Variance between cells within each sample is masked, provided it does not affect variance across (replicate) samples. This avoids penalizing DEGs that are not uniformly up- or down-regulated for all cells in all samples of one condition. Masking is generally desirable as DEGs - unlike marker genes - do not need to have low within-sample variance to be interesting, e.g., if the treatment effect is consistent across replicate populations but heterogeneous on a per-cell basis. (Of course, high per-cell variability will still result in weaker DE if it affects the variability across populations, while homogeneous per-cell responses will result in stronger DE due to a larger population-level log-fold change. These effects are also largely desirable.)
The DE analysis will be performed using quasi-likelihood (QL) methods from the edgeR package (Robinson, McCarthy, and Smyth 2010; Chen, Lun, and Smyth 2016). This uses a negative binomial generalized linear model (NB GLM) to handle overdispersed count data in experiments with limited replication. In our case, we have biological variation with two to four replicate samples per sample group (four ETV6-RUNX1, two HHD, three PBMMC and two PRE-T samples), so edgeR (or its contemporaries) is a natural choice for the analysis.
We do not use all labels for GLM fitting as the strong DE between labels makes it difficult to compute a sensible average abundance to model the mean-dispersion trend. Moreover, label-specific batch effects would not be easily handled with a single additive term in the design matrix for the batch. Instead, we arbitrarily pick one of the labels to use for this demonstration.
labelToGet <- "c1"
current <- summed[,summed$label==labelToGet]
# Setting up a DGEList object for use in edgeR:
suppressMessages(library(edgeR))
y <- DGEList(counts(current), samples=colData(current))
y
An object of class "DGEList"
$counts
Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7 Sample8
ENSG00000000003 0 1 0 0 0 2 0 0
ENSG00000000457 17 13 4 11 3 5 0 0
ENSG00000000938 0 2 0 0 2 1 0 0
ENSG00000000971 0 0 0 0 0 0 0 0
ENSG00000001167 15 12 11 25 2 15 2 1
Sample9 Sample10
ENSG00000000003 0 0
ENSG00000000457 0 0
ENSG00000000938 0 0
ENSG00000000971 0 0
ENSG00000001167 0 0
12312 more rows ...
$samples
group lib.size norm.factors batch Run Sample.Name
Sample1 1 1062654 1 ETV6-RUNX1_1 SRR9264343 GSM3872434
Sample2 1 501796 1 ETV6-RUNX1_2 SRR9264344 GSM3872435
Sample3 1 153900 1 ETV6-RUNX1_3 SRR9264345 GSM3872436
Sample4 1 510069 1 ETV6-RUNX1_4 SRR9264346 GSM3872437
Sample5 1 339208 1 HHD_1 SRR9264347 GSM3872438
Sample6 1 590414 1 HHD_2 SRR9264348 GSM3872439
Sample7 1 15288 1 PBMMC_1 SRR9264351 GSM3872442
Sample8 1 11108 1 PBMMC_2 SRR9264353 GSM3872443
Sample9 1 9147 1 PBMMC_3 SRR9264354 GSM3872444
Sample10 1 1650 1 PRE-T_1 SRR9264349 GSM3872440
source_name block setName Sample.Name2 clusters.mnn label sample
Sample1 2 2 Caron ETV6-RUNX1_1 1 c1 ETV6-RUNX1_1
Sample2 2 2 Caron ETV6-RUNX1_2 1 c1 ETV6-RUNX1_2
Sample3 2 2 Caron ETV6-RUNX1_3 1 c1 ETV6-RUNX1_3
Sample4 2 2 Caron ETV6-RUNX1_4 1 c1 ETV6-RUNX1_4
Sample5 3 3 Caron HHD_1 1 c1 HHD_1
Sample6 3 3 Caron HHD_2 1 c1 HHD_2
Sample7 4 4 Caron PBMMC_1 1 c1 PBMMC_1
Sample8 4 4 Caron PBMMC_2 1 c1 PBMMC_2
Sample9 4 4 Caron PBMMC_3 1 c1 PBMMC_3
Sample10 5 5 Caron PRE-T_1 1 c1 PRE-T_1
ncells
Sample1 443
Sample2 312
Sample3 93
Sample4 337
Sample5 114
Sample6 169
Sample7 10
Sample8 9
Sample9 12
Sample10 1
A typical step in bulk RNA-seq data analyses is to remove samples with very low library sizes due to failed library preparation or sequencing. The very low counts in these samples can be troublesome in downstream steps such as normalization (Chapter 7) or for some statistical approximations used in the DE analysis. In our situation, this is equivalent to removing label-sample combinations that have very few or lowly-sequenced cells. The exact definition of “very low” will vary, but in this case, we remove combinations containing fewer than 20 cells (Crowell et al. 2019). Alternatively, we could apply the outlier-based strategy described in Chapter 6, but this makes the strong assumption that all label-sample combinations have similar numbers of cells that are sequenced to similar depth.
discarded <- current$ncells < 20
y <- y[,!discarded]
summary(discarded)
Mode FALSE TRUE
logical 6 4
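The two filtering strategies mentioned above can be compared with a small base-R sketch. The cell numbers and library sizes are copied from the DGEList printed earlier for cluster c1; the outlier rule below (more than 3 median absolute deviations under the median of the log-library sizes) is a rough stand-in for an outlier-based filter, not the exact code of any package.

```r
# Cell numbers and library sizes for the ten c1 pseudo-bulk samples,
# copied from the DGEList printed above.
ncells <- c(443, 312, 93, 337, 114, 169, 10, 9, 12, 1)
lib.size <- c(1062654, 501796, 153900, 510069, 339208,
              590414, 15288, 11108, 9147, 1650)

# Strategy 1: fixed threshold on the number of cells.
discarded <- ncells < 20
table(discarded)

# Strategy 2 (assumed rule): flag library sizes that are unusually low
# on the log scale, i.e. more than 3 MADs below the median.
log.lib <- log10(lib.size)
lower <- median(log.lib) - 3 * mad(log.lib)
low.outlier <- log.lib < lower
table(low.outlier)
```

With these particular values the fixed threshold removes four samples, while the outlier rule flags none because the library sizes are spread so widely that the MAD is large. This illustrates why the outlier approach relies on the combinations being broadly comparable in size.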
Another typical step in bulk RNA-seq analyses is to remove genes that are lowly expressed. This reduces computational work, improves the accuracy of mean-variance trend modelling and decreases the severity of the multiple testing correction. Genes are discarded if they are not expressed above a log-CPM threshold in a minimum number of samples (determined from the size of the smallest treatment group in the experimental design).
keep <- filterByExpr(y, group=current$source_name)
y <- y[keep,]
summary(keep)
Mode FALSE TRUE
logical 6230 6087
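The filtering rule described above can be sketched in a few lines of base R. This is a simplified illustration under stated assumptions (a count threshold of 10, toy counts, equal library sizes); it is not edgeR's implementation of filterByExpr(), which applies further checks.

```r
# Toy counts: 4 genes x 4 samples (made-up values).
counts <- matrix(
  c(20, 15, 0, 0,
    12,  0, 0, 0,
     0,  0, 0, 0,
     5,  5, 5, 5),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("s", 1:4))
)
lib.size <- rep(1e6, 4)                 # hypothetical library sizes
group <- factor(c("a", "a", "b", "b"))  # groups of the retained samples only

min.count <- 10                         # assumed count threshold

# Counts per million and the CPM value matching 'min.count'
# in a library of median size.
cpm <- t(t(counts) / lib.size) * 1e6
cpm.cutoff <- min.count / median(lib.size) * 1e6

# A gene must pass the cutoff in at least as many samples
# as there are in the smallest group.
min.samples <- min(table(group))
keep <- rowSums(cpm >= cpm.cutoff) >= min.samples
keep
```

Because the minimum number of samples is derived from the group sizes, the grouping vector should describe only the samples still present in the object at that point.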
Finally, we correct for composition biases by computing normalization factors with the trimmed mean of M-values method (Robinson and Oshlack 2010). We do not need the bespoke single-cell methods described in Chapter 7, as the counts for our pseudo-bulk samples are large enough to apply bulk normalization methods. (Readers should be aware that edgeR normalization factors are closely related but not the same as the size factors described elsewhere in this book.)
y <- calcNormFactors(y)
y$samples
Our aim is to test whether the log-fold change between sample groups is significantly different from zero.
design <- model.matrix(~factor(source_name), y$samples)
design
(Intercept) factor(source_name)3
Sample1 1 0
Sample2 1 0
Sample3 1 0
Sample4 1 0
Sample5 1 1
Sample6 1 1
attr(,"assign")
[1] 0 1
attr(,"contrasts")
attr(,"contrasts")$`factor(source_name)`
[1] "contr.treatment"
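As a small illustration of how this design matrix is read, the sketch below rebuilds it from a hypothetical sample table with the same group codes (2 and 3) as the six retained samples. Only base R is used.

```r
# Hypothetical sample table: four samples in group 2, two in group 3.
samples <- data.frame(source_name = c(2, 2, 2, 2, 3, 3))
design <- model.matrix(~factor(source_name), samples)

# The first factor level (2) is absorbed into the intercept.
# The remaining column is the log-fold change of group 3 over group 2.
colnames(design)

# Testing the last column therefore compares group 3 with group 2.
coef.to.test <- ncol(design)
coef.to.test
```

With more than two groups in the design, the last column would compare only the last factor level with the reference level.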
We estimate the negative binomial (NB) dispersions with estimateDisp(). The role of the NB dispersion is to model the mean-variance trend, which is not easily accommodated by QL dispersions alone due to the quadratic nature of the NB mean-variance trend.
y <- estimateDisp(y, design)
summary(y$trended.dispersion)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.1404 0.1495 0.1599 0.1908 0.1919 0.5137
Biological coefficient of variation (BCV) for each gene as a function of the average abundance. The BCV is computed as the square root of the NB dispersion after empirical Bayes shrinkage towards the trend. Trended and common BCV estimates are shown in blue and red, respectively.
plotBCV(y)
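The link between the NB dispersion and the BCV can be checked with simple arithmetic. The sketch below uses the median trended dispersion reported above; the helper nb_var() is a name introduced here for illustration only.

```r
# NB variance for a count with mean 'mu' and dispersion 'phi':
# a Poisson part (mu) plus a quadratic biological part (phi * mu^2).
nb_var <- function(mu, phi) mu + phi * mu^2

phi <- 0.1599      # median trended dispersion from the summary above
bcv <- sqrt(phi)   # biological coefficient of variation
bcv

# Variance of a gene with an expected count of 100 at this dispersion.
nb_var(100, phi)
```

A BCV of about 0.4 means that the true abundance of a gene varies by roughly 40% between replicate samples, which is why the quadratic term dominates the variance for all but the lowest counts.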
We also estimate the quasi-likelihood dispersions with glmQLFit() (Chen, Lun, and Smyth 2016). This fits a GLM to the counts for each gene and estimates the QL dispersion from the GLM deviance. We set robust=TRUE to avoid distortions from highly variable clusters (Phipson et al. 2016). The QL dispersion models the uncertainty and variability of the per-gene variance - which is not well handled by the NB dispersions, so the two dispersion types complement each other in the final analysis.
fit <- glmQLFit(y, design, robust=TRUE)
summary(fit$var.prior)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.5445 0.5837 0.6524 0.7272 0.7564 2.2279
summary(fit$df.prior)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.803 6.323 6.323 6.071 6.323 6.323
QL dispersion estimates for each gene as a function of abundance. Raw estimates (black) are shrunk towards the trend (blue) to yield squeezed estimates (red).
plotQLDisp(fit)
We test for differences in expression due to sample group using glmQLFTest(). DEGs are defined as those with non-zero log-fold changes at a false discovery rate of 5%. If very few genes are significantly DE, this suggests that the sample group has little effect on the transcriptome.
res <- glmQLFTest(fit, coef=ncol(design))
summary(decideTests(res))
factor(source_name)3
Down 132
NotSig 5784
Up 171
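The Down/NotSig/Up tally can be mimicked in base R. The sketch assumes a Benjamini-Hochberg adjustment and a 5% threshold, and uses made-up p-values and log-fold changes; it approximates the logic only and is not the decideTests() code.

```r
# Made-up p-values and log-fold changes for five genes.
pval  <- c(g1 = 0.001, g2 = 0.01, g3 = 0.02, g4 = 0.2, g5 = 0.8)
logfc <- c(g1 = 2, g2 = -1.5, g3 = 0.5, g4 = -0.2, g5 = 0.1)

# Benjamini-Hochberg adjusted p-values (false discovery rates).
fdr <- p.adjust(pval, method = "BH")

# Call a gene up (1) or down (-1) if significant, otherwise 0.
calls <- ifelse(fdr < 0.05, sign(logfc), 0)

table(factor(calls, levels = c(-1, 0, 1),
             labels = c("Down", "NotSig", "Up")))
```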
topTab <- topTags(res)$table
tmpAnnot <- rowData(current)[,c("ensembl_gene_id","Symbol")] %>% data.frame
topTab %>% tibble::rownames_to_column("ensembl_gene_id") %>%
left_join(tmpAnnot, by="ensembl_gene_id")
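The steps illustrated above with cluster c1 are now repeated for each cluster. Two details differ in the loop: low-quality label-sample combinations are identified with isOutlier() on the log-transformed library sizes rather than with the fixed threshold of 20 cells, and glmQLFit() is run with its default settings.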
de.results <- list()
for (labelToGet in levels(summed$label)) {
current <- summed[,summed$label==labelToGet]
y <- DGEList(counts(current), samples=colData(current))
discarded <- isOutlier(colSums(counts(current)), log=TRUE, type="lower")
y <- y[,!discarded]
y <- y[filterByExpr(y, group=current$source_name),]
y <- calcNormFactors(y)
design <- try(
model.matrix(~factor(source_name), y$samples),
silent=TRUE
)
if (is(design, "try-error") ||
qr(design)$rank==nrow(design) ||
qr(design)$rank < ncol(design))
{
# Skipping labels without contrasts or without
# enough residual d.f. to estimate the dispersion.
next
}
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)
res <- glmQLFTest(fit, coef=ncol(design))
de.results[[labelToGet]] <- res
}
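The skip condition used in the loop can be isolated in a small helper. The function name can_test() and the toy group vectors are hypothetical; the checks mirror the ones above (a design that cannot be built, no residual degrees of freedom, or a rank-deficient design).

```r
# Returns TRUE if a one-factor design on 'groups' can be fitted and
# leaves residual degrees of freedom for dispersion estimation.
can_test <- function(groups) {
  design <- try(model.matrix(~factor(groups)), silent = TRUE)
  # model.matrix() fails when only one group is present.
  if (inherits(design, "try-error")) return(FALSE)
  design.rank <- qr(design)$rank
  # Need fewer coefficients than samples, and full column rank.
  design.rank < nrow(design) && design.rank == ncol(design)
}

can_test(c("a", "a", "b", "b"))  # two groups with replicates
can_test(c("a", "b"))            # two groups, no replicates
can_test(c("a", "a", "a"))       # a single group, no contrast
```

Only the first case leaves both a contrast to test and replicates from which to estimate variability.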
We examine the numbers of DEGs at a FDR of 5% for each label (i.e. cluster). In general, there seems to be very little differential expression between the sample groups.
summaries <- lapply(de.results, FUN=function(x) summary(decideTests(x))[,1])
sum.tab <- do.call(rbind, summaries)
#sum.tab
sum.tab[order(rownames(sum.tab)),] %>%
as.data.frame() %>%
tibble::rownames_to_column("Cluster") %>%
datatable(rownames = FALSE, options = list(pageLength = 20, scrollX = TRUE))
We now list DEGs and the number of clusters they were detected in:
degs <- lapply(de.results, FUN=function(x) rownames(topTags(x, p.value=0.05)))
common.degs <- sort(table(unlist(degs)), decreasing=TRUE)
#head(common.degs, 20)
common.degs %>%
as.data.frame %>%
dplyr::rename(Gene = Var1, NbClu = Freq) %>%
datatable(rownames = FALSE, options = list(pageLength = 20, scrollX = TRUE))
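The tally of DEGs across clusters relies on a common base-R idiom, shown here on a toy list with made-up gene and cluster names.

```r
# Hypothetical DEG lists for three clusters.
degs.toy <- list(
  c1 = c("geneA", "geneB"),
  c2 = c("geneB", "geneC"),
  c3 = c("geneB", "geneA")
)

# Pool all gene names, count occurrences, and rank by frequency.
common.toy <- sort(table(unlist(degs.toy)), decreasing = TRUE)
common.toy
```

Each count is the number of clusters in which the gene was called DE, since a gene appears at most once per cluster.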
We also list the labels that were skipped due to the absence of replicates or contrasts. If it is necessary to extract statistics in the absence of replicates, several strategies can be applied such as reducing the complexity of the model or using a predefined value for the NB dispersion. We refer readers to the edgeR user’s guide for more details.
skippedClusters <- setdiff(unique(summed$label), names(summaries))
The number of clusters skipped is 0.
if(length(skippedClusters)>0)
{
skippedClusters
}
grmToShowList <- vector("list", length = nlevels(merged$clusters.mnn))
names(grmToShowList) <- levels(merged$clusters.mnn)
genesToExclude <- c()
nbGeneToShow <- 20
#degs <- lapply(de.results, FUN=function(x) (topTags(x, p.value=0.05)))
degs <- lapply(de.results, FUN=function(x) (as.data.frame(topTags(x, n=nbGeneToShow))))
for( namex in levels(merged$clusters.mnn) )
{
nbGeneToUse <- min(c(nrow(degs[[namex]]), nbGeneToShow))
# format
# format p value:
tmpCol <- grep("PValue|FDR", colnames(degs[[namex]]), value=TRUE)
degs[[namex]][,tmpCol] <- apply(degs[[namex]][,tmpCol],
2,
function(x){format(x, scientific = TRUE, digits = 1)})
# format logFC:
tmpCol <- c("logFC", "logCPM", "F")
degs[[namex]][,tmpCol] <- apply(degs[[namex]][,tmpCol], 2, function(x){round(x, 2)})
rm(tmpCol)
# subset data
grmToShow <- degs[[namex]] %>%
as.data.frame() %>%
tibble::rownames_to_column("gene") %>%
arrange(FDR, desc(abs(logFC))) %>%
filter(! gene %in% genesToExclude) %>%
group_modify(~ head(.x, nbGeneToUse))
# keep data
grmToShow$cluster <- namex
grmToShowList[[namex]] <- grmToShow
# tidy
rm(nbGeneToUse)
}
grmToShowDf <- do.call("rbind", grmToShowList)
tmpCol <- c("cluster", "gene")
grmToShowDf %>%
select(all_of(tmpCol), setdiff(colnames(grmToShowDf), tmpCol)) %>%
filter(gene %in% names(common.degs) & as.numeric(FDR) < 0.05) %>%
datatable(rownames = FALSE, filter="top", options=list(scrollX = TRUE, pageLength = 15))
tmpBool <- as.numeric(grmToShowDf$FDR) < 0.05
markers.to.plot <- unique(grmToShowDf[tmpBool, "gene"])
markers.to.plot <- markers.to.plot[1:5]
Now that we have laid out the theory underlying the DE analysis, we repeat this process for each of the labels. This is conveniently done using the pseudoBulkDGE() function from scran, which will loop over all labels and apply the exact analysis described above to each label. To prepare for this, we filter out all sample-label combinations with insufficient cells.
summed.filt <- summed[,summed$ncells >= 20]
We construct a common design matrix that will be used in the analysis for each label. Recall that this matrix should have one row per unique sample (and named as such), reflecting the fact that we are modelling counts on the sample level instead of the cell level.
# Pulling out a sample-level 'targets' data.frame:
targets <- colData(merged)[!duplicated(merged$Sample.Name2),]
# Constructing the design matrix:
design <- model.matrix(~factor(source_name), data=targets)
rownames(design) <- targets$Sample.Name2
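The step from cell-level metadata to a sample-level design can be sketched with a toy data frame. The sample and group names below are made up; only the pattern (keep the first row of each sample, then name the design rows by sample) follows the code above.

```r
# Hypothetical cell-level metadata: six cells from three samples.
cells <- data.frame(
  Sample.Name2 = c("S1", "S1", "S2", "S2", "S3", "S3"),
  source_name  = c("ctrl", "ctrl", "ctrl", "ctrl", "treated", "treated")
)

# Keep one row per sample.
targets.toy <- cells[!duplicated(cells$Sample.Name2), ]

# Build the design and name its rows after the samples, so that
# pseudo-bulk columns can be matched to design rows by name.
design.toy <- model.matrix(~factor(source_name), data = targets.toy)
rownames(design.toy) <- targets.toy$Sample.Name2
design.toy
```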
We then apply the pseudoBulkDGE() function to obtain a list of DE genes for each label. This function puts some additional effort into automatically dealing with labels that are not represented in all sample groups, for which a DE analysis between conditions is meaningless; or are not represented in a sufficient number of replicate samples to enable modelling of biological variability.
library(scran)
de.results <- pseudoBulkDGE(summed.filt,
sample=summed.filt$Sample.Name2,
label=summed.filt$label,
design=design,
coef=ncol(design),
# 'condition' sets the group size for filterByExpr(),
# to perfectly mimic our previous manual analysis.
condition=targets$source_name
)
We examine the numbers of DEGs at a FDR of 5% for each label using the decideTestsPerLabel() function. Note that genes listed as NA were either filtered out as low-abundance genes for a given label’s analysis, or the comparison of interest was not possible for a particular label, e.g., due to lack of residual degrees of freedom or an absence of samples from both conditions.
is.de <- decideTestsPerLabel(de.results, threshold=0.05)
summarizeTestsPerLabel(is.de)
-1 0 1 NA
c1 0 0 0 12317
c10 0 0 0 12317
c11 0 0 0 12317
c12 307 2081 183 9746
c13 0 1320 1 10996
c2 2 1826 6 10483
c3 0 0 0 12317
c4 477 6330 266 5244
c5 212 4416 219 7470
c6 528 3361 337 8091
c7 0 0 0 12317
c8 486 3446 319 8066
c9 0 0 0 12317
For each gene, we compute the proportion of labels in which that gene is upregulated or downregulated. (Here, we consider a gene to be non-DE if it is not retained after filtering.)
# Upregulated across most cell types.
up.de <- is.de > 0 & !is.na(is.de)
head(sort(rowMeans(up.de), decreasing=TRUE), 10)
ENSG00000064886 ENSG00000066294 ENSG00000078687 ENSG00000103254 ENSG00000106018
0.3846154 0.3846154 0.3846154 0.3846154 0.3846154
ENSG00000133169 ENSG00000137731 ENSG00000138650 ENSG00000143641 ENSG00000143851
0.3846154 0.3846154 0.3846154 0.3846154 0.3846154
# Downregulated across cell types.
down.de <- is.de < 0 & !is.na(is.de)
head(sort(rowMeans(down.de), decreasing=TRUE), 10)
ENSG00000007312 ENSG00000065809 ENSG00000076641 ENSG00000081189 ENSG00000090238
0.3846154 0.3846154 0.3846154 0.3846154 0.3846154
ENSG00000100242 ENSG00000100721 ENSG00000105369 ENSG00000115652 ENSG00000118523
0.3846154 0.3846154 0.3846154 0.3846154 0.3846154
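The values above are proportions over the 13 labels, with filtered genes counted as non-DE. A toy matrix with made-up calls shows the computation.

```r
# Toy matrix of DE calls: 2 genes x 13 labels, NA meaning 'not tested'.
labels <- paste0("c", 1:13)
is.de.toy <- matrix(NA_integer_, nrow = 2, ncol = 13,
                    dimnames = list(c("geneA", "geneB"), labels))
is.de.toy["geneA", 1:5] <- 1L    # up in five labels
is.de.toy["geneA", 6:7] <- 0L    # tested but not DE in two labels
is.de.toy["geneB", 1:3] <- -1L   # down in three labels

# NA entries become FALSE, so untested labels count as non-DE.
up.toy <- is.de.toy > 0 & !is.na(is.de.toy)
rowMeans(up.toy)
```

A gene that is upregulated in five of 13 labels scores 5/13, which rounds to 0.3846154.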
We further identify label-specific DE genes that are significant in our label of interest yet not DE in any other label. As hypothesis tests are not typically geared towards identifying genes that are not DE, we use an ad hoc approach where we consider a gene to be consistent with the null hypothesis for a label if it fails to be detected even at a generous FDR threshold of 50%.
remotely.de <- decideTestsPerLabel(de.results, threshold=0.5)
not.de <- remotely.de==0 | is.na(remotely.de)
# Label of interest for this demonstration:
labelOfInterest <- "c4"
other.labels <- setdiff(colnames(not.de), labelOfInterest)
unique.degs <- is.de[,labelOfInterest]!=0 & rowMeans(not.de[,other.labels])==1
unique.degs <- names(which(unique.degs))
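The label-specific selection can be traced on a toy example with three genes and three labels. All calls are made up; the matrices stand for the calls at the strict (5%) and generous (50%) thresholds.

```r
labels <- c("c2", "c4", "c5")
genes <- c("g1", "g2", "g3")

# Calls at the strict threshold.
strict <- matrix(c(1,  1, 0,
                   0, -1, 0,
                   0,  1, NA),
                 nrow = 3, byrow = TRUE, dimnames = list(genes, labels))

# Calls at the generous threshold: g1 now also shows up in c5.
generous <- matrix(c(1,  1, 1,
                     0, -1, 0,
                     0,  1, NA),
                   nrow = 3, byrow = TRUE, dimnames = list(genes, labels))

# A gene is 'not DE' in a label if it is not even remotely significant
# there, or was not tested at all.
not.de.toy <- generous == 0 | is.na(generous)

other <- setdiff(labels, "c4")
unique.toy <- strict[, "c4"] != 0 & rowMeans(not.de.toy[, other]) == 1
names(which(unique.toy))
```

Gene g1 is excluded because it is DE elsewhere, while g2 and g3 are retained as specific to c4.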
# Choosing the top-ranked gene for inspection:
de.c4 <- de.results$c4
de.c4 <- de.c4[order(de.c4$PValue),]
de.c4 <- de.c4[rownames(de.c4) %in% unique.degs,]
sizeFactors(summed.filt) <- NULL
plotExpression(logNormCounts(summed.filt),
features=rownames(de.c4)[1],
x="source_name", colour_by="source_name",
other_fields="label") +
facet_wrap(~label)
The labels that pseudoBulkDGE() skipped due to the absence of replicates or contrasts are recorded in the metadata of its output:
print(metadata(de.results)$failed)
[1] "c1" "c10" "c11" "c3" "c7" "c9"
In a DA analysis, we test for significant changes in per-label cell abundance across conditions. This will reveal which cell types are depleted or enriched upon treatment, which is arguably just as interesting as changes in expression within each cell type. The DA analysis has a long history in flow cytometry (Finak et al. 2014; Lun, Richard, and Marioni 2017) where it is routinely used to examine the effects of different conditions on the composition of complex cell populations. By performing it here, we effectively treat scRNA-seq as a “super-FACS” technology for defining relevant subpopulations using the entire transcriptome.
We prepare for the DA analysis by quantifying the number of cells assigned to each label (or cluster).
abundances <- table(merged$clusters.mnn, merged$Sample.Name2)
abundances <- unclass(abundances)
head(abundances)
ETV6-RUNX1_1 ETV6-RUNX1_2 ETV6-RUNX1_3 ETV6-RUNX1_4 HHD_1 HHD_2 PBMMC_1
c1 443 312 93 337 114 169 10
c10 5 5 41 157 1 3 3
c11 1 3 32 52 2 5 35
c12 55 447 294 94 292 445 186
c13 5 4 63 13 33 9 42
c2 1 19 102 20 54 7 35
PBMMC_2 PBMMC_3 PRE-T_1 PRE-T_2
c1 9 12 1 0
c10 219 10 2 9
c11 133 32 1 33
c12 49 185 89 35
c13 87 85 7 59
c2 81 114 9 37
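The same counting step on a toy example, with made-up cluster and sample assignments, also shows how the counts relate to per-sample proportions.

```r
# Hypothetical assignments for seven cells.
cluster <- c("c1", "c1", "c2", "c1", "c2", "c2", "c2")
sample  <- c("A",  "A",  "A",  "B",  "B",  "B",  "B")

# Cells per cluster (rows) and sample (columns), as a plain matrix.
abund.toy <- unclass(table(cluster, sample))
abund.toy

# Dividing each column by its total gives per-sample proportions.
prop.toy <- t(t(abund.toy) / colSums(abund.toy))
prop.toy
```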
Performing the DA analysis
Our DA analysis will again be performed with the edgeR package. This allows us to take advantage of the NB GLM methods to model overdispersed count data in the presence of limited replication - except that the counts are not of reads per gene, but of cells per label (Lun, Richard, and Marioni 2017). The aim is to share information across labels to improve our estimates of the biological variability in cell abundance between replicates.
# Attaching some column metadata.
extra.info <- colData(merged)[match(colnames(abundances), merged$Sample.Name2),]
y.ab <- DGEList(abundances, samples=extra.info)
y.ab
An object of class "DGEList"
$counts
ETV6-RUNX1_1 ETV6-RUNX1_2 ETV6-RUNX1_3 ETV6-RUNX1_4 HHD_1 HHD_2 PBMMC_1
c1 443 312 93 337 114 169 10
c10 5 5 41 157 1 3 3
c11 1 3 32 52 2 5 35
c12 55 447 294 94 292 445 186
c13 5 4 63 13 33 9 42
PBMMC_2 PBMMC_3 PRE-T_1 PRE-T_2
c1 9 12 1 0
c10 219 10 2 9
c11 133 32 1 33
c12 49 185 89 35
c13 87 85 7 59
8 more rows ...
$samples
group lib.size norm.factors batch Run Sample.Name
ETV6-RUNX1_1 1 1000 1 ETV6-RUNX1_1 SRR9264343 GSM3872434
ETV6-RUNX1_2 1 1000 1 ETV6-RUNX1_2 SRR9264344 GSM3872435
ETV6-RUNX1_3 1 1000 1 ETV6-RUNX1_3 SRR9264345 GSM3872436
ETV6-RUNX1_4 1 1000 1 ETV6-RUNX1_4 SRR9264346 GSM3872437
HHD_1 1 1000 1 HHD_1 SRR9264347 GSM3872438
source_name block setName Sample.Name2 clusters.mnn
ETV6-RUNX1_1 ETV6-RUNX1 ETV6-RUNX1 Caron ETV6-RUNX1_1 c6
ETV6-RUNX1_2 ETV6-RUNX1 ETV6-RUNX1 Caron ETV6-RUNX1_2 c8
ETV6-RUNX1_3 ETV6-RUNX1 ETV6-RUNX1 Caron ETV6-RUNX1_3 c5
ETV6-RUNX1_4 ETV6-RUNX1 ETV6-RUNX1 Caron ETV6-RUNX1_4 c8
HHD_1 HHD HHD Caron HHD_1 c7
6 more rows ...
We filter out low-abundance labels as previously described. This avoids cluttering the result table with very rare subpopulations that contain only a handful of cells. For a DA analysis of cluster abundances, filtering is generally not required as most clusters will not be of low-abundance (otherwise there would not have been enough evidence to define the cluster in the first place).
keep <- filterByExpr(y.ab, group=y.ab$samples$source_name)
y.ab <- y.ab[keep,]
summary(keep)
Mode FALSE TRUE
logical 1 12
Unlike DE analyses, we do not perform an additional normalization step with calcNormFactors(). This means that we are only normalizing based on the “library size”, i.e., the total number of cells in each sample. Any changes we detect between conditions will subsequently represent differences in the proportion of cells in each cluster. The motivation behind this decision is discussed in more detail in Section 14.4.3.
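To see what testing proportions implies, consider a toy example with made-up cell numbers in which only one population changes in absolute terms.

```r
# Hypothetical cell numbers per population in two conditions.
# Only population A changes in absolute number.
ctrl    <- c(A = 100, B = 100, C = 100)
treated <- c(A = 400, B = 100, C = 100)

# Normalizing by the total number of cells gives proportions.
prop.ctrl    <- ctrl / sum(ctrl)
prop.treated <- treated / sum(treated)

# Log2-fold changes in proportion between conditions.
lfc <- log2(prop.treated / prop.ctrl)
lfc
```

Populations B and C show a two-fold decrease in proportion even though their absolute numbers are unchanged, because the increase in A enlarges the total.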
Here, the log-fold change in our model refers to the change in cell abundance between sample groups, rather than the change in gene expression.
design <- model.matrix(~factor(source_name), y.ab$samples)
We use the estimateDisp() function to estimate the NB dispersion for each cluster. We turn off the trend as we do not have enough points for its stable estimation.
y.ab <- estimateDisp(y.ab, design, trend="none")
summary(y.ab$common.dispersion)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.017 1.017 1.017 1.017 1.017 1.017
plotBCV(y.ab, cex=1)
We repeat this process with the QL dispersion, again disabling the trend.
fit.ab <- glmQLFit(y.ab, design, robust=TRUE, abundance.trend=FALSE)
summary(fit.ab$var.prior)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.157 1.157 1.157 1.157 1.157 1.157
summary(fit.ab$df.prior)
Min. 1st Qu. Median Mean 3rd Qu. Max.
23.34 23.34 23.34 24.10 25.61 25.61
plotQLDisp(fit.ab, cex=1)
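We test for differences in abundance between sample groups using glmQLFTest(). With coef=ncol(design), the test applies to the last coefficient of the design matrix, here the PRE-T group relative to the reference level of source_name.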
res <- glmQLFTest(fit.ab, coef=ncol(design))
summary(decideTests(res))
factor(source_name)PRE-T
Down 1
NotSig 11
Up 0
topTags(res)
Coefficient: factor(source_name)PRE-T
logFC logCPM F PValue FDR
c1 -8.8893519 17.07235 18.23350328 0.000158277 0.001899324
c7 3.2601228 15.52573 4.83688825 0.035612506 0.213675037
c10 -3.2120505 15.39844 3.08504552 0.089104799 0.344105248
c4 2.1343167 16.90254 2.63766139 0.114701749 0.344105248
c5 1.9410456 17.29501 2.11215108 0.156398940 0.375357455
c12 -1.8413698 17.59928 1.71643024 0.199307039 0.398614078
c13 0.6319960 15.24543 0.22447446 0.639043019 0.809047739
c2 -0.6234366 15.46926 0.18697282 0.668505881 0.809047739
c9 0.7369656 14.42541 0.14094409 0.710044583 0.809047739
c8 -0.3779754 16.26214 0.08626566 0.770840636 0.809047739
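As a rough sanity check on the top hit, the log-fold change for c1 can be approximated from the raw counts shown earlier. The sketch assumes that ETV6-RUNX1 is the reference level and that every sample contributes 1000 cells; it is a crude ratio of mean proportions, not the GLM estimate.

```r
# c1 cell counts copied from the abundance table above.
etv6 <- c(443, 312, 93, 337)   # ETV6-RUNX1_1 to ETV6-RUNX1_4
pret <- c(1, 0)                # PRE-T_1 and PRE-T_2
total <- 1000                  # cells per sample

# Crude log2 ratio of the mean proportions in the two groups.
crude.lfc <- log2(mean(pret / total) / mean(etv6 / total))
crude.lfc
```

The crude value is close to, but not the same as, the logFC reported by edgeR, which is expected since the model-based estimate is obtained from the fitted GLM rather than from a simple ratio.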
As mentioned above, we do not use calcNormFactors() in our default DA analysis. This normalization step assumes that most of the input features are not different between conditions. While this assumption is reasonable for most types of gene expression data, it is generally too strong for cell type abundance - most experiments consist of only a few cell types that may all change in abundance upon perturbation. Thus, our default approach is to only normalize based on the total number of cells in each sample, which means that we are effectively testing for differential proportions between conditions.
Unfortunately, the use of the total number of cells leaves us susceptible to composition effects. For example, a large increase in abundance for one cell subpopulation will introduce decreases in proportion for all other subpopulations - which is technically correct, but may be misleading if one concludes that those other subpopulations are decreasing in abundance of their own volition. If composition biases are proving problematic for interpretation of DA results, we have several avenues for removing them or mitigating their impact by leveraging a priori biological knowledge.
Assuming most labels do not change
If it is possible to assume that most labels (i.e., cell types) do not change in abundance, we can use calcNormFactors() to compute normalization factors.
y.ab2 <- calcNormFactors(y.ab)
y.ab2$samples$norm.factors
[1] 0.7085578 0.8214020 0.8171844 1.0331737 0.8686328 0.5418219 1.7478390
[8] 1.8674296 1.8315815 0.8196523 0.8824318
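The normalization factors act by rescaling the library sizes. A short sketch with the factors printed above, assuming a library size of 1000 cells for each of the 11 samples, shows the resulting effective library sizes.

```r
# Normalization factors copied from the output above.
norm.factors <- c(0.7085578, 0.8214020, 0.8171844, 1.0331737,
                  0.8686328, 0.5418219, 1.7478390, 1.8674296,
                  1.8315815, 0.8196523, 0.8824318)
lib.size <- rep(1000, 11)   # cells per sample (assumed equal)

# Effective library size = library size x normalization factor.
eff.lib <- lib.size * norm.factors
round(eff.lib)

# Example: 443 cells of c1 in the first sample, relative to the raw
# and to the effective library size.
443 / lib.size[1]
443 / eff.lib[1]
```

Samples with a factor below 1 are treated as if they had fewer cells in total, so their clusters appear relatively more abundant, and the opposite holds for factors above 1.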
We then proceed with the remainder of the edgeR analysis, shown below in condensed format. A shift of positive log-fold changes towards zero is consistent with the removal of composition biases.
y.ab2 <- estimateDisp(y.ab2, design, trend="none")
fit.ab2 <- glmQLFit(y.ab2, design, robust=TRUE, abundance.trend=FALSE)
res2 <- glmQLFTest(fit.ab2, coef=ncol(design))
topTags(res2, n=10)
Coefficient: factor(source_name)PRE-T
logFC logCPM F PValue FDR
c1 -8.9890098 17.41093 15.15115665 0.0001981979 0.002378375
c7 3.3741853 14.98811 5.92828284 0.0170168369 0.102101021
c10 -3.0663655 15.07547 3.44448555 0.0669698626 0.267879450
c4 1.9304189 17.18145 2.13638867 0.1475711494 0.366624466
c5 1.9039062 17.24607 2.08203436 0.1527601941 0.366624466
c12 -1.8537335 17.86149 1.51313953 0.2220945660 0.444189132
c2 -0.6696831 15.29312 0.21973417 0.6404568093 0.835706180
c13 0.5688409 15.03125 0.17396711 0.6776739501 0.835706180
c9 0.8099331 13.74005 0.15287008 0.6967974881 0.835706180
c8 -0.3542759 16.51826 0.06443934 0.8002325040 0.835706180