Acknowledgments: much of the material in this section has been derived from the chapters on differential expression and abundance in the OSCA book and the Hemberg Group course materials. Additional material concerning miloR is based on the demonstration from the Marioni Lab.
A powerful use of scRNA-seq technology lies in the design of replicated multi-condition experiments to detect changes in composition or expression between conditions. This can provide more biological insight than conventional scRNA-seq experiments involving only one biological condition, especially if we can relate population changes to specific biological states.
Differential analyses of multi-condition scRNA-seq experiments can be broadly split into two categories - differential expression (DE) and differential abundance (DA) analyses. The former tests for changes in expression between conditions for cells of the same ‘type’ that are present in both conditions, while the latter tests for changes in the composition of cell types (or states, etc.) between conditions.
library(scater)
library(scran)
library(batchelor)
library(edgeR)
library(tidyverse)
library(patchwork)
library(DT)
library(bluster)
library(BiocParallel)
library(miloR)
bpp <- MulticoreParam(7)
Due to the nature of differential analysis we cannot use the downsampled version of the dataset for this section, so we will use the full Caron dataset. We will start with just the PBMMC and ETV6-RUNX1 samples. The data have been QC'd, normalised and batch corrected as shown last week, and clustered as shown this morning. We have selected the Leiden clustering with k=60 to take forward.
Load the SCE object:
sce <- readRDS("R_objects/Caron_clustered.PBMMCandETV6RUNX1.rds")
rownames(sce) <- uniquifyFeatureNames(rownames(sce), rowData(sce)$Symbol)
sce
## class: SingleCellExperiment
## dim: 33102 30461
## metadata(1): Samples
## assays(3): counts logcounts reconstructed
## rownames(33102): MIR1302-2HG ENSG00000238009 ... ENSG00000276017
## ENSG00000278817
## rowData names(4): ID Symbol Type Chromosome
## colnames(30461): 1_AAACCTGAGACTTTCG-1 1_AAACCTGGTCTTCAAG-1 ...
## 11_TTTGTCATCAGTTGAC-1 11_TTTGTCATCTCGTTTA-1
## colData names(13): Sample Barcode ... k.60_cluster.fun.leiden label
## reducedDimNames(3): corrected TSNE_corrected UMAP_corrected
## mainExpName: NULL
## altExpNames(0):
We can plot any of the dimensionality reduction outputs stored in the reduced dimensions slot of the single cell object.
plotReducedDim(sce, dimred = "TSNE_corrected", colour_by = "label")
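For example, we could equally view the same clusters on the batch-corrected UMAP:
plotReducedDim(sce, dimred = "UMAP_corrected", colour_by = "label")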
The most obvious differential analysis is to look for changes in expression between conditions. We perform the DE analysis separately for each label. The actual DE testing is performed on “pseudo-bulk” expression profiles (Tung et al. 2017), generated by summing counts together for all cells with the same combination of label and sample. This leverages the resolution offered by single-cell technologies to define the labels, and combines it with the statistical rigor of existing methods for DE analyses involving a small number of samples.
Here we use label and SampleName as the parameters to aggregate by. This means that for each true sample there will be a pseudo-sample for each of our defined clusters. We would therefore have a maximum of 7x17 (119) pseudo-samples. We will have fewer than this, as there will be a few clusters with no cells from a particular sample.
summed <- aggregateAcrossCells(sce,
id=colData(sce)[,c("label", "SampleName")])
summed
## class: SingleCellExperiment
## dim: 33102 102
## metadata(1): Samples
## assays(1): counts
## rownames(33102): MIR1302-2HG ENSG00000238009 ... ENSG00000276017
## ENSG00000278817
## rowData names(4): ID Symbol Type Chromosome
## colnames: NULL
## colData names(16): Sample Barcode ... SampleName ncells
## reducedDimNames(3): corrected TSNE_corrected UMAP_corrected
## mainExpName: NULL
## altExpNames(0):
At this point, it is worth reflecting on the motivations behind the use of pseudo-bulking:
Larger counts are more amenable to standard DE analysis pipelines designed for bulk RNA-seq data. Normalization is more straightforward and certain statistical approximations are more accurate, e.g., the saddlepoint approximation for quasi-likelihood methods or normality for linear models.

Collapsing cells into samples reflects the fact that our biological replication occurs at the sample level (Lun and Marioni 2017). Each sample is represented no more than once for each condition, avoiding problems from unmodelled correlations between samples. Supplying the per-cell counts directly to a DE analysis pipeline would imply that each cell is an independent biological replicate, which is not true from an experimental perspective. (A mixed effects model can handle this variance structure but involves extra statistical and computational complexity for little benefit, see Crowell et al. (2019).)

Variance between cells within each sample is masked, provided it does not affect variance across (replicate) samples. This avoids penalizing DEGs that are not uniformly up- or down-regulated for all cells in all samples of one condition. Masking is generally desirable as DEGs - unlike marker genes - do not need to have low within-sample variance to be interesting, e.g., if the treatment effect is consistent across replicate populations but heterogeneous on a per-cell basis. (Of course, high per-cell variability will still result in weaker DE if it affects the variability across populations, while homogeneous per-cell responses will result in stronger DE due to a larger population-level log-fold change. These effects are also largely desirable.)
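To make the aggregation concrete, here is a small sanity check (not part of the main workflow): the pseudo-bulk profile for one label/sample combination is simply the sum of the raw counts over the corresponding cells, and should match the relevant column of the aggregated object created above (here label "1" in sample PBMMC_1, which we know is present).
cells <- sce$label == "1" & sce$SampleName == "PBMMC_1"
manual <- rowSums(counts(sce)[, cells])
aggregated <- counts(summed)[, summed$label == "1" & summed$SampleName == "PBMMC_1"]
all.equal(unname(manual), unname(as.numeric(aggregated)))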
The DE analysis will be performed using quasi-likelihood (QL) methods from the edgeR package (Robinson, McCarthy, and Smyth 2010; Chen, Lun, and Smyth 2016). This uses a negative binomial generalized linear model (NB GLM) to handle overdispersed count data in experiments with limited replication. In our case, we have biological variation with few replicates per sample group, so edgeR (or its contemporaries) is a natural choice for the analysis. You can read more about the edgeR package and how each step is calculated in the [edgeR user guide](https://bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf).
It is possible to use DESeq2 here instead; the OSCA book uses edgeR, so we have stuck with that, but the choice is yours.
We do not use all labels for GLM fitting as the strong DE between labels makes it difficult to compute a sensible average abundance to model the mean-dispersion trend. Moreover, label-specific batch effects would not be easily handled with a single additive term in the design matrix for the batch. Instead, we picked one of the labels to use for this demonstration.
current <- summed[,summed$label == "1"]
colData(current)
## DataFrame with 7 rows and 16 columns
## Sample Barcode SampleName SampleGroup sum detected
## <character> <character> <character> <character> <numeric> <integer>
## 1 SRR9264343 NA ETV6-RUNX1_1 ETV6-RUNX1 NA NA
## 2 SRR9264344 NA ETV6-RUNX1_2 ETV6-RUNX1 NA NA
## 3 SRR9264345 NA ETV6-RUNX1_3 ETV6-RUNX1 NA NA
## 4 SRR9264346 NA ETV6-RUNX1_4 ETV6-RUNX1 NA NA
## 5 SRR9264351 NA PBMMC_1 PBMMC NA NA
## 6 SRR9264353 NA PBMMC_2 PBMMC NA NA
## 7 SRR9264354 NA PBMMC_3 PBMMC NA NA
## subsets_Mito_sum subsets_Mito_detected subsets_Mito_percent total
## <numeric> <integer> <numeric> <numeric>
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## 7 NA NA NA NA
## sizeFactor k.60_cluster.fun.leiden label label SampleName ncells
## <numeric> <factor> <factor> <factor> <character> <integer>
## 1 NA 1 1 1 ETV6-RUNX1_1 1754
## 2 NA 1 1 1 ETV6-RUNX1_2 4935
## 3 NA 1 1 1 ETV6-RUNX1_3 1790
## 4 NA 1 1 1 ETV6-RUNX1_4 2341
## 5 NA 1 1 1 PBMMC_1 40
## 6 NA 1 1 1 PBMMC_2 68
## 7 NA 1 1 1 PBMMC_3 189
In order to use edgeR we must have the data stored in a DGEList object. For this we need our counts matrix and a data frame containing metadata on our pseudo-samples.
countsToUse <- counts(current)
colnames(countsToUse) <- colData(current)$SampleName
y <- DGEList(countsToUse, samples=colData(current))
y
## An object of class "DGEList"
## $counts
## ETV6-RUNX1_1 ETV6-RUNX1_2 ETV6-RUNX1_3 ETV6-RUNX1_4 PBMMC_1
## MIR1302-2HG 0 0 0 0 0
## ENSG00000238009 2 1 0 2 0
## ENSG00000239945 0 0 0 0 0
## ENSG00000241860 7 20 3 0 1
## ENSG00000286448 0 0 0 0 0
## PBMMC_2 PBMMC_3
## MIR1302-2HG 0 0
## ENSG00000238009 0 0
## ENSG00000239945 0 0
## ENSG00000241860 1 0
## ENSG00000286448 0 0
## 33097 more rows ...
##
## $samples
## group lib.size norm.factors Sample Barcode SampleName
## ETV6-RUNX1_1 1 10938549 1 SRR9264343 <NA> ETV6-RUNX1_1
## ETV6-RUNX1_2 1 14380564 1 SRR9264344 <NA> ETV6-RUNX1_2
## ETV6-RUNX1_3 1 5803180 1 SRR9264345 <NA> ETV6-RUNX1_3
## ETV6-RUNX1_4 1 7835579 1 SRR9264346 <NA> ETV6-RUNX1_4
## PBMMC_1 1 172373 1 SRR9264351 <NA> PBMMC_1
## PBMMC_2 1 399403 1 SRR9264353 <NA> PBMMC_2
## PBMMC_3 1 359025 1 SRR9264354 <NA> PBMMC_3
## SampleGroup sum detected subsets_Mito_sum subsets_Mito_detected
## ETV6-RUNX1_1 ETV6-RUNX1 NA NA NA NA
## ETV6-RUNX1_2 ETV6-RUNX1 NA NA NA NA
## ETV6-RUNX1_3 ETV6-RUNX1 NA NA NA NA
## ETV6-RUNX1_4 ETV6-RUNX1 NA NA NA NA
## PBMMC_1 PBMMC NA NA NA NA
## PBMMC_2 PBMMC NA NA NA NA
## PBMMC_3 PBMMC NA NA NA NA
## subsets_Mito_percent total sizeFactor k.60_cluster.fun.leiden
## ETV6-RUNX1_1 NA NA NA 1
## ETV6-RUNX1_2 NA NA NA 1
## ETV6-RUNX1_3 NA NA NA 1
## ETV6-RUNX1_4 NA NA NA 1
## PBMMC_1 NA NA NA 1
## PBMMC_2 NA NA NA 1
## PBMMC_3 NA NA NA 1
## label label.1 SampleName.1 ncells
## ETV6-RUNX1_1 1 1 ETV6-RUNX1_1 1754
## ETV6-RUNX1_2 1 1 ETV6-RUNX1_2 4935
## ETV6-RUNX1_3 1 1 ETV6-RUNX1_3 1790
## ETV6-RUNX1_4 1 1 ETV6-RUNX1_4 2341
## PBMMC_1 1 1 PBMMC_1 40
## PBMMC_2 1 1 PBMMC_2 68
## PBMMC_3 1 1 PBMMC_3 189
A typical step in bulk RNA-seq data analyses is to remove samples with very low library sizes due to failed library preparation or sequencing. The very low counts in these samples can be troublesome in downstream steps such as normalization or for some statistical approximations used in the DE analysis. In our situation, this is equivalent to removing label-sample combinations that have very few or lowly-sequenced cells. The exact definition of “very low” will vary, but in this case, we remove combinations containing fewer than 20 cells (Crowell et al. 2019).
discarded <- current$ncells < 20
y <- y[,!discarded]
summary(discarded)
## Mode FALSE
## logical 7
Another typical step in bulk RNA-seq analyses is to remove genes that are lowly expressed. This reduces computational work, improves the accuracy of mean-variance trend modelling and decreases the severity of the multiple testing correction. Genes are discarded if they are not expressed above a log-CPM threshold in a minimum number of samples (determined from the size of the smallest treatment group in the experimental design). The filterByExpr function keeps rows that have 'worthwhile' counts in a minimum number of samples. In this case the minimum number would be 3, as that is the smallest SampleGroup size.
keep <- filterByExpr(y, group=current$SampleGroup)
y <- y[keep,]
summary(keep)
## Mode FALSE TRUE
## logical 18548 14554
Finally, we correct for composition biases by computing normalization factors with the trimmed mean of M-values method (Robinson and Oshlack 2010). Counts for our pseudo-bulk samples are large enough to apply bulk normalization methods. In the case of single cell data this also helps to account for the differing numbers of cells that have been summed to produce each pseudo-sample.
y <- calcNormFactors(y)
y$samples
## group lib.size norm.factors Sample Barcode SampleName
## ETV6-RUNX1_1 1 10938549 0.8773775 SRR9264343 <NA> ETV6-RUNX1_1
## ETV6-RUNX1_2 1 14380564 1.0045934 SRR9264344 <NA> ETV6-RUNX1_2
## ETV6-RUNX1_3 1 5803180 1.0453059 SRR9264345 <NA> ETV6-RUNX1_3
## ETV6-RUNX1_4 1 7835579 0.9224021 SRR9264346 <NA> ETV6-RUNX1_4
## PBMMC_1 1 172373 1.0925529 SRR9264351 <NA> PBMMC_1
## PBMMC_2 1 399403 1.0698062 SRR9264353 <NA> PBMMC_2
## PBMMC_3 1 359025 1.0067275 SRR9264354 <NA> PBMMC_3
## SampleGroup sum detected subsets_Mito_sum subsets_Mito_detected
## ETV6-RUNX1_1 ETV6-RUNX1 NA NA NA NA
## ETV6-RUNX1_2 ETV6-RUNX1 NA NA NA NA
## ETV6-RUNX1_3 ETV6-RUNX1 NA NA NA NA
## ETV6-RUNX1_4 ETV6-RUNX1 NA NA NA NA
## PBMMC_1 PBMMC NA NA NA NA
## PBMMC_2 PBMMC NA NA NA NA
## PBMMC_3 PBMMC NA NA NA NA
## subsets_Mito_percent total sizeFactor k.60_cluster.fun.leiden
## ETV6-RUNX1_1 NA NA NA 1
## ETV6-RUNX1_2 NA NA NA 1
## ETV6-RUNX1_3 NA NA NA 1
## ETV6-RUNX1_4 NA NA NA 1
## PBMMC_1 NA NA NA 1
## PBMMC_2 NA NA NA 1
## PBMMC_3 NA NA NA 1
## label label.1 SampleName.1 ncells
## ETV6-RUNX1_1 1 1 ETV6-RUNX1_1 1754
## ETV6-RUNX1_2 1 1 ETV6-RUNX1_2 4935
## ETV6-RUNX1_3 1 1 ETV6-RUNX1_3 1790
## ETV6-RUNX1_4 1 1 ETV6-RUNX1_4 2341
## PBMMC_1 1 1 PBMMC_1 40
## PBMMC_2 1 1 PBMMC_2 68
## PBMMC_3 1 1 PBMMC_3 189
As part of the usual diagnostics for a bulk RNA-seq DE analysis, we generate a mean-difference (MD) plot for each normalized pseudo-bulk profile. This should exhibit a trumpet shape centered at zero indicating that the normalization successfully removed systematic bias between profiles. Lack of zero-centering or dominant discrete patterns at low abundances may be symptomatic of deeper problems with normalization, possibly due to insufficient cells/reads/UMIs composing a particular pseudo-bulk profile.
par(mfrow=c(2,4))
for (i in seq_len(ncol(y))) {
plotMD(y, column=i)
}
We also generate a multi-dimensional scaling (MDS) plot for the pseudo-bulk profiles. MDS arranges the points on the plot so that the distances between each pair of points correlate as well as possible with the dissimilarity between those two samples. The values on the two axes tell you nothing about the variables for a given sample - the plot is just a two-dimensional space in which to arrange the points. You can read it in a similar way to a PCA plot from bulk RNA-seq, as it allows us to visualize the structure of the data. The aim is to check whether samples separate by our known factors of interest. Strong separation foreshadows a large number of DEGs in the subsequent analysis.
y$samples$SampleGroup <- factor(y$samples$SampleGroup)
limma::plotMDS(cpm(y, log=TRUE), col = as.numeric(y$samples$SampleGroup))
Our aim is to test whether the log-fold change between sample groups is significantly different from zero.
design <- model.matrix(~factor(SampleGroup), y$samples)
design
## (Intercept) factor(SampleGroup)PBMMC
## ETV6-RUNX1_1 1 0
## ETV6-RUNX1_2 1 0
## ETV6-RUNX1_3 1 0
## ETV6-RUNX1_4 1 0
## PBMMC_1 1 1
## PBMMC_2 1 1
## PBMMC_3 1 1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$`factor(SampleGroup)`
## [1] "contr.treatment"
We estimate the negative binomial (NB) dispersions with estimateDisp(). The role of the NB dispersion is to model the mean-variance trend, which is not easily accommodated by QL dispersions alone due to the quadratic nature of the NB mean-variance trend.
y <- estimateDisp(y, design)
When a negative binomial model is fitted, we need to estimate the biological coefficient of variation (BCV) before we carry out the analysis; this is equivalent to estimating the dispersion(s) of the negative binomial model. The plot below shows the BCV for each gene as a function of average abundance. The BCV is computed as the square root of the NB dispersion after empirical Bayes shrinkage towards the trend. Trended and common BCV estimates are shown in blue and red, respectively.
plotBCV(y)
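As a quick numeric companion to the plot, the common BCV can be extracted directly from the fitted object (it is the square root of the common NB dispersion):
sqrt(y$common.dispersion)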
We also estimate the quasi-likelihood dispersions with glmQLFit() (Chen, Lun, and Smyth 2016). This fits a GLM to the counts for each gene and estimates the QL dispersion from the GLM deviance. We set robust=TRUE to avoid distortions from highly variable clusters (Phipson et al. 2016). The QL dispersion models the uncertainty and variability of the per-gene variance - which is not well handled by the NB dispersions, so the two dispersion types complement each other in the final analysis.
fit <- glmQLFit(y, design, robust=TRUE)
The plot below shows the QL dispersion estimates for each gene as a function of abundance. Raw estimates (black) are shrunk towards the trend (blue) to yield squeezed estimates (red).
plotQLDisp(fit)
We test for differences in expression due to sample group using glmQLFTest(). The QLF-test is preferred to a likelihood ratio test as it reflects the uncertainty in estimating the dispersion for each gene. DEGs are defined as those with non-zero log-fold changes at a false discovery rate of 5%. If very few genes are significantly DE that sample group has little effect on the transcriptome.
The coef argument specifies which contrast you want tested. It is the number of the column in the model matrix. Since we only have 2 columns in our model matrix (the intercept and PBMMC vs ETV6-RUNX1), we can specify this as 2.
results <- glmQLFTest(fit, coef = 2)
summary(decideTests(results))
## factor(SampleGroup)PBMMC
## Down 861
## NotSig 13238
## Up 455
We can get the results of our comparison using the topTags function. By default this function outputs the top 10 genes by FDR, but you can change this using the n argument.
topTags(results)
## Coefficient: factor(SampleGroup)PBMMC
## logFC logCPM F PValue FDR
## HTR1F 9.892341 6.763185 976.8976 3.324859e-12 4.839000e-08
## FXYD2 7.304759 5.244366 494.8288 1.388601e-10 1.010485e-06
## L3MBTL4 6.729266 4.951867 286.0786 2.706082e-09 1.312810e-05
## MIR4432HG 5.576494 6.324972 218.4894 1.144237e-08 4.163307e-05
## ADGRE2 6.036213 5.212321 187.4924 2.574754e-08 7.494595e-05
## KCNN1 -6.713691 7.045461 138.5996 1.252499e-07 2.216359e-04
## OVCH2 -12.552129 6.413688 209.5101 1.307651e-07 2.216359e-04
## MSR1 -6.662685 7.966067 135.6341 1.401065e-07 2.216359e-04
## FAM171B 9.310463 2.277968 134.5088 1.462816e-07 2.216359e-04
## MSH6 -2.973405 7.889308 133.4300 1.525055e-07 2.216359e-04
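For example, to retrieve the full results table as a data frame rather than just the top 10 genes, we can set n to Inf; the table remains ordered by significance:
all_res <- as.data.frame(topTags(results, n = Inf))
head(all_res)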
Now that we have laid out the theory underlying the DE analysis, we repeat this process for each of the labels. This is conveniently done using the pseudoBulkDGE function from scran, which will loop over all labels and apply the exact analysis described above to each label. To prepare for this, we filter out all sample-label combinations with insufficient cells.
summed.filt <- summed[,summed$ncells >= 20]
We construct a common design matrix that will be used in the analysis for each label. Recall that this matrix should have one row per unique sample (and named as such), reflecting the fact that we are modelling counts on the sample level instead of the cell level.
targets <- colData(sce)[!duplicated(sce$SampleName),] %>%
data.frame()
design <- model.matrix(~factor(SampleGroup), data=targets)
rownames(design) <- targets$SampleName
We then apply the pseudoBulkDGE function to obtain a list of DE genes for each label. This function puts some additional effort into automatically dealing with labels that are not represented in all sample groups, for which a DE analysis between conditions is meaningless, or are not represented in a sufficient number of replicate samples to enable modelling of biological variability.
summed.filt$SampleGroup <- factor(summed.filt$SampleGroup)
de.results <- pseudoBulkDGE(summed.filt,
label = summed.filt$label,
design = ~SampleGroup,
coef = "SampleGroupPBMMC",
condition = summed.filt$SampleName
)
de.results
## List of length 10
## names(10): 1 10 11 15 2 3 4 5 6 8
de.results[[1]]
## DataFrame with 33102 rows and 5 columns
## logFC logCPM F PValue FDR
## <numeric> <numeric> <numeric> <numeric> <numeric>
## MIR1302-2HG NA NA NA NA NA
## ENSG00000238009 NA NA NA NA NA
## ENSG00000239945 NA NA NA NA NA
## ENSG00000241860 1.51813 0.238342 0.990913 0.338749 0.591997
## ENSG00000286448 NA NA NA NA NA
## ... ... ... ... ... ...
## ENSG00000277856 NA NA NA NA NA
## ENSG00000275063 0.181890 0.2612687 0.00326667 0.955776 0.985897
## ENSG00000275869 NA NA NA NA NA
## ENSG00000276017 NA NA NA NA NA
## ENSG00000278817 0.567431 0.0642062 0.08622580 0.773940 0.898339
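Any labels that were skipped by pseudoBulkDGE (for example because they lacked replicates or samples from both conditions) should be recorded in the metadata of the returned list:
metadata(de.results)$failed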
We examine the numbers of DEGs at an FDR of 5% for each label using the decideTestsPerLabel function. Note that genes listed as NA were either filtered out as low-abundance genes for a given label's analysis, or the comparison of interest was not possible for a particular label, e.g., due to lack of residual degrees of freedom or an absence of samples from both conditions.
is.de <- decideTestsPerLabel(de.results, threshold=0.05)
summarizeTestsPerLabel(is.de)
## -1 0 1 NA
## 1 928 16301 413 15460
## 10 135 8724 97 24146
## 11 73 11205 32 21792
## 15 1 5505 6 27590
## 2 373 14377 185 18167
## 3 919 8406 1006 22771
## 4 188 10450 405 22059
## 5 127 4479 412 28084
## 6 77 9134 111 23780
## 8 37 8227 53 24785
For each gene, we compute the percentage of labels in which that gene is upregulated or downregulated. (Here, we consider a gene to be non-DE if it is not retained after filtering.)
up.de <- is.de > 0 & !is.na(is.de)
head(sort(rowMeans(up.de), decreasing=TRUE), 10)
## LYZ S100A9 S100A8 SIK3 AFF3 JARID2 DYRK1A PATJ LRBA RAPGEF2
## 0.9 0.8 0.7 0.7 0.6 0.6 0.6 0.5 0.5 0.5
down.de <- is.de < 0 & !is.na(is.de)
head(sort(rowMeans(down.de), decreasing=TRUE), 10)
## MDK TERF2 LINC03000 SOCS2 RPS4Y2
## 1.0 0.8 0.7 0.7 0.7
## HLA-C TAP1 ENSG00000227706 CCN2 MYO1G
## 0.6 0.6 0.6 0.6 0.6
We further identify label-specific DE genes that are significant in our label of interest yet not DE in any other label. As hypothesis tests are not typically geared towards identifying genes that are not DE, we use an ad hoc approach where we consider a gene to be consistent with the null hypothesis for a label if it fails to be detected even at a generous FDR threshold of 50%.
remotely.de <- decideTestsPerLabel(de.results, threshold=0.5)
not.de <- remotely.de==0 | is.na(remotely.de)
cx <- "1"
other.labels <- setdiff(colnames(not.de), cx)
unique.degs <- is.de[,cx]!=0 & rowMeans(not.de[,other.labels])==1
unique.degs <- names(which(unique.degs))
head(unique.degs)
## [1] "ENSG00000228037" "C1QA" "ENSG00000286838" "ENSG00000223881"
## [5] "ENSG00000228613" "TEX41"
Another thing we may be interested in is inspecting the top significantly differentially expressed gene. To avoid an error caused by negative size factors, and since we don't need them here, we remove the size factors from the summed and filtered single cell object summed.filt.
top_gene <- rownames(de.results[[1]][order(de.results[[1]]$FDR),])[1]
sizeFactors(summed.filt) <- NULL
plotExpression(logNormCounts(summed.filt),
features=top_gene,
x="SampleName", colour_by="SampleName",
other_fields="label") +
facet_wrap(~label) +
ggtitle(top_gene)
See the accompanying worksheet for this session, which includes the code below for the exercise. We want to achieve the following:
Rerun the differential expression analysis using the PRE-T and HHD samples.
Determine which cluster has the most DEGs.
Determine which genes are significantly DE in that cluster and not in any other.
# First load in the other two sample groups
sce_PRET_HHD <- readRDS("R_objects/Caron_clustered.PRETandHHD.rds")
# replace the ensembl IDs with gene symbols where possible
rownames(sce_PRET_HHD) <- uniquifyFeatureNames(rownames(sce_PRET_HHD), rowData(sce_PRET_HHD)$Symbol)
# check your sce object
sce_PRET_HHD
# Part A
# Re run the analysis using these new samples.
FIXME
# Part B
# Looking at your results, which cluster has the most DEGs?
FIXME
# Part C
# Which genes are sig. DE in that cluster and not the others?
FIXME
You can copy and paste from the previous code and just change which object is used.
You will need to run pseudoBulkDGE to get the results for all the clusters at once.
You can copy and paste this code from earlier too, but remember to replace the number of the cluster with the one you found in Part B.
Here is the complete script:
# First load in the other two sample groups
sce_PRET_HHD <- readRDS("R_objects/Caron_clustered.PRETandHHD.rds")
# replace the ensembl IDs with gene symbols where possible
rownames(sce_PRET_HHD) <- uniquifyFeatureNames(rownames(sce_PRET_HHD), rowData(sce_PRET_HHD)$Symbol)
# check your sce object
sce_PRET_HHD
# Part A
# Re run the analysis using these new samples.
summed_PRET_HHD <- aggregateAcrossCells(sce_PRET_HHD,
id=colData(sce_PRET_HHD)[,c("label", "SampleName")])
summed_PRET_HHD
summed_PRET_HHD.filt <- summed_PRET_HHD[,summed_PRET_HHD$ncells >= 20]
targets_PRET_HHD <- colData(sce_PRET_HHD)[!duplicated(sce_PRET_HHD$SampleName),] %>%
data.frame()
design_PRET_HHD <- model.matrix(~factor(SampleGroup), data=targets_PRET_HHD)
rownames(design_PRET_HHD) <- targets_PRET_HHD$SampleName
summed_PRET_HHD.filt$SampleGroup <- factor(summed_PRET_HHD.filt$SampleGroup)
de.results_PRET_HHD <- pseudoBulkDGE(summed_PRET_HHD.filt,
label = summed_PRET_HHD.filt$label,
design = ~SampleGroup,
coef = "SampleGroupPRE-T",
condition = summed_PRET_HHD.filt$SampleName)
# Part B
# Looking at your results, which cluster has the most DEGs?
is.de_PRET_HHD <- decideTestsPerLabel(de.results_PRET_HHD, threshold=0.05)
summarizeTestsPerLabel(is.de_PRET_HHD)
# Part C
# Which genes are sig. DE in that cluster and not the others?
remotely.de_PRET_HHD <- decideTestsPerLabel(de.results_PRET_HHD, threshold=0.5)
not.de_PRET_HHD <- remotely.de_PRET_HHD==0 | is.na(remotely.de_PRET_HHD)
cz <- "2"
other.labels_PRET_HHD <- setdiff(colnames(not.de_PRET_HHD), cz)
unique.degs_PRET_HHD <- is.de_PRET_HHD[,cz]!=0 & rowMeans(not.de_PRET_HHD[,other.labels_PRET_HHD])==1
unique.degs_PRET_HHD <- names(which(unique.degs_PRET_HHD))
head(unique.degs_PRET_HHD)
The above example of differential expression focused on testing for differences in expression between conditions for the same cell type or label. However, the same methodology can be applied to test for differences between cell types across samples. This kind of DE analysis can overcome a lack of suitable replication.
We use a dummy value for the label to allow us to include multiple cell types in the fitted model; otherwise, each cell type would be processed separately.
summed.sub <- summed[,summed$label %in% c("3", "4")]
between.res <- pseudoBulkDGE(summed.sub,
label=rep("dummy", ncol(summed.sub)),
design=~factor(SampleName) + factor(label),
coef="factor(label)4")[[1]]
table(Sig=between.res$FDR <= 0.05, Sign=sign(between.res$logFC))
## Sign
## Sig -1 1
## FALSE 2409 2508
## TRUE 2173 2101
between.res[order(between.res$FDR),]
## DataFrame with 33102 rows and 5 columns
## logFC logCPM F PValue FDR
## <numeric> <numeric> <numeric> <numeric> <numeric>
## MEF2C -4.92038 8.26003 637.089 3.84693e-17 2.52228e-13
## ITK 8.35509 6.72293 615.125 5.48859e-17 2.52228e-13
## PRSS57 -7.77942 6.73196 554.816 1.55744e-16 4.77149e-13
## CD8B 7.82809 6.64907 531.031 2.42250e-16 4.77783e-13
## CCR7 6.63218 7.29159 527.332 2.59919e-16 4.77783e-13
## ... ... ... ... ... ...
## ENSG00000277856 NA NA NA NA NA
## ENSG00000275063 NA NA NA NA NA
## ENSG00000275869 NA NA NA NA NA
## ENSG00000276017 NA NA NA NA NA
## ENSG00000278817 NA NA NA NA NA
summed.sub <- logNormCounts(summed.sub, size.factors=NULL)
plotExpression(summed.sub,
features=head(rownames(between.res)[order(between.res$FDR)]),
x="label",
colour_by=I(factor(summed.sub$SampleName)))
Whether or not this is a scientifically meaningful comparison depends on the nature of the labels. These particular labels were defined by clustering, which means that the presence of DEGs is a foregone conclusion. Nonetheless, it may have some utility for applications where the labels are defined using independent information, e.g., from FACS.
In a DA analysis, we test for significant changes in per-label cell abundance across conditions. This will reveal which cell types are depleted or enriched upon treatment, which is arguably just as interesting as changes in expression within each cell type. The DA analysis has a long history in flow cytometry (Finak et al. 2014; Lun, Richard, and Marioni 2017) where it is routinely used to examine the effects of different conditions on the composition of complex cell populations. By performing it here, we effectively treat scRNA-seq as a “super-FACS” technology for defining relevant subpopulations using the entire transcriptome.
In the past we have demonstrated how to do this by using edgeR on a matrix of samples and clusters, similar to the pseudo-samples, but where the 'counts' are cell numbers. Recently, the same group responsible for the rest of the scater/scran toolkit published miloR, which has some advantages over using edgeR directly in this way.
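For reference, a minimal sketch of that earlier "cell counts into edgeR" approach (following the OSCA differential abundance workflow) is shown below; it is included only for comparison, as the rest of this section uses Milo.
# The 'counts' here are numbers of cells per cluster (label) and sample
abundances <- unclass(table(sce$label, sce$SampleName))
extra.info <- colData(sce)[match(colnames(abundances), sce$SampleName), ]
y.ab <- DGEList(abundances, samples = extra.info)
design.ab <- model.matrix(~factor(SampleGroup), y.ab$samples)
y.ab <- estimateDisp(y.ab, design.ab, trend.method = "none")
fit.ab <- glmQLFit(y.ab, design.ab, robust = TRUE, abundance.trend = FALSE)
res.ab <- glmQLFTest(fit.ab, coef = 2)
topTags(res.ab)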
MiloR uses methods based on [cydar](https://www.nature.com/articles/nmeth.4295), which was designed for mass cytometry data, although it does use the negative binomial GLM implementation from the edgeR package.
While differential abundance (DA) is commonly quantified in discrete cell clusters, Milo uses partially overlapping neighbourhoods of cells on a KNN graph, which means it isn't dependent on our clustering results. Starting from a graph that faithfully recapitulates the biology of the cell population, a Milo analysis consists of three steps: defining representative neighbourhoods on the KNN graph, testing each neighbourhood for differential abundance with a negative binomial GLM, and correcting for multiple testing across overlapping neighbourhoods using a weighted (spatial) FDR.
The first step is to turn our single cell object into a Milo specific object. This is very similar but includes extra slots where we can store the neighbourhood information we generate.
milo <- Milo(sce)
milo
## class: Milo
## dim: 33102 30461
## metadata(1): Samples
## assays(3): counts logcounts reconstructed
## rownames(33102): MIR1302-2HG ENSG00000238009 ... ENSG00000276017
## ENSG00000278817
## rowData names(4): ID Symbol Type Chromosome
## colnames(30461): 1_AAACCTGAGACTTTCG-1 1_AAACCTGGTCTTCAAG-1 ...
## 11_TTTGTCATCAGTTGAC-1 11_TTTGTCATCTCGTTTA-1
## colData names(13): Sample Barcode ... k.60_cluster.fun.leiden label
## reducedDimNames(3): corrected TSNE_corrected UMAP_corrected
## mainExpName: NULL
## altExpNames(0):
## nhoods dimensions(2): 1 1
## nhoodCounts dimensions(2): 1 1
## nhoodDistances dimension(1): 0
## graph names(0):
## nhoodIndex names(1): 0
## nhoodExpression dimension(2): 1 1
## nhoodReducedDim names(0):
## nhoodGraph names(0):
## nhoodAdjacency dimension(2): 1 1
We need to add the KNN graph to the Milo object. This is stored in the graph slot, in igraph format. The miloR package includes functionality to build and store the graph from the PCA dimensions stored in the reducedDim slot. Instead of the standard PCA we will use our MNN batch corrected PCA dimensions. The two important parameters to set here are k, the number of nearest neighbours for graph building, and d, the number of reduced dimensions to use.
milo <- buildGraph(milo, k = 60, d = 30, reduced.dim = "corrected", BPPARAM = MulticoreParam(7))
We define the neighbourhood of a cell, the index, as the group of cells connected to the index cell by an edge in the KNN graph. For efficiency, we don't test for DA in the neighbourhood of every cell; instead we sample as indices a subset of representative cells, using the KNN sampling algorithm of [Gut et al. 2015](https://www.nature.com/articles/nmeth.3545).
For sampling you need to define a few parameters:
prop: the proportion of cells to randomly sample to start with (usually 0.1 - 0.2 is sufficient)
k: the k to use for KNN refinement (we recommend using the same k used for KNN graph building)
d: the number of reduced dimensions to use for KNN refinement (we recommend using the same d used for KNN graph building)
refined: indicates whether you want to use the sampling refinement algorithm, or just pick cells at random. The default and recommended way to go is to use refinement. The only situation in which you might consider using random instead is if you have batch corrected your data with a graph-based correction algorithm, such as BBKNN, but the results of DA testing will be suboptimal.
milo <- makeNhoods(milo, prop = 0.1, k = 60, d=30, refined = TRUE, reduced_dims = "corrected")
Once we have defined neighbourhoods, it’s good to take a look at how big the neighbourhoods are (i.e. how many cells form each neighbourhood). This affects the power of DA testing. We can check this out using the plotNhoodSizeHist function.
The authors of Milo have given several different rules of thumb for deciding whether your histogram looks right: either 'peaking above 20', 'peaking between 50 and 100' or 'an average neighbourhood size over 5 x N_samples'. The tool is still in development. Realistically all of these statements give numbers in the same ballpark, so we can make a judgement on our data. If our histogram falls outside this range we might consider rerunning makeNhoods with an increased k and/or prop.
plotNhoodSizeHist(milo)
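As a quick numeric check alongside the histogram, we can compare the mean neighbourhood size with the '5 x number of samples' rule of thumb mentioned above:
mean(colSums(nhoods(milo)))
5 * length(unique(milo$SampleName))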
Milo leverages the variation in cell numbers between replicates for the same experimental condition to test for differential abundance. Therefore we have to count how many cells from each sample are in each neighbourhood. We need to use the cell metadata and specify which column contains the sample information.
This adds to the Milo object a n×m matrix, where n is the number of neighbourhoods and m is the number of experimental samples. Values indicate the number of cells from each sample counted in a neighbourhood. This count matrix will be used for DA testing.
milo <- countCells(milo, meta.data = data.frame(colData(milo)), sample="SampleName")
head(nhoodCounts(milo))
## 6 x 7 sparse Matrix of class "dgCMatrix"
## ETV6-RUNX1_1 ETV6-RUNX1_2 ETV6-RUNX1_3 ETV6-RUNX1_4 PBMMC_1 PBMMC_2 PBMMC_3
## 1 7 82 4 21 . . .
## 2 . 2 4 3 2 67 53
## 3 . . . . 13 38 43
## 4 39 36 40 132 2 1 6
## 5 . . 4 1 1 53 10
## 6 6 248 2 44 . . 2
Now we are all set to test for differential abundance in neighbourhoods. We implement this hypothesis testing in a generalized linear model (GLM) framework, specifically using the Negative Binomial GLM implementation in edgeR.
We first need to think about our experimental design. The design matrix should match each sample to the experimental condition of interest for DA testing. In this case, we want to detect DA between our PBMMC and ETV6-RUNX1 sample groups. We could also include a batch column in the design matrix at this point but to keep the demonstration simple we will skip this. We would normally do this as we know there is a batch effect that we want to account for in DA testing.
milo_design <- data.frame(colData(milo))[,c("SampleName", "SampleGroup")]
milo_design <- distinct(milo_design)
rownames(milo_design) <- milo_design$SampleName
milo_design
## SampleName SampleGroup
## ETV6-RUNX1_1 ETV6-RUNX1_1 ETV6-RUNX1
## ETV6-RUNX1_2 ETV6-RUNX1_2 ETV6-RUNX1
## ETV6-RUNX1_3 ETV6-RUNX1_3 ETV6-RUNX1
## ETV6-RUNX1_4 ETV6-RUNX1_4 ETV6-RUNX1
## PBMMC_1 PBMMC_1 PBMMC
## PBMMC_2 PBMMC_2 PBMMC
## PBMMC_3 PBMMC_3 PBMMC
Milo uses an adaptation of the Spatial FDR correction introduced by cydar, where we correct p-values accounting for the amount of overlap between neighbourhoods. Specifically, each hypothesis test P-value is weighted by the reciprocal of the kth nearest neighbour distance. To use this statistic we first need to store the distances between nearest neighbors in the Milo object. This is done by the calcNhoodDistance function (N.B. this step is the most time consuming of the analysis workflow and might take a couple of minutes for large datasets).
It is important to note that the value for d here should match what was used for constructing the KNN graph, and that if no reduced dimension is specified the function will default to PCA.
milo <- calcNhoodDistance(milo, d=30, reduced.dim = "corrected")
Now we can do the DA test, explicitly defining our experimental
design. In this case as discussed we will test the difference between
sample groups. The testNhoods
function calculates a
Fold-change and corrected P-value for each neighbourhood, which
indicates whether there is significant differential abundance between
sample groups.
da_results <- testNhoods(milo, design = ~ SampleGroup, design.df = milo_design, reduced.dim = "corrected")
da_results %>%
arrange(SpatialFDR) %>%
head()
## logFC logCPM F PValue FDR Nhood SpatialFDR
## 98 10.27673 10.49514 22.82455 1.800848e-06 0.0001007465 98 0.0001204805
## 108 10.55687 10.77641 23.86867 1.047845e-06 0.0001007465 108 0.0001204805
## 129 10.78087 11.00341 24.57963 7.250755e-07 0.0001007465 129 0.0001204805
## 203 10.19692 10.41833 22.52467 2.104251e-06 0.0001007465 203 0.0001204805
## 230 10.45652 10.67320 23.52392 1.252870e-06 0.0001007465 230 0.0001204805
## 234 10.32545 10.54697 23.02045 1.626749e-06 0.0001007465 234 0.0001204805
We can start inspecting the results of our DA analysis with a couple of standard diagnostic plots. We first inspect the distribution of uncorrected P values, to verify that the test was balanced. We expect a peak at the very low extreme and a tail towards 1. This blog [article](http://varianceexplained.org/statistics/interpreting-pvalue-histogram/) explains the different p-value histogram profiles and what they can mean.
ggplot(da_results, aes(PValue)) + geom_histogram(bins=50)
Then we visualize the test results with a volcano plot (remember that each point here represents a neighbourhood, not a cell).
We have marked a significance threshold of 10% FDR.
ggplot(da_results, aes(logFC, -log10(SpatialFDR))) +
geom_point() +
geom_hline(yintercept = 1)
It looks like we have detected several neighbourhoods where there is a significant difference in cell abundance between sample groups.
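We can also count how many neighbourhoods pass the 10% spatial FDR threshold marked in the plot:
table(da_results$SpatialFDR < 0.1)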
To visualize DA results relating them to the embedding of single cells, we can build an abstracted graph of neighbourhoods that we can superimpose on the single-cell embedding.
milo <- buildNhoodGraph(milo)
In a normal analysis it is assumed that by this stage we would have divided our cells in a biologically meaningful way, be that using the cluster markers to define cell types or another method, so that we could use these annotations as an overlay to help us make sense of the downstream functionality Milo supplies. In this demonstration we have not conclusively done this, so we will just use our cluster numbers as a placeholder.
Here each node represents a neighbourhood, while edges indicate how many cells two neighbourhoods have in common. The layout of nodes is determined by the position of the index cell in the UMAP embedding of all single cells. Neighbourhoods displaying significant DA are coloured by their log-fold change.
umap_plot <- plotReducedDim(milo, dimred = "UMAP_corrected", colour_by="label", text_by = "label")
nh_graph_plot <- plotNhoodGraphDA(milo, da_results, layout="UMAP_corrected",alpha=0.05)
umap_plot + nh_graph_plot +
plot_layout(guides="collect")
We might also be interested in visualizing whether DA is particularly evident in certain labels (or celltypes). To do this, we assign a label to each neighbourhood by finding the most abundant label within cells in each neighbourhood. We can label neighbourhoods in the results data.frame using the function annotateNhoods. This also saves the fraction of cells harboring the label.
da_results <- annotateNhoods(milo, da_results, coldata_col = "label")
head(da_results)
## logFC logCPM F PValue FDR Nhood SpatialFDR
## 1 -7.083725 8.050131 7.424013 6.447434e-03 0.024110230 1 0.0264054327
## 2 5.430335 9.545450 9.270190 2.335375e-03 0.013660535 2 0.0151427104
## 3 9.376783 9.622672 19.382107 1.081494e-05 0.000173652 3 0.0002169261
## 4 -2.183328 9.398469 1.636309 2.008632e-01 0.270886311 4 0.2831046750
## 5 5.019405 8.600057 9.895556 1.661787e-03 0.011484661 5 0.0129657438
## 6 -4.634793 9.201352 5.788311 1.615165e-02 0.042623319 6 0.0460461805
## label label_fraction
## 1 1 1.0000000
## 2 8 1.0000000
## 3 11 1.0000000
## 4 2 0.9921875
## 5 10 1.0000000
## 6 1 1.0000000
We can plot the distribution of label_fraction to see how homogeneous the neighbourhoods are within our labels. We could take this as a measure of how similar our clustering is to the KNN neighbourhoods defined by Milo.
ggplot(da_results, aes(label_fraction)) + geom_histogram(bins=50)
While neighbourhoods tend to be homogeneous, we can define a threshold for label_fraction to exclude neighbourhoods that are a mix of cell types.
da_results$label <- ifelse(da_results$label_fraction < 0.7, "Mixed", da_results$label)
head(da_results)
## logFC logCPM F PValue FDR Nhood SpatialFDR
## 1 -7.083725 8.050131 7.424013 6.447434e-03 0.024110230 1 0.0264054327
## 2 5.430335 9.545450 9.270190 2.335375e-03 0.013660535 2 0.0151427104
## 3 9.376783 9.622672 19.382107 1.081494e-05 0.000173652 3 0.0002169261
## 4 -2.183328 9.398469 1.636309 2.008632e-01 0.270886311 4 0.2831046750
## 5 5.019405 8.600057 9.895556 1.661787e-03 0.011484661 5 0.0129657438
## 6 -4.634793 9.201352 5.788311 1.615165e-02 0.042623319 6 0.0460461805
## label label_fraction
## 1 1 1.0000000
## 2 8 1.0000000
## 3 11 1.0000000
## 4 2 0.9921875
## 5 10 1.0000000
## 6 1 1.0000000
Now we can visualise the distribution of DA Fold Changes in different labels.
plotDAbeeswarm(da_results, group.by = "label")
This is already quite informative: we can see that several clusters are enriched and some are depleted between our sample groups. There are not many DA neighbourhoods with a mixed label either.
Once you have found your neighbourhoods showing significant DA between conditions, you might want to find gene signatures specific to the cells in those neighbourhoods. The function findNhoodGroupMarkers runs a one-VS-all differential gene expression test to identify marker genes for a group of neighbourhoods of interest. Before running this function you will need to define your neighbourhood groups depending on your biological question. This needs to be stored as a NhoodGroup column in the da_results data.frame.
In a case where all the DA neighbourhoods seem to belong to the same region of the graph, you might just want to test the significant DA neighbourhoods with the same logFC against all the rest. Here is a demonstration using the first 10 genes in the object.
da_results$NhoodGroup <- as.numeric(da_results$SpatialFDR < 0.1 & da_results$logFC < 0)
da_nhood_markers <- findNhoodGroupMarkers(milo, da_results, subset.row = rownames(milo)[1:10])
head(da_nhood_markers)
## GeneID logFC_1 adj.P.Val_1 logFC_0 adj.P.Val_0
## 1 ENSG00000229905 -7.225889e-05 1 7.225889e-05 1
## 2 ENSG00000235146 0.000000e+00 1 0.000000e+00 1
## 3 ENSG00000238009 -9.575701e-04 1 9.575701e-04 1
## 4 ENSG00000239945 0.000000e+00 1 0.000000e+00 1
## 5 ENSG00000241860 2.215696e-04 1 -2.215696e-04 1
## 6 ENSG00000286448 -9.739554e-05 1 9.739554e-05 1
For this analysis we recommend aggregating the neighbourhood expression profiles by experimental samples (the same used for DA testing), by setting aggregate.samples=TRUE. This way single-cells will not be considered as “replicates” during DGE testing, and dispersion will be estimated between true biological replicates. Like so:
da_nhood_markers <- findNhoodGroupMarkers(milo, da_results, subset.row = rownames(milo)[1:10],
aggregate.samples = TRUE, sample_col = "SampleName")
head(da_nhood_markers)
## GeneID logFC_1 adj.P.Val_1 logFC_0 adj.P.Val_0
## 1 ENSG00000229905 -2.043428e-04 1 2.043428e-04 1
## 2 ENSG00000235146 0.000000e+00 1 0.000000e+00 1
## 3 ENSG00000238009 5.712642e-04 1 -5.712642e-04 1
## 4 ENSG00000239945 0.000000e+00 1 0.000000e+00 1
## 5 ENSG00000241860 1.047345e-03 1 -1.047345e-03 1
## 6 ENSG00000286448 -1.355253e-20 1 1.355253e-20 1
In many cases, such as this example, DA neighbourhoods are found in different areas of the KNN graph, and grouping together all significant DA populations might not be ideal, as they might include cells of very different cell types. For this kind of scenario, miloR provides the groupNhoods function, which uses community detection to partition neighbourhoods into groups on the basis of (1) the number of shared cells between 2 neighbourhoods; (2) the direction of fold-change for DA neighbourhoods; (3) the difference in fold change. This is done using the Louvain community detection algorithm.
The max.lfc.delta argument here determines the absolute difference in log fold change below which neighbourhoods should not be considered adjacent. The overlap argument determines the number of cells that must overlap between adjacent neighbourhoods for merging and there is a default FDR of 0.1 for determining which neighbourhoods are considered DA.
da_results <- groupNhoods(milo, da_results, max.lfc.delta = 10, overlap = 1)
head(da_results)
## logFC logCPM F PValue FDR Nhood SpatialFDR
## 1 -7.083725 8.050131 7.424013 6.447434e-03 0.024110230 1 0.0264054327
## 2 5.430335 9.545450 9.270190 2.335375e-03 0.013660535 2 0.0151427104
## 3 9.376783 9.622672 19.382107 1.081494e-05 0.000173652 3 0.0002169261
## 4 -2.183328 9.398469 1.636309 2.008632e-01 0.270886311 4 0.2831046750
## 5 5.019405 8.600057 9.895556 1.661787e-03 0.011484661 5 0.0129657438
## 6 -4.634793 9.201352 5.788311 1.615165e-02 0.042623319 6 0.0460461805
## label label_fraction NhoodGroup
## 1 1 1.0000000 1
## 2 8 1.0000000 2
## 3 11 1.0000000 3
## 4 2 0.9921875 4
## 5 10 1.0000000 5
## 6 1 1.0000000 1
We can view this on a UMAP.
plotNhoodGroups(milo, da_results, layout="UMAP_corrected")
And on a bee swarm plot.
plotDAbeeswarm(da_results, "NhoodGroup")
See the accompanying worksheet for this session, which includes the code below for the exercise. We want to achieve the following:
Alter the values of max.lfc.delta and overlap to see how the grouping is affected.
set.seed(42) # set your random seed for reproducibility
# Part A
# rerun the grouping with values for max.lfc.delta of 1, 5, 25 and plot beeswarm for each
plotDAbeeswarm(groupNhoods(milo, da_results, max.lfc.delta = FIXME) , group.by = "NhoodGroup") + ggtitle("max LFC delta=1")
# Part B
# Using the max.lfc.delta you preferred from Part A now alter the overlap value trying 1, 3, 10.
FIXME
Replace the max.lfc.delta value to get the new beeswarm plot. Overlap is defaulted to 1 so we can just leave that for now.
Add the overlap argument into the groupNhoods function and try the different values suggested.
Here is the complete script:
# Try different parameter values for groupNhoods
set.seed(42) # set your random seed for reproducibility
# Part A
# rerun the grouping with values for max.lfc.delta of 1, 5, 25 and plot beeswarm for each
plotDAbeeswarm(groupNhoods(milo, da_results, max.lfc.delta = 1) , group.by = "NhoodGroup") + ggtitle("max LFC delta=1")
plotDAbeeswarm(groupNhoods(milo, da_results, max.lfc.delta = 5) , group.by = "NhoodGroup") + ggtitle("max LFC delta=5")
plotDAbeeswarm(groupNhoods(milo, da_results, max.lfc.delta = 25) , group.by = "NhoodGroup") + ggtitle("max LFC delta=25")
# Part B
# Using the max.lfc.delta you preferred from Part A now alter the overlap value trying 1, 3, 10.
plotDAbeeswarm(groupNhoods(milo, da_results, max.lfc.delta = 5, overlap = 1) , group.by = "NhoodGroup") + ggtitle("overlap=1")
plotDAbeeswarm(groupNhoods(milo, da_results, max.lfc.delta = 5, overlap = 3) , group.by = "NhoodGroup") + ggtitle("overlap=3")
plotDAbeeswarm(groupNhoods(milo, da_results, max.lfc.delta = 5, overlap = 10) , group.by = "NhoodGroup") + ggtitle("overlap=10")
# The final code
set.seed(42)
da_results <- groupNhoods(milo, da_results, max.lfc.delta = 5, overlap=1)
plotNhoodGroups(milo, da_results, layout="UMAP_corrected")
plotDAbeeswarm(da_results, group.by = "NhoodGroup")
Once we have grouped neighbourhoods using groupNhoods we are all set to identify gene signatures between neighbourhood groups.
Let’s restrict the testing to highly variable genes in this case.
set.seed(101)
dec <- modelGeneVar(milo)
hvgs <- getTopHVGs(dec, n=2000)
head(hvgs)
## [1] "HBB" "HBA2" "HBA1" "HBD" "HBM" "AHSP"
We run findNhoodGroupMarkers to test for one-vs-all differential gene expression for each neighbourhood group.
set.seed(42)
nhood_markers <- findNhoodGroupMarkers(milo, da_results, subset.row = hvgs,
aggregate.samples = TRUE, sample_col = "SampleName")
head(nhood_markers)
## GeneID logFC_1 adj.P.Val_1 logFC_2 adj.P.Val_2 logFC_3
## 1 AAK1 -0.04947289 0.8252759 0.71919526 6.423608e-12 0.08829380
## 2 ABCB10 -0.19780960 0.6779130 -0.19317568 5.931055e-01 -0.22045817
## 3 ABCB6 -0.09685217 0.6779130 -0.08422366 6.306008e-01 -0.09934854
## 4 ABCB7 -0.00340844 0.9820320 -0.08121156 6.383011e-01 -0.19252155
## 5 ABCC4 -0.16426694 0.6779130 -0.10213583 7.222900e-01 -0.16297673
## 6 ABCG2 -0.05194183 0.8202183 -0.07239040 7.342526e-01 0.01626998
## adj.P.Val_3 logFC_4 adj.P.Val_4 logFC_5 adj.P.Val_5 logFC_6
## 1 0.5888909 -0.06037954 0.8874002 -0.2172419 1.294468e-01 -0.0829621623
## 2 0.4263990 -0.17004678 0.8769033 1.0632392 1.088709e-11 -0.1704196153
## 3 0.4842964 -0.08332507 0.8769033 0.3390345 7.411499e-04 -0.0798352219
## 4 0.1671680 -0.02081495 0.9467981 0.1411848 2.237871e-01 0.0050901796
## 5 0.5035200 -0.12338890 0.8769607 0.2986599 1.372684e-01 -0.0003324888
## 6 0.9273264 -0.02211144 0.9589972 0.2266879 1.264012e-01 -0.0408581339
## adj.P.Val_6 logFC_7 adj.P.Val_7 logFC_8 adj.P.Val_8 logFC_9
## 1 0.9703971 -0.2258893 9.254174e-02 -0.007415208 0.9649177 -0.20718712
## 2 0.9703971 0.6374723 4.940311e-04 -0.202749072 0.6703971 -0.21383212
## 3 0.9703971 0.4585965 5.223759e-07 -0.101685252 0.6703971 -0.09736909
## 4 0.9989145 0.2411565 1.955074e-02 0.072159800 0.6804150 0.01634036
## 5 0.9991790 0.7858936 9.551390e-07 -0.174374418 0.6703971 -0.11740202
## 6 0.9789949 0.4935845 6.540697e-05 -0.057816692 0.7717426 -0.11456751
## adj.P.Val_9 logFC_10 adj.P.Val_10 logFC_11 adj.P.Val_11
## 1 0.4184316 -0.08672881 0.8820868 0.09713615 0.61417778
## 2 0.5766622 -0.13569569 0.8820868 -0.29526848 0.29424016
## 3 0.6043124 -0.09320812 0.8820868 -0.11432649 0.45215231
## 4 0.9111764 -0.03492725 0.9251699 -0.17400203 0.24569760
## 5 0.6635902 -0.10677594 0.8867533 -0.18597552 0.48119712
## 6 0.6084486 -0.16794330 0.8820868 -0.31624138 0.09204369
We can get the markers that define a specific group.
gr9_markers <- nhood_markers[c("logFC_9", "adj.P.Val_9","GeneID")]
colnames(gr9_markers) <- c("logFC", "adj.P.Val", "GeneID")
head(gr9_markers[order(gr9_markers$adj.P.Val), ])
## logFC adj.P.Val GeneID
## 652 1.0104122 4.330811e-32 FCRL1
## 60 1.2608281 5.256457e-30 ANGPTL1
## 645 0.3957021 8.416153e-29 FCER2
## 998 0.3409898 1.904284e-28 LINC02397
## 857 1.2538224 2.259114e-27 IGHD
## 191 0.2942973 3.985596e-26 BTLA
We can start to do many visualisations and further investigations of the data using these groupings. More details are available in the Milo documentation.
markers <- nhood_markers$GeneID[nhood_markers$adj.P.Val_9 < 0.01
& nhood_markers$logFC_9 > 0]
plotNhoodExpressionGroups(milo, da_results, features=intersect(rownames(milo), markers[1:10]),
subset.nhoods = da_results$NhoodGroup %in% c('6','9'),
scale=TRUE,
grid.space = "fixed")
sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur ... 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] miloR_1.4.0 BiocParallel_1.30.3
## [3] bluster_1.6.0 DT_0.25
## [5] patchwork_1.1.2 forcats_0.5.2
## [7] stringr_1.4.1 dplyr_1.0.10
## [9] purrr_0.3.4 readr_2.1.2
## [11] tidyr_1.2.1 tibble_3.1.8
## [13] tidyverse_1.3.2 edgeR_3.38.4
## [15] limma_3.52.3 batchelor_1.12.3
## [17] scran_1.24.1 scater_1.24.0
## [19] ggplot2_3.3.6 scuttle_1.6.3
## [21] SingleCellExperiment_1.18.0 SummarizedExperiment_1.26.1
## [23] Biobase_2.56.0 GenomicRanges_1.48.0
## [25] GenomeInfoDb_1.32.4 IRanges_2.30.1
## [27] S4Vectors_0.34.0 BiocGenerics_0.42.0
## [29] MatrixGenerics_1.8.1 matrixStats_0.62.0
##
## loaded via a namespace (and not attached):
## [1] googledrive_2.0.0 ggbeeswarm_0.6.0
## [3] colorspace_2.0-3 ellipsis_0.3.2
## [5] XVector_0.36.0 BiocNeighbors_1.14.0
## [7] fs_1.5.2 rstudioapi_0.14
## [9] farver_2.1.1 graphlayouts_0.8.1
## [11] ggrepel_0.9.1 fansi_1.0.3
## [13] lubridate_1.8.0 xml2_1.3.3
## [15] splines_4.2.1 codetools_0.2-18
## [17] sparseMatrixStats_1.8.0 cachem_1.0.6
## [19] knitr_1.40 polyclip_1.10-0
## [21] jsonlite_1.8.0 ResidualMatrix_1.6.1
## [23] broom_1.0.1 cluster_2.1.4
## [25] dbplyr_2.2.1 ggforce_0.3.4
## [27] compiler_4.2.1 httr_1.4.4
## [29] dqrng_0.3.0 backports_1.4.1
## [31] assertthat_0.2.1 Matrix_1.5-1
## [33] fastmap_1.1.0 gargle_1.2.1
## [35] cli_3.4.0 tweenr_2.0.2
## [37] BiocSingular_1.12.0 htmltools_0.5.3
## [39] tools_4.2.1 rsvd_1.0.5
## [41] igraph_1.3.4 gtable_0.3.1
## [43] glue_1.6.2 GenomeInfoDbData_1.2.8
## [45] Rcpp_1.0.9 cellranger_1.1.0
## [47] jquerylib_0.1.4 vctrs_0.4.1
## [49] DelayedMatrixStats_1.18.0 ggraph_2.0.6
## [51] xfun_0.33 beachmat_2.12.0
## [53] rvest_1.0.3 lifecycle_1.0.2
## [55] irlba_2.3.5 gtools_3.9.3
## [57] statmod_1.4.37 googlesheets4_1.0.1
## [59] MASS_7.3-58.1 zlibbioc_1.42.0
## [61] scales_1.2.1 tidygraph_1.2.2
## [63] hms_1.1.2 parallel_4.2.1
## [65] RColorBrewer_1.1-3 yaml_2.3.5
## [67] gridExtra_2.3 sass_0.4.2
## [69] stringi_1.7.8 highr_0.9
## [71] ScaledMatrix_1.4.1 rlang_1.0.5
## [73] pkgconfig_2.0.3 bitops_1.0-7
## [75] evaluate_0.16 lattice_0.20-45
## [77] labeling_0.4.2 htmlwidgets_1.5.4
## [79] cowplot_1.1.1 tidyselect_1.1.2
## [81] magrittr_2.0.3 R6_2.5.1
## [83] generics_0.1.3 metapod_1.4.0
## [85] DelayedArray_0.22.0 DBI_1.1.3
## [87] pillar_1.8.1 haven_2.5.1
## [89] withr_2.5.0 RCurl_1.98-1.8
## [91] crayon_1.5.1 modelr_0.1.9
## [93] utf8_1.2.2 tzdb_0.3.0
## [95] rmarkdown_2.16 viridis_0.6.2
## [97] locfit_1.5-9.6 grid_4.2.1
## [99] readxl_1.4.1 reprex_2.0.2
## [101] digest_0.6.29 munsell_0.5.0
## [103] beeswarm_0.4.0 viridisLite_0.4.1
## [105] vipor_0.4.5 bslib_0.4.0