Acknowledgments: much of the material in this section has been derived from the chapters on differential expression and abundance in the OSCA book and the Hemberg Group course materials. Additional material concerning miloR is based on the demonstration from the Marioni Lab.
A powerful use of scRNA-seq technology lies in the design of replicated multi-condition experiments to detect changes in composition or expression between conditions. This can provide more biological insight than conventional scRNA-seq experiments involving only one biological condition, especially if we can relate population changes to specific biological states.
Differential analyses of multi-condition scRNA-seq experiments can be broadly split into two categories - differential expression (DE) and differential abundance (DA) analyses. The former tests for changes in expression between conditions for cells of the same ‘type’ that are present in both conditions, while the latter tests for changes in the composition of cell types (or states, etc.) between conditions.
library(scater)
library(scran)
library(batchelor)
library(edgeR)
library(tidyverse)
library(patchwork)
library(DT)
library(bluster)
library(BiocParallel)
library(miloR)
bpp <- MulticoreParam(7)
Due to the nature of differential analysis we cannot use the downsampled version of the dataset for this section, so we will use the full Caron dataset. We will start with just the PBMMC and ETV6-RUNX1 samples. The data have been QC'd, normalised and batch corrected as shown last week, and clustered as shown this morning. We have selected the Leiden clustering with k=60 to take forward.
Load the SCE object:
sce <- readRDS("R_objects/Caron_clustered.PBMMCandETV6RUNX1.rds")
rownames(sce) <- uniquifyFeatureNames(rownames(sce), rowData(sce)$Symbol)
sce
## class: SingleCellExperiment
## dim: 33102 30461
## metadata(1): Samples
## assays(3): counts logcounts reconstructed
## rownames(33102): MIR1302-2HG ENSG00000238009 ... ENSG00000276017
## ENSG00000278817
## rowData names(4): ID Symbol Type Chromosome
## colnames(30461): 1_AAACCTGAGACTTTCG-1 1_AAACCTGGTCTTCAAG-1 ...
## 11_TTTGTCATCAGTTGAC-1 11_TTTGTCATCTCGTTTA-1
## colData names(13): Sample Barcode ... k.60_cluster.fun.leiden label
## reducedDimNames(3): corrected TSNE_corrected UMAP_corrected
## mainExpName: NULL
## altExpNames(0):
We can plot any of the dimensionality reduction outputs stored in the reduced dimensions slot of the single cell object.
plotReducedDim(sce, dimred = "TSNE_corrected", colour_by = "label")
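For example, we could equally view the same clusters on the batch-corrected UMAP:
plotReducedDim(sce, dimred = "UMAP_corrected", colour_by = "label")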
The most obvious differential analysis is to look for changes in expression between conditions. We perform the DE analysis separately for each label. The actual DE testing is performed on “pseudo-bulk” expression profiles (Tung et al. 2017), generated by summing counts together for all cells with the same combination of label and sample. This leverages the resolution offered by single-cell technologies to define the labels, and combines it with the statistical rigor of existing methods for DE analyses involving a small number of samples.
Here we use label and SampleName as the parameters to aggregate by. This means that for each true sample there will be a pseudo-sample for each of our defined clusters. We would therefore have a maximum of 7x17 (119) pseudo-samples. We will have fewer than this, as there will be a few clusters with no cells from a particular sample.
summed <- aggregateAcrossCells(sce,
id=colData(sce)[,c("label", "SampleName")])
summed
## class: SingleCellExperiment
## dim: 33102 102
## metadata(1): Samples
## assays(1): counts
## rownames(33102): MIR1302-2HG ENSG00000238009 ... ENSG00000276017
## ENSG00000278817
## rowData names(4): ID Symbol Type Chromosome
## colnames: NULL
## colData names(16): Sample Barcode ... SampleName ncells
## reducedDimNames(3): corrected TSNE_corrected UMAP_corrected
## mainExpName: NULL
## altExpNames(0):
At this point, it is worth reflecting on the motivations behind the use of pseudo-bulking:
Larger counts are more amenable to standard DE analysis pipelines designed for bulk RNA-seq data. Normalization is more straightforward and certain statistical approximations are more accurate, e.g., the saddlepoint approximation for quasi-likelihood methods or normality for linear models.

Collapsing cells into samples reflects the fact that our biological replication occurs at the sample level (Lun and Marioni 2017). Each sample is represented no more than once for each condition, avoiding problems from unmodelled correlations between samples. Supplying the per-cell counts directly to a DE analysis pipeline would imply that each cell is an independent biological replicate, which is not true from an experimental perspective. (A mixed effects model can handle this variance structure but involves extra statistical and computational complexity for little benefit, see Crowell et al. (2019).)

Variance between cells within each sample is masked, provided it does not affect variance across (replicate) samples. This avoids penalizing DEGs that are not uniformly up- or down-regulated for all cells in all samples of one condition. Masking is generally desirable as DEGs - unlike marker genes - do not need to have low within-sample variance to be interesting, e.g., if the treatment effect is consistent across replicate populations but heterogeneous on a per-cell basis. (Of course, high per-cell variability will still result in weaker DE if it affects the variability across populations, while homogeneous per-cell responses will result in stronger DE due to a larger population-level log-fold change. These effects are also largely desirable.)
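To make the aggregation concrete, here is a small sanity check (not part of the main workflow): the pseudo-bulk profile for one label/sample combination is simply the sum of the raw counts over the corresponding cells, and should match the relevant column of the aggregated object created above (here label "1" in sample PBMMC_1, which we know is present).
cells <- sce$label == "1" & sce$SampleName == "PBMMC_1"
manual <- rowSums(counts(sce)[, cells])
aggregated <- counts(summed)[, summed$label == "1" & summed$SampleName == "PBMMC_1"]
all.equal(unname(manual), unname(as.numeric(aggregated)))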
The DE analysis will be performed using quasi-likelihood (QL) methods from the edgeR package (Robinson, McCarthy, and Smyth 2010; Chen, Lun, and Smyth 2016). This uses a negative binomial generalized linear model (NB GLM) to handle overdispersed count data in experiments with limited replication. In our case, we have biological variation with few replicates per sample group, so edgeR (or its contemporaries) is a natural choice for the analysis. You can read more about the edgeR package and how each step is calculated in the [edgeR user guide](https://bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf).
It is possible to use DESeq2 here instead; the OSCA book uses edgeR, so we have stuck with that, but the choice is yours.
We do not use all labels for GLM fitting as the strong DE between labels makes it difficult to compute a sensible average abundance to model the mean-dispersion trend. Moreover, label-specific batch effects would not be easily handled with a single additive term in the design matrix for the batch. Instead, we picked one of the labels to use for this demonstration.
current <- summed[,summed$label == "1"]
colData(current)
## DataFrame with 7 rows and 16 columns
## Sample Barcode SampleName SampleGroup sum detected
## <character> <character> <character> <character> <numeric> <integer>
## 1 SRR9264343 NA ETV6-RUNX1_1 ETV6-RUNX1 NA NA
## 2 SRR9264344 NA ETV6-RUNX1_2 ETV6-RUNX1 NA NA
## 3 SRR9264345 NA ETV6-RUNX1_3 ETV6-RUNX1 NA NA
## 4 SRR9264346 NA ETV6-RUNX1_4 ETV6-RUNX1 NA NA
## 5 SRR9264351 NA PBMMC_1 PBMMC NA NA
## 6 SRR9264353 NA PBMMC_2 PBMMC NA NA
## 7 SRR9264354 NA PBMMC_3 PBMMC NA NA
## subsets_Mito_sum subsets_Mito_detected subsets_Mito_percent total
## <numeric> <integer> <numeric> <numeric>
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## 7 NA NA NA NA
## sizeFactor k.60_cluster.fun.leiden label label SampleName ncells
## <numeric> <factor> <factor> <factor> <character> <integer>
## 1 NA 1 1 1 ETV6-RUNX1_1 1754
## 2 NA 1 1 1 ETV6-RUNX1_2 4935
## 3 NA 1 1 1 ETV6-RUNX1_3 1790
## 4 NA 1 1 1 ETV6-RUNX1_4 2341
## 5 NA 1 1 1 PBMMC_1 40
## 6 NA 1 1 1 PBMMC_2 68
## 7 NA 1 1 1 PBMMC_3 189
In order to use edgeR we must have the data stored in a DGEList object. For this we need our counts matrix and a data frame containing metadata on our pseudo-samples.
countsToUse <- counts(current)
colnames(countsToUse) <- colData(current)$SampleName
y <- DGEList(countsToUse, samples=colData(current))
y
## An object of class "DGEList"
## $counts
## ETV6-RUNX1_1 ETV6-RUNX1_2 ETV6-RUNX1_3 ETV6-RUNX1_4 PBMMC_1
## MIR1302-2HG 0 0 0 0 0
## ENSG00000238009 2 1 0 2 0
## ENSG00000239945 0 0 0 0 0
## ENSG00000241860 7 20 3 0 1
## ENSG00000286448 0 0 0 0 0
## PBMMC_2 PBMMC_3
## MIR1302-2HG 0 0
## ENSG00000238009 0 0
## ENSG00000239945 0 0
## ENSG00000241860 1 0
## ENSG00000286448 0 0
## 33097 more rows ...
##
## $samples
## group lib.size norm.factors Sample Barcode SampleName
## ETV6-RUNX1_1 1 10938549 1 SRR9264343 <NA> ETV6-RUNX1_1
## ETV6-RUNX1_2 1 14380564 1 SRR9264344 <NA> ETV6-RUNX1_2
## ETV6-RUNX1_3 1 5803180 1 SRR9264345 <NA> ETV6-RUNX1_3
## ETV6-RUNX1_4 1 7835579 1 SRR9264346 <NA> ETV6-RUNX1_4
## PBMMC_1 1 172373 1 SRR9264351 <NA> PBMMC_1
## PBMMC_2 1 399403 1 SRR9264353 <NA> PBMMC_2
## PBMMC_3 1 359025 1 SRR9264354 <NA> PBMMC_3
## SampleGroup sum detected subsets_Mito_sum subsets_Mito_detected
## ETV6-RUNX1_1 ETV6-RUNX1 NA NA NA NA
## ETV6-RUNX1_2 ETV6-RUNX1 NA NA NA NA
## ETV6-RUNX1_3 ETV6-RUNX1 NA NA NA NA
## ETV6-RUNX1_4 ETV6-RUNX1 NA NA NA NA
## PBMMC_1 PBMMC NA NA NA NA
## PBMMC_2 PBMMC NA NA NA NA
## PBMMC_3 PBMMC NA NA NA NA
## subsets_Mito_percent total sizeFactor k.60_cluster.fun.leiden
## ETV6-RUNX1_1 NA NA NA 1
## ETV6-RUNX1_2 NA NA NA 1
## ETV6-RUNX1_3 NA NA NA 1
## ETV6-RUNX1_4 NA NA NA 1
## PBMMC_1 NA NA NA 1
## PBMMC_2 NA NA NA 1
## PBMMC_3 NA NA NA 1
## label label.1 SampleName.1 ncells
## ETV6-RUNX1_1 1 1 ETV6-RUNX1_1 1754
## ETV6-RUNX1_2 1 1 ETV6-RUNX1_2 4935
## ETV6-RUNX1_3 1 1 ETV6-RUNX1_3 1790
## ETV6-RUNX1_4 1 1 ETV6-RUNX1_4 2341
## PBMMC_1 1 1 PBMMC_1 40
## PBMMC_2 1 1 PBMMC_2 68
## PBMMC_3 1 1 PBMMC_3 189
A typical step in bulk RNA-seq data analyses is to remove samples with very low library sizes due to failed library preparation or sequencing. The very low counts in these samples can be troublesome in downstream steps such as normalization or for some statistical approximations used in the DE analysis. In our situation, this is equivalent to removing label-sample combinations that have very few or lowly-sequenced cells. The exact definition of “very low” will vary, but in this case, we remove combinations containing fewer than 20 cells (Crowell et al. 2019).
discarded <- current$ncells < 20
y <- y[,!discarded]
summary(discarded)
## Mode FALSE
## logical 7
Another typical step in bulk RNA-seq analyses is to remove genes that are lowly expressed. This reduces computational work, improves the accuracy of mean-variance trend modelling and decreases the severity of the multiple testing correction. Genes are discarded if they are not expressed above a log-CPM threshold in a minimum number of samples (determined from the size of the smallest treatment group in the experimental design). The filterByExpr function keeps rows that have 'worthwhile' counts in a minimum number of samples. In this case the minimum number would be 3, as that is the smallest SampleGroup size.
keep <- filterByExpr(y, group=current$SampleGroup)
y <- y[keep,]
summary(keep)
## Mode FALSE TRUE
## logical 18548 14554
Finally, we correct for composition biases by computing normalization factors with the trimmed mean of M-values method (Robinson and Oshlack 2010). Counts for our pseudo-bulk samples are large enough to apply bulk normalization methods. In the case of single cell data this also helps to account for the differing numbers of cells that have been summed to produce each pseudo-sample.
y <- calcNormFactors(y)
y$samples
## group lib.size norm.factors Sample Barcode SampleName
## ETV6-RUNX1_1 1 10938549 0.8773775 SRR9264343 <NA> ETV6-RUNX1_1
## ETV6-RUNX1_2 1 14380564 1.0045934 SRR9264344 <NA> ETV6-RUNX1_2
## ETV6-RUNX1_3 1 5803180 1.0453059 SRR9264345 <NA> ETV6-RUNX1_3
## ETV6-RUNX1_4 1 7835579 0.9224021 SRR9264346 <NA> ETV6-RUNX1_4
## PBMMC_1 1 172373 1.0925529 SRR9264351 <NA> PBMMC_1
## PBMMC_2 1 399403 1.0698062 SRR9264353 <NA> PBMMC_2
## PBMMC_3 1 359025 1.0067275 SRR9264354 <NA> PBMMC_3
## SampleGroup sum detected subsets_Mito_sum subsets_Mito_detected
## ETV6-RUNX1_1 ETV6-RUNX1 NA NA NA NA
## ETV6-RUNX1_2 ETV6-RUNX1 NA NA NA NA
## ETV6-RUNX1_3 ETV6-RUNX1 NA NA NA NA
## ETV6-RUNX1_4 ETV6-RUNX1 NA NA NA NA
## PBMMC_1 PBMMC NA NA NA NA
## PBMMC_2 PBMMC NA NA NA NA
## PBMMC_3 PBMMC NA NA NA NA
## subsets_Mito_percent total sizeFactor k.60_cluster.fun.leiden
## ETV6-RUNX1_1 NA NA NA 1
## ETV6-RUNX1_2 NA NA NA 1
## ETV6-RUNX1_3 NA NA NA 1
## ETV6-RUNX1_4 NA NA NA 1
## PBMMC_1 NA NA NA 1
## PBMMC_2 NA NA NA 1
## PBMMC_3 NA NA NA 1
## label label.1 SampleName.1 ncells
## ETV6-RUNX1_1 1 1 ETV6-RUNX1_1 1754
## ETV6-RUNX1_2 1 1 ETV6-RUNX1_2 4935
## ETV6-RUNX1_3 1 1 ETV6-RUNX1_3 1790
## ETV6-RUNX1_4 1 1 ETV6-RUNX1_4 2341
## PBMMC_1 1 1 PBMMC_1 40
## PBMMC_2 1 1 PBMMC_2 68
## PBMMC_3 1 1 PBMMC_3 189
As part of the usual diagnostics for a bulk RNA-seq DE analysis, we generate a mean-difference (MD) plot for each normalized pseudo-bulk profile. This should exhibit a trumpet shape centered at zero indicating that the normalization successfully removed systematic bias between profiles. Lack of zero-centering or dominant discrete patterns at low abundances may be symptomatic of deeper problems with normalization, possibly due to insufficient cells/reads/UMIs composing a particular pseudo-bulk profile.
par(mfrow=c(2,4))
for (i in seq_len(ncol(y))) {
plotMD(y, column=i)
}
We also generate a multi-dimensional scaling (MDS) plot for the pseudo-bulk profiles. MDS arranges the points on the plot so that the distances between each pair of points correlate as well as possible with the dissimilarity between those two samples. The values on the two axes tell you nothing about the variables for a given sample - the plot is just a two-dimensional space in which to arrange the points. You can read it in a similar way to a PCA plot from bulk RNA-seq, as it allows us to visualize the structure of the data. The aim is to check whether samples separate by our known factors of interest. Strong separation foreshadows a large number of DEGs in the subsequent analysis.
y$samples$SampleGroup <- factor(y$samples$SampleGroup)
limma::plotMDS(cpm(y, log=TRUE), col = as.numeric(y$samples$SampleGroup))
Our aim is to test whether the log-fold change between sample groups is significantly different from zero.
design <- model.matrix(~factor(SampleGroup), y$samples)
design
## (Intercept) factor(SampleGroup)PBMMC
## ETV6-RUNX1_1 1 0
## ETV6-RUNX1_2 1 0
## ETV6-RUNX1_3 1 0
## ETV6-RUNX1_4 1 0
## PBMMC_1 1 1
## PBMMC_2 1 1
## PBMMC_3 1 1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$`factor(SampleGroup)`
## [1] "contr.treatment"
We estimate the negative binomial (NB) dispersions with estimateDisp(). The role of the NB dispersion is to model the mean-variance trend, which is not easily accommodated by QL dispersions alone due to the quadratic nature of the NB mean-variance trend.
y <- estimateDisp(y, design)
When a negative binomial model is fitted, we need to estimate the biological coefficient of variation (BCV) before we carry out the analysis; this is equivalent to estimating the dispersion(s) of the negative binomial model. The plot below shows the BCV for each gene as a function of average abundance. The BCV is computed as the square root of the NB dispersion after empirical Bayes shrinkage towards the trend. Trended and common BCV estimates are shown in blue and red, respectively.
plotBCV(y)
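As a quick numeric companion to the plot, the common BCV can be extracted directly from the fitted object (it is the square root of the common NB dispersion):
sqrt(y$common.dispersion)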
We also estimate the quasi-likelihood dispersions with glmQLFit() (Chen, Lun, and Smyth 2016). This fits a GLM to the counts for each gene and estimates the QL dispersion from the GLM deviance. We set robust=TRUE to avoid distortions from highly variable clusters (Phipson et al. 2016). The QL dispersion models the uncertainty and variability of the per-gene variance - which is not well handled by the NB dispersions, so the two dispersion types complement each other in the final analysis.
fit <- glmQLFit(y, design, robust=TRUE)
The plot below shows the QL dispersion estimates for each gene as a function of abundance. Raw estimates (black) are shrunk towards the trend (blue) to yield squeezed estimates (red).
plotQLDisp(fit)
We test for differences in expression due to sample group using glmQLFTest(). The QLF-test is preferred to a likelihood ratio test as it reflects the uncertainty in estimating the dispersion for each gene. DEGs are defined as those with non-zero log-fold changes at a false discovery rate of 5%. If very few genes are significantly DE that sample group has little effect on the transcriptome.
The coef argument specifies which contrast you want tested. It is the number of the column in the model matrix. Since we only have 2 columns in our model matrix (the intercept and PBMMC vs ETV6-RUNX1), we can specify this as 2.
results <- glmQLFTest(fit, coef = 2)
summary(decideTests(results))
## factor(SampleGroup)PBMMC
## Down 861
## NotSig 13238
## Up 455
We can get the results of our comparison using the topTags function. By default this function outputs the top 10 genes by FDR, but you can change this using the n argument.
topTags(results)
## Coefficient: factor(SampleGroup)PBMMC
## logFC logCPM F PValue FDR
## HTR1F 9.892341 6.763185 976.8976 3.324859e-12 4.839000e-08
## FXYD2 7.304759 5.244366 494.8288 1.388601e-10 1.010485e-06
## L3MBTL4 6.729266 4.951867 286.0786 2.706082e-09 1.312810e-05
## MIR4432HG 5.576494 6.324972 218.4894 1.144237e-08 4.163307e-05
## ADGRE2 6.036213 5.212321 187.4924 2.574754e-08 7.494595e-05
## KCNN1 -6.713691 7.045461 138.5996 1.252499e-07 2.216359e-04
## OVCH2 -12.552129 6.413688 209.5101 1.307651e-07 2.216359e-04
## MSR1 -6.662685 7.966067 135.6341 1.401065e-07 2.216359e-04
## FAM171B 9.310463 2.277968 134.5088 1.462816e-07 2.216359e-04
## MSH6 -2.973405 7.889308 133.4300 1.525055e-07 2.216359e-04
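For example, to retrieve the full results table as a data frame rather than just the top 10 genes, we can set n to Inf; the table remains ordered by significance:
all_res <- as.data.frame(topTags(results, n = Inf))
head(all_res)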
Now that we have laid out the theory underlying the DE analysis, we repeat this process for each of the labels. This is conveniently done using the pseudoBulkDGE function from scran, which will loop over all labels and apply the exact analysis described above to each label. To prepare for this, we filter out all sample-label combinations with insufficient cells.
summed.filt <- summed[,summed$ncells >= 20]
We construct a common design matrix that will be used in the analysis for each label. Recall that this matrix should have one row per unique sample (and named as such), reflecting the fact that we are modelling counts on the sample level instead of the cell level.
targets <- colData(sce)[!duplicated(sce$SampleName),] %>%
data.frame()
design <- model.matrix(~factor(SampleGroup), data=targets)
rownames(design) <- targets$SampleName
We then apply the pseudoBulkDGE function to obtain a list of DE genes for each label. This function puts some additional effort into automatically dealing with labels that are not represented in all sample groups, for which a DE analysis between conditions is meaningless, or are not represented in a sufficient number of replicate samples to enable modelling of biological variability.
summed.filt$SampleGroup <- factor(summed.filt$SampleGroup)
de.results <- pseudoBulkDGE(summed.filt,
label = summed.filt$label,
design = ~SampleGroup,
coef = "SampleGroupPBMMC",
condition = summed.filt$SampleName
)
de.results
## List of length 10
## names(10): 1 10 11 15 2 3 4 5 6 8
de.results[[1]]
## DataFrame with 33102 rows and 5 columns
## logFC logCPM F PValue FDR
## <numeric> <numeric> <numeric> <numeric> <numeric>
## MIR1302-2HG NA NA NA NA NA
## ENSG00000238009 NA NA NA NA NA
## ENSG00000239945 NA NA NA NA NA
## ENSG00000241860 1.51813 0.238342 0.990913 0.338749 0.591997
## ENSG00000286448 NA NA NA NA NA
## ... ... ... ... ... ...
## ENSG00000277856 NA NA NA NA NA
## ENSG00000275063 0.181890 0.2612687 0.00326667 0.955776 0.985897
## ENSG00000275869 NA NA NA NA NA
## ENSG00000276017 NA NA NA NA NA
## ENSG00000278817 0.567431 0.0642062 0.08622580 0.773940 0.898339
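Any labels that were skipped by pseudoBulkDGE (for example because they lacked replicates or samples from both conditions) should be recorded in the metadata of the returned list:
metadata(de.results)$failed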
We examine the numbers of DEGs at an FDR of 5% for each label using the decideTestsPerLabel function. Note that genes listed as NA were either filtered out as low-abundance genes for a given label's analysis, or the comparison of interest was not possible for a particular label, e.g., due to lack of residual degrees of freedom or an absence of samples from both conditions.
is.de <- decideTestsPerLabel(de.results, threshold=0.05)
summarizeTestsPerLabel(is.de)
## -1 0 1 NA
## 1 928 16301 413 15460
## 10 135 8724 97 24146
## 11 73 11205 32 21792
## 15 1 5505 6 27590
## 2 373 14377 185 18167
## 3 919 8406 1006 22771
## 4 188 10450 405 22059
## 5 127 4479 412 28084
## 6 77 9134 111 23780
## 8 37 8227 53 24785
For each gene, we compute the percentage of labels in which that gene is upregulated or downregulated. (Here, we consider a gene to be non-DE if it is not retained after filtering.)
up.de <- is.de > 0 & !is.na(is.de)
head(sort(rowMeans(up.de), decreasing=TRUE), 10)
## LYZ S100A9 S100A8 SIK3 AFF3 JARID2 DYRK1A PATJ LRBA RAPGEF2
## 0.9 0.8 0.7 0.7 0.6 0.6 0.6 0.5 0.5 0.5
down.de <- is.de < 0 & !is.na(is.de)
head(sort(rowMeans(down.de), decreasing=TRUE), 10)
## MDK TERF2 LINC03000 SOCS2 RPS4Y2
## 1.0 0.8 0.7 0.7 0.7
## HLA-C TAP1 ENSG00000227706 CCN2 MYO1G
## 0.6 0.6 0.6 0.6 0.6
We further identify label-specific DE genes that are significant in our label of interest yet not DE in any other label. As hypothesis tests are not typically geared towards identifying genes that are not DE, we use an ad hoc approach where we consider a gene to be consistent with the null hypothesis for a label if it fails to be detected even at a generous FDR threshold of 50%.
remotely.de <- decideTestsPerLabel(de.results, threshold=0.5)
not.de <- remotely.de==0 | is.na(remotely.de)
cx <- "1"
other.labels <- setdiff(colnames(not.de), cx)
unique.degs <- is.de[,cx]!=0 & rowMeans(not.de[,other.labels])==1
unique.degs <- names(which(unique.degs))
head(unique.degs)
## [1] "ENSG00000228037" "C1QA" "ENSG00000286838" "ENSG00000223881"
## [5] "ENSG00000228613" "TEX41"
Another thing we may be interested in is inspecting the top significantly differentially expressed gene. To avoid an error caused by negative size factors, and since we don't need them here, we remove the size factors from the summed and filtered single cell object summed.filt.
top_gene <- rownames(de.results[[1]][order(de.results[[1]]$FDR),])[1]
sizeFactors(summed.filt) <- NULL
plotExpression(logNormCounts(summed.filt),
features=top_gene,
x="SampleName", colour_by="SampleName",
other_fields="label") +
facet_wrap(~label) +
ggtitle(top_gene)
See the accompanying worksheet for this session, which includes the code below for the exercise. We want to achieve the following:
Rerun the differential expression analysis using the PRE-T and HHD samples.
Determine which cluster has the most DEGs.
Determine which genes are significantly DE in that cluster and not in any other.
# First load in the other two sample groups
sce_PRET_HHD <- readRDS("R_objects/Caron_clustered.PRETandHHD.rds")
# replace the ensembl IDs with gene symbols where possible
rownames(sce_PRET_HHD) <- uniquifyFeatureNames(rownames(sce_PRET_HHD), rowData(sce_PRET_HHD)$Symbol)
# check your sce object
sce_PRET_HHD
# Part A
# Re run the analysis using these new samples.
FIXME
# Part B
# Looking at your results, which cluster has the most DEGs?
FIXME
# Part C
# Which genes are sig. DE in that cluster and not the others?
FIXME
You can copy and paste from the previous code and just change which object is used.
You will need to run pseudoBulkDGE to get the results for all the clusters at once.
You can copy and paste this code from earlier too, but remember to replace the number of the cluster with the one you found in Part B.
Here is the complete script:
# First load in the other two sample groups
sce_PRET_HHD <- readRDS("R_objects/Caron_clustered.PRETandHHD.rds")
# replace the ensembl IDs with gene symbols where possible
rownames(sce_PRET_HHD) <- uniquifyFeatureNames(rownames(sce_PRET_HHD), rowData(sce_PRET_HHD)$Symbol)
# check your sce object
sce_PRET_HHD
# Part A
# Re run the analysis using these new samples.
summed_PRET_HHD <- aggregateAcrossCells(sce_PRET_HHD,
id=colData(sce_PRET_HHD)[,c("label", "SampleName")])
summed_PRET_HHD
summed_PRET_HHD.filt <- summed_PRET_HHD[,summed_PRET_HHD$ncells >= 20]
targets_PRET_HHD <- colData(sce_PRET_HHD)[!duplicated(sce_PRET_HHD$SampleName),] %>%
data.frame()
design_PRET_HHD <- model.matrix(~factor(SampleGroup), data=targets_PRET_HHD)
rownames(design_PRET_HHD) <- targets_PRET_HHD$SampleName
summed_PRET_HHD.filt$SampleGroup <- factor(summed_PRET_HHD.filt$SampleGroup)
de.results_PRET_HHD <- pseudoBulkDGE(summed_PRET_HHD.filt,
label = summed_PRET_HHD.filt$label,
design = ~SampleGroup,
coef = "SampleGroupPRE-T",
condition = summed_PRET_HHD.filt$SampleName)
# Part B
# Looking at your results, which cluster has the most DEGs?
is.de_PRET_HHD <- decideTestsPerLabel(de.results_PRET_HHD, threshold=0.05)
summarizeTestsPerLabel(is.de_PRET_HHD)
# Part C
# Which genes are sig. DE in that cluster and not the others?
remotely.de_PRET_HHD <- decideTestsPerLabel(de.results_PRET_HHD, threshold=0.5)
not.de_PRET_HHD <- remotely.de_PRET_HHD==0 | is.na(remotely.de_PRET_HHD)
cz <- "2"
other.labels_PRET_HHD <- setdiff(colnames(not.de_PRET_HHD), cz)
unique.degs_PRET_HHD <- is.de_PRET_HHD[,cz]!=0 & rowMeans(not.de_PRET_HHD[,other.labels_PRET_HHD])==1
unique.degs_PRET_HHD <- names(which(unique.degs_PRET_HHD))
head(unique.degs_PRET_HHD)
The above example of differential expression focused on testing for differences in expression between conditions for the same cell type or label. However, the same methodology can be applied to test for differences between cell types across samples. This kind of DE analysis can overcome a lack of suitable replication.
We use a dummy value for the label to allow us to include multiple cell types in the fitted model; otherwise, each cell type would be processed separately.
summed.sub <- summed[,summed$label %in% c("3", "4")]
between.res <- pseudoBulkDGE(summed.sub,
label=rep("dummy", ncol(summed.sub)),
design=~factor(SampleName) + factor(label),
coef="factor(label)4")[[1]]
table(Sig=between.res$FDR <= 0.05, Sign=sign(between.res$logFC))
## Sign
## Sig -1 1
## FALSE 2409 2508
## TRUE 2173 2101
between.res[order(between.res$FDR),]
## DataFrame with 33102 rows and 5 columns
## logFC logCPM F PValue FDR
## <numeric> <numeric> <numeric> <numeric> <numeric>
## MEF2C -4.92038 8.26003 637.089 3.84693e-17 2.52228e-13
## ITK 8.35509 6.72293 615.125 5.48859e-17 2.52228e-13
## PRSS57 -7.77942 6.73196 554.816 1.55744e-16 4.77149e-13
## CD8B 7.82809 6.64907 531.031 2.42250e-16 4.77783e-13
## CCR7 6.63218 7.29159 527.332 2.59919e-16 4.77783e-13
## ... ... ... ... ... ...
## ENSG00000277856 NA NA NA NA NA
## ENSG00000275063 NA NA NA NA NA
## ENSG00000275869 NA NA NA NA NA
## ENSG00000276017 NA NA NA NA NA
## ENSG00000278817 NA NA NA NA NA
summed.sub <- logNormCounts(summed.sub, size.factors=NULL)
plotExpression(summed.sub,
features=head(rownames(between.res)[order(between.res$FDR)]),
x="label",
colour_by=I(factor(summed.sub$SampleName)))
Whether or not this is a scientifically meaningful comparison depends on the nature of the labels. These particular labels were defined by clustering, which means that the presence of DEGs is a foregone conclusion. Nonetheless, it may have some utility for applications where the labels are defined using independent information, e.g., from FACS.
In a DA analysis, we test for significant changes in per-label cell abundance across conditions. This will reveal which cell types are depleted or enriched upon treatment, which is arguably just as interesting as changes in expression within each cell type. The DA analysis has a long history in flow cytometry (Finak et al. 2014; Lun, Richard, and Marioni 2017) where it is routinely used to examine the effects of different conditions on the composition of complex cell populations. By performing it here, we effectively treat scRNA-seq as a “super-FACS” technology for defining relevant subpopulations using the entire transcriptome.
In the past we have demonstrated how to do this by using edgeR on a matrix of samples and clusters, similar to the pseudo-samples, but where the 'counts' are cell numbers. Recently, the same group responsible for the rest of the scater/scran toolkit published miloR, which has some advantages over using edgeR directly in this way.
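For reference, a minimal sketch of that earlier "cell counts into edgeR" approach (following the OSCA differential abundance workflow) is shown below; it is included only for comparison, as the rest of this section uses Milo.
# The 'counts' here are numbers of cells per cluster (label) and sample
abundances <- unclass(table(sce$label, sce$SampleName))
extra.info <- colData(sce)[match(colnames(abundances), sce$SampleName), ]
y.ab <- DGEList(abundances, samples = extra.info)
design.ab <- model.matrix(~factor(SampleGroup), y.ab$samples)
y.ab <- estimateDisp(y.ab, design.ab, trend.method = "none")
fit.ab <- glmQLFit(y.ab, design.ab, robust = TRUE, abundance.trend = FALSE)
res.ab <- glmQLFTest(fit.ab, coef = 2)
topTags(res.ab)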
MiloR uses methods based on [cydar](https://www.nature.com/articles/nmeth.4295), which was designed for mass cytometry data, although it does use the negative binomial GLM implementation from the edgeR package.
While differential abundance (DA) is commonly quantified in discrete cell clusters, Milo uses partially overlapping neighbourhoods of cells on a KNN graph, which means it isn't dependent on our clustering results. Starting from a graph that faithfully recapitulates the biology of the cell population, a Milo analysis consists of three steps: defining representative neighbourhoods on the KNN graph, testing each neighbourhood for differential abundance with a negative binomial GLM, and correcting for multiple testing across overlapping neighbourhoods using a weighted (spatial) FDR.
The first step is to turn our single cell object into a Milo specific object. This is very similar but includes extra slots where we can store the neighbourhood information we generate.
milo <- Milo(sce)
milo
## class: Milo
## dim: 33102 30461
## metadata(1): Samples
## assays(3): counts logcounts reconstructed
## rownames(33102): MIR1302-2HG ENSG00000238009 ... ENSG00000276017
## ENSG00000278817
## rowData names(4): ID Symbol Type Chromosome
## colnames(30461): 1_AAACCTGAGACTTTCG-1 1_AAACCTGGTCTTCAAG-1 ...
## 11_TTTGTCATCAGTTGAC-1 11_TTTGTCATCTCGTTTA-1
## colData names(13): Sample Barcode ... k.60_cluster.fun.leiden label
## reducedDimNames(3): corrected TSNE_corrected UMAP_corrected
## mainExpName: NULL
## altExpNames(0):
## nhoods dimensions(2): 1 1
## nhoodCounts dimensions(2): 1 1
## nhoodDistances dimension(1): 0
## graph names(0):
## nhoodIndex names(1): 0
## nhoodExpression dimension(2): 1 1
## nhoodReducedDim names(0):
## nhoodGraph names(0):
## nhoodAdjacency dimension(2): 1 1
We need to add the KNN graph to the Milo object. This is stored in the graph slot, in igraph format. The miloR package includes functionality to build and store the graph from the PCA dimensions stored in the reducedDim slot. Instead of the standard PCA we will use our MNN batch corrected PCA dimensions. The two important parameters to set here are k, the number of nearest neighbours for graph building, and d, the number of reduced dimensions to use.
milo <- buildGraph(milo, k = 60, d = 30, reduced.dim = "corrected", BPPARAM = MulticoreParam(7))
We define the neighbourhood of a cell, the index, as the group of cells connected to the index cell by an edge in the KNN graph. For efficiency, we don't test for DA in the neighbourhood of every cell; instead we sample as indices a subset of representative cells, using the KNN sampling algorithm of [Gut et al. 2015](https://www.nature.com/articles/nmeth.3545).
For sampling you need to define a few parameters:
prop: the proportion of cells to randomly sample to start with (usually 0.1 - 0.2 is sufficient)
k: the k to use for KNN refinement (we recommend using the same k used for KNN graph building)
d: the number of reduced dimensions to use for KNN refinement (we recommend using the same d used for KNN graph building)
refined: indicates whether you want to use the sampling refinement algorithm, or just pick cells at random. The default and recommended way to go is to use refinement. The only situation in which you might consider using random instead is if you have batch corrected your data with a graph-based correction algorithm, such as BBKNN, but the results of DA testing will be suboptimal.
milo <- makeNhoods(milo, prop = 0.1, k = 60, d=30, refined = TRUE, reduced_dims = "corrected")
Once we have defined neighbourhoods, it’s good to take a look at how big the neighbourhoods are (i.e. how many cells form each neighbourhood). This affects the power of DA testing. We can check this out using the plotNhoodSizeHist function.
The authors of Milo have given several different rules of thumb for deciding whether your histogram looks right: either 'peaking above 20', 'peaking between 50 and 100' or 'an average neighbourhood size over 5 x N_samples'. The tool is still in development. Realistically all of these statements give numbers in the same ballpark, so we can make a judgement on our data. If our histogram falls outside this range we might consider rerunning makeNhoods with an increased k and/or prop.
plotNhoodSizeHist(milo)
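As a quick numeric check alongside the histogram, we can compare the mean neighbourhood size with the '5 x number of samples' rule of thumb mentioned above:
mean(colSums(nhoods(milo)))
5 * length(unique(milo$SampleName))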
Milo leverages the variation in cell numbers between replicates for the same experimental condition to test for differential abundance. Therefore we have to count how many cells from each sample are in each neighbourhood. We need to use the cell metadata and specify which column contains the sample information.
This adds to the Milo object a n×m matrix, where n is the number of neighbourhoods and m is the number of experimental samples. Values indicate the number of cells from each sample counted in a neighbourhood. This count matrix will be used for DA testing.
milo <- countCells(milo, meta.data = data.frame(colData(milo)), sample="SampleName")
head(nhoodCounts(milo))
## 6 x 7 sparse Matrix of class "dgCMatrix"
## ETV6-RUNX1_1 ETV6-RUNX1_2 ETV6-RUNX1_3 ETV6-RUNX1_4 PBMMC_1 PBMMC_2 PBMMC_3
## 1 7 82 4 21 . . .
## 2 . 2 4 3 2 67 53
## 3 . . . . 13 38 43
## 4 39 36 40 132 2 1 6
## 5 . . 4 1 1 53 10
## 6 6 248 2 44 . . 2
Now we are all set to test for differential abundance in neighbourhoods. We implement this hypothesis testing in a generalized linear model (GLM) framework, specifically using the Negative Binomial GLM implementation in edgeR.
We first need to think about our experimental design. The design matrix should match each sample to the experimental condition of interest for DA testing. In this case, we want to detect DA between our PBMMC and ETV6-RUNX1 sample groups. We could also include a batch column in the design matrix at this point but to keep the demonstration simple we will skip this. We would normally do this as we know there is a batch effect that we want to account for in DA testing.
milo_design <- data.frame(colData(milo))[,c("SampleName", "SampleGroup")]
milo_design <- distinct(milo_design)
rownames(milo_design) <- milo_design$SampleName
milo_design
## SampleName SampleGroup
## ETV6-RUNX1_1 ETV6-RUNX1_1 ETV6-RUNX1
## ETV6-RUNX1_2 ETV6-RUNX1_2 ETV6-RUNX1
## ETV6-RUNX1_3 ETV6-RUNX1_3 ETV6-RUNX1
## ETV6-RUNX1_4 ETV6-RUNX1_4 ETV6-RUNX1
## PBMMC_1 PBMMC_1 PBMMC
## PBMMC_2 PBMMC_2 PBMMC
## PBMMC_3 PBMMC_3 PBMMC
Milo uses an adaptation of the Spatial FDR correction introduced by cydar, where we correct p-values accounting for the amount of overlap between neighbourhoods. Specifically, each hypothesis test P-value is weighted by the reciprocal of the kth nearest neighbour distance. To use this statistic we first need to store the distances between nearest neighbors in the Milo object. This is done by the calcNhoodDistance function (N.B. this step is the most time consuming of the analysis workflow and might take a couple of minutes for large datasets).
It is important to note that the value for d here should match what was used for constructing the KNN graph, and that if no reduced dimension is specified the function will default to PCA.
milo <- calcNhoodDistance(milo, d=30, reduced.dim = "corrected")
Now we can do the DA test, explicitly defining our experimental
design. In this case as discussed we will test the difference between
sample groups. The testNhoods
function calculates a
Fold-change and corrected P-value for each neighbourhood, which
indicates whether there is significant differential abundance between
sample groups.
da_results <- testNhoods(milo, design = ~ SampleGroup, design.df = milo_design, reduced.dim = "corrected")
da_results %>%
arrange(SpatialFDR) %>%
head()
## logFC logCPM F PValue FDR Nhood SpatialFDR
## 98 10.27673 10.49514 22.82455 1.800848e-06 0.0001007465 98 0.0001204805
## 108 10.55687 10.77641 23.86867 1.047845e-06 0.0001007465 108 0.0001204805
## 129 10.78087 11.00341 24.57963 7.250755e-07 0.0001007465 129 0.0001204805
## 203 10.19692 10.41833 22.52467 2.104251e-06 0.0001007465 203 0.0001204805
## 230 10.45652 10.67320 23.52392 1.252870e-06 0.0001007465 230 0.0001204805
## 234 10.32545 10.54697 23.02045 1.626749e-06 0.0001007465 234 0.0001204805
We can start inspecting the results of our DA analysis with a couple of standard diagnostic plots. We first inspect the distribution of uncorrected P values, to verify that the test was balanced. We expect a peak at the very low extreme and a tail towards 1. This blog [article](http://varianceexplained.org/statistics/interpreting-pvalue-histogram/) explains the different p-value histogram profiles and what they can mean.
ggplot(da_results, aes(PValue)) + geom_histogram(bins=50)
Then we visualize the test results with a volcano plot (remember that each point here represents a neighbourhood, not a cell).
We have marked a significance threshold of 10% FDR.
ggplot(da_results, aes(logFC, -log10(SpatialFDR))) +
geom_point() +
geom_hline(yintercept = 1)
It looks like we have detected several neighbourhoods where there is a significant difference in cell abundance between sample groups.
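We can also count how many neighbourhoods pass the 10% spatial FDR threshold marked in the plot:
table(da_results$SpatialFDR < 0.1)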
To visualize DA results relating them to the embedding of single cells, we can build an abstracted graph of neighbourhoods that we can superimpose on the single-cell embedding.
milo <- buildNhoodGraph(milo)
In a normal analysis it is assumed that by this stage we would have divided our cells in a biologically meaningful way, be that using the cluster markers to define cell types or another method, so that we could use these annotations as an overlay to help us make sense of the downstream functionality Milo supplies. In this demonstration we have not conclusively done this, so we will just use our cluster numbers as a placeholder.
Here each node represents a neighbourhood, while edges indicate how many cells two neighbourhoods have in common. The layout of nodes is determined by the position of the index cell in the UMAP embedding of all single cells. Neighbourhoods displaying significant DA are coloured by their log-fold change.
umap_plot <- plotReducedDim(milo, dimred = "UMAP_corrected", colour_by="label", text_by = "label")
nh_graph_plot <- plotNhoodGraphDA(milo, da_results, layout="UMAP_corrected",alpha=0.05)
umap_plot + nh_graph_plot +
plot_layout(guides="collect")
We might also be interested in visualizing whether DA is particularly evident in certain labels (or celltypes). To do this, we assign a label to each neighbourhood by finding the most abundant label within cells in each neighbourhood. We can label neighbourhoods in the results data.frame using the function annotateNhoods. This also saves the fraction of cells harboring the label.
da_results <- annotateNhoods(milo, da_results, coldata_col = "label")
head(da_results)
## logFC logCPM F PValue FDR Nhood SpatialFDR
## 1 -7.083725 8.050131 7.424013 6.447434e-03 0.024110230 1 0.0264054327
## 2 5.430335 9.545450 9.270190 2.335375e-03 0.013660535 2 0.0151427104
## 3 9.376783 9.622672 19.382107 1.081494e-05 0.000173652 3 0.0002169261
## 4 -2.183328 9.398469 1.636309 2.008632e-01 0.270886311 4 0.2831046750
## 5 5.019405 8.600057 9.895556 1.661787e-03 0.011484661 5 0.0129657438
## 6 -4.634793 9.201352 5.788311 1.615165e-02 0.042623319 6 0.0460461805
## label label_fraction
## 1 1 1.0000000
## 2 8 1.0000000
## 3 11 1.0000000
## 4 2 0.9921875
## 5 10 1.0000000
## 6 1 1.0000000
We can plot the distribution of label_fraction to see how homogeneous the neighbourhoods are within our labels. We could take this as a measure of how similar our clustering is to the KNN neighbourhoods defined by Milo.
ggplot(da_results, aes(label_fraction)) + geom_histogram(bins=50)
While neighbourhoods tend to be homogeneous, we can define a threshold for label_fraction to exclude neighbourhoods that are a mix of cell types.
da_results$label <- ifelse(da_results$label_fraction < 0.7, "Mixed", da_results$label)
head(da_results)
## logFC logCPM F PValue FDR Nhood SpatialFDR
## 1 -7.083725 8.050131 7.424013 6.447434e-03 0.024110230 1 0.0264054327
## 2 5.430335 9.545450 9.270190 2.335375e-03 0.013660535 2 0.0151427104
## 3 9.376783 9.622672 19.382107 1.081494e-05 0.000173652 3 0.0002169261
## 4 -2.183328 9.398469 1.636309 2.008632e-01 0.270886311 4 0.2831046750
## 5 5.019405 8.600057 9.895556 1.661787e-03 0.011484661 5 0.0129657438
## 6 -4.634793 9.201352 5.788311 1.615165e-02 0.042623319 6 0.0460461805
## label label_fraction
## 1 1 1.0000000
## 2 8 1.0000000
## 3 11 1.0000000
## 4 2 0.9921875
## 5 10 1.0000000
## 6 1 1.0000000
Now we can visualise the distribution of DA Fold Changes in different labels.
plotDAbeeswarm(da_results, group.by = "label")
This is already quite informative: we can see that several clusters are enriched and some are depleted between our sample groups. There are not many DA neighbourhoods with a mixed label either.
Once you have found your neighbourhoods showing significant DA between conditions, you might want to find gene signatures specific to the cells in those neighbourhoods. The function findNhoodGroupMarkers runs a one-VS-all differential gene expression test to identify marker genes for a group of neighbourhoods of interest. Before running this function you will need to define your neighbourhood groups depending on your biological question. This needs to be stored as a NhoodGroup column in the da_results data.frame.
In a case where all the DA neighbourhoods seem to belong to the same region of the graph, you might just want to test the significant DA neighbourhoods with the same logFC against all the rest. Here is a demonstration using the first 10 genes in the object.
da_results$NhoodGroup <- as.numeric(da_results$SpatialFDR < 0.1 & da_results$logFC < 0)
da_nhood_markers <- findNhoodGroupMarkers(milo, da_results, subset.row = rownames(milo)[1:10])
head(da_nhood_markers)
## GeneID logFC_1 adj.P.Val_1 logFC_0 adj.P.Val_0
## 1 ENSG00000229905 -7.225889e-05 1 7.225889e-05 1
## 2 ENSG00000235146 0.000000e+00 1 0.000000e+00 1
## 3 ENSG00000238009 -9.575701e-04 1 9.575701e-04 1
## 4 ENSG00000239945 0.000000e+00 1 0.000000e+00 1
## 5 ENSG00000241860 2.215696e-04 1 -2.215696e-04 1
## 6 ENSG00000286448 -9.739554e-05 1 9.739554e-05 1
For this analysis we recommend aggregating the neighbourhood expression profiles by experimental samples (the same used for DA testing), by setting aggregate.samples=TRUE. This way single-cells will not be considered as “replicates” during DGE testing, and dispersion will be estimated between true biological replicates. Like so:
da_nhood_markers <- findNhoodGroupMarkers(milo, da_results, subset.row = rownames(milo)[1:10],
aggregate.samples = TRUE, sample_col = "SampleName")
head(da_nhood_markers)
## GeneID logFC_1 adj.P.Val_1 logFC_0 adj.P.Val_0
## 1 ENSG00000229905 -2.043428e-04 1 2.043428e-04 1
## 2 ENSG00000235146 0.000000e+00 1 0.000000e+00 1
## 3 ENSG00000238009 5.712642e-04 1 -5.712642e-04 1
## 4 ENSG00000239945 0.000000e+00 1 0.000000e+00 1
## 5 ENSG00000241860 1.047345e-03 1 -1.047345e-03 1
## 6 ENSG00000286448 -1.355253e-20 1 1.355253e-20 1
In many cases, such as this example, DA neighbourhoods are found in different areas of the KNN graph, and grouping together all significant DA populations might not be ideal, as they might include cells of very different cell types. For this kind of scenario, miloR provides the groupNhoods function, which uses community detection to partition neighbourhoods into groups on the basis of (1) the number of shared cells between 2 neighbourhoods; (2) the direction of fold-change for DA neighbourhoods; (3) the difference in fold change. This is done using the Louvain community detection algorithm.
The max.lfc.delta argument here determines the absolute difference in log fold change below which neighbourhoods should not be considered adjacent. The overlap argument determines the number of cells that must overlap between adjacent neighbourhoods for merging and there is a default FDR of 0.1 for determining which neighbourhoods are considered DA.
da_results <- groupNhoods(milo, da_results, max.lfc.delta = 10, overlap = 1)
head(da_results)
## logFC logCPM F PValue FDR Nhood SpatialFDR
## 1 -7.083725 8.050131 7.424013 6.447434e-03 0.024110230 1 0.0264054327
## 2 5.430335 9.545450 9.270190 2.335375e-03 0.013660535 2 0.0151427104
## 3 9.376783 9.622672 19.382107 1.081494e-05 0.000173652 3 0.0002169261
## 4 -2.183328 9.398469 1.636309 2.008632e-01 0.270886311 4 0.2831046750
## 5 5.019405 8.600057 9.895556 1.661787e-03 0.011484661 5 0.0129657438
## 6 -4.634793 9.201352 5.788311 1.615165e-02 0.042623319 6 0.0460461805
## label label_fraction NhoodGroup
## 1 1 1.0000000 1
## 2 8 1.0000000 2
## 3 11 1.0000000 3
## 4 2 0.9921875 4
## 5 10 1.0000000 5
## 6 1 1.0000000 1
We can view this on a UMAP.
plotNhoodGroups(milo, da_results, layout="UMAP_corrected")
And on a bee swarm plot.
plotDAbeeswarm(da_results, "NhoodGroup")
See the accompanying worksheet for this session, which includes the code below for the exercise. We want to achieve the following:
Alter the values of max.lfc.delta and overlap to see how the grouping is affected.
set.seed(42) # set your random seed for reproducibility
# Part A
# rerun the grouping with values for max.lfc.delta of 1, 5, 25 and plot beeswarm for each
plotDAbeeswarm(groupNhoods(milo, da_results, max.lfc.delta = FIXME) , group.by = "NhoodGroup") + ggtitle("max LFC delta=1")
# Part B
# Using the max.lfc.delta you preferred from Part A now alter the overlap value trying 1, 3, 10.
FIXME
Replace the max.lfc.delta value to get the new beeswarm plot. Overlap is defaulted to 1 so we can just leave that for now.
Add the overlap argument into the groupNhoods function and try the different values suggested.
Here is the complete script:
# Try different parameter values for groupNhoods
set.seed(42) # set your random seed for reproducibility
# Part A
# rerun the grouping with values for max.lfc.delta of 1, 5, 25 and plot beeswarm for each
plotDAbeeswarm(groupNhoods(milo, da_results, max.lfc.delta = 1) , group.by = "NhoodGroup") + ggtitle("max LFC delta=1")
plotDAbeeswarm(groupNhoods(milo, da_results, max.lfc.delta = 5) , group.by = "NhoodGroup") + ggtitle("max LFC delta=5")
plotDAbeeswarm(groupNhoods(milo, da_results, max.lfc.delta = 25) , group.by = "NhoodGroup") + ggtitle("max LFC delta=25")
# Part B
# Using the max.lfc.delta you preferred from Part A now alter the overlap value trying 1, 3, 10.
plotDAbeeswarm(groupNhoods(milo, da_results, max.lfc.delta = 5, overlap = 1) , group.by = "NhoodGroup") + ggtitle("overlap=1")
plotDAbeeswarm(groupNhoods(milo, da_results, max.lfc.delta = 5, overlap = 3) , group.by = "NhoodGroup") + ggtitle("overlap=3")
plotDAbeeswarm(groupNhoods(milo, da_results, max.lfc.delta = 5, overlap = 10) , group.by = "NhoodGroup") + ggtitle("overlap=10")
# The final code
set.seed(42)
da_results <- groupNhoods(milo, da_results, max.lfc.delta = 5, overlap=1)
plotNhoodGroups(milo, da_results, layout="UMAP_corrected")
plotDAbeeswarm(da_results, group.by = "NhoodGroup")
Once we have grouped neighbourhoods using groupNhoods we are all set to identify gene signatures between neighbourhood groups.
Let’s restrict the testing to highly variable genes in this case.
set.seed(101)
dec <- modelGeneVar(milo)
hvgs <- getTopHVGs(dec, n=2000)
head(hvgs)
## [1] "HBB" "HBA2" "HBA1" "HBD" "HBM" "AHSP"
We run findNhoodGroupMarkers to test for one-vs-all differential gene expression for each neighbourhood group.
set.seed(42)
nhood_markers <- findNhoodGroupMarkers(milo, da_results, subset.row = hvgs,
aggregate.samples = TRUE, sample_col = "SampleName")
head(nhood_markers)
## GeneID logFC_1 adj.P.Val_1 logFC_2 adj.P.Val_2 logFC_3
## 1 AAK1 -0.04947289 0.8252759 0.71919526 6.423608e-12 0.08829380
## 2 ABCB10 -0.19780960 0.6779130 -0.19317568 5.931055e-01 -0.22045817
## 3 ABCB6 -0.09685217 0.6779130 -0.08422366 6.306008e-01 -0.09934854
## 4 ABCB7 -0.00340844 0.9820320 -0.08121156 6.383011e-01 -0.19252155
## 5 ABCC4 -0.16426694 0.6779130 -0.10213583 7.222900e-01 -0.16297673
## 6 ABCG2 -0.05194183 0.8202183 -0.07239040 7.342526e-01 0.01626998
## adj.P.Val_3 logFC_4 adj.P.Val_4 logFC_5 adj.P.Val_5 logFC_6
## 1 0.5888909 -0.06037954 0.8874002 -0.2172419 1.294468e-01 -0.0829621623
## 2 0.4263990 -0.17004678 0.8769033 1.0632392 1.088709e-11 -0.1704196153
## 3 0.4842964 -0.08332507 0.8769033 0.3390345 7.411499e-04 -0.0798352219
## 4 0.1671680 -0.02081495 0.9467981 0.1411848 2.237871e-01 0.0050901796
## 5 0.5035200 -0.12338890 0.8769607 0.2986599 1.372684e-01 -0.0003324888
## 6 0.9273264 -0.02211144 0.9589972 0.2266879 1.264012e-01 -0.0408581339
## adj.P.Val_6 logFC_7 adj.P.Val_7 logFC_8 adj.P.Val_8 logFC_9
## 1 0.9703971 -0.2258893 9.254174e-02 -0.007415208 0.9649177 -0.20718712
## 2 0.9703971 0.6374723 4.940311e-04 -0.202749072 0.6703971 -0.21383212
## 3 0.9703971 0.4585965 5.223759e-07 -0.101685252 0.6703971 -0.09736909
## 4 0.9989145 0.2411565 1.955074e-02 0.072159800 0.6804150 0.01634036
## 5 0.9991790 0.7858936 9.551390e-07 -0.174374418 0.6703971 -0.11740202
## 6 0.9789949 0.4935845 6.540697e-05 -0.057816692 0.7717426 -0.11456751
## adj.P.Val_9 logFC_10 adj.P.Val_10 logFC_11 adj.P.Val_11
## 1 0.4184316 -0.08672881 0.8820868 0.09713615 0.61417778
## 2 0.5766622 -0.13569569 0.8820868 -0.29526848 0.29424016
## 3 0.6043124 -0.09320812 0.8820868 -0.11432649 0.45215231
## 4 0.9111764 -0.03492725 0.9251699 -0.17400203 0.24569760
## 5 0.6635902 -0.10677594 0.8867533 -0.18597552 0.48119712
## 6 0.6084486 -0.16794330 0.8820868 -0.31624138 0.09204369
We can get the markers that define a specific group.
gr9_markers <- nhood_markers[c("logFC_9", "adj.P.Val_9","GeneID")]
colnames(gr9_markers) <- c("logFC", "adj.P.Val", "GeneID")
head(gr9_markers[order(gr9_markers$adj.P.Val), ])
## logFC adj.P.Val GeneID
## 652 1.0104122 4.330811e-32 FCRL1
## 60 1.2608281 5.256457e-30 ANGPTL1
## 645 0.3957021 8.416153e-29 FCER2
## 998 0.3409898 1.904284e-28 LINC02397
## 857 1.2538224 2.259114e-27 IGHD
## 191 0.2942973 3.985596e-26 BTLA
We can start to do many visualisations and further investigations of the data using these groupings. More details are available in the Milo documentation.
markers <- nhood_markers$GeneID[nhood_markers$adj.P.Val_9 < 0.01
& nhood_markers$logFC_9 > 0]
plotNhoodExpressionGroups(milo, da_results, features=intersect(rownames(milo), markers[1:10]),
subset.nhoods = da_results$NhoodGroup %in% c('6','9'),
scale=TRUE,
grid.space = "fixed")
sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur ... 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] miloR_1.4.0 BiocParallel_1.30.3
## [3] bluster_1.6.0 DT_0.25
## [5] patchwork_1.1.2 forcats_0.5.2
## [7] stringr_1.4.1 dplyr_1.0.10
## [9] purrr_0.3.4 readr_2.1.2
## [11] tidyr_1.2.1 tibble_3.1.8
## [13] tidyverse_1.3.2 edgeR_3.38.4
## [15] limma_3.52.3 batchelor_1.12.3
## [17] scran_1.24.1 scater_1.24.0
## [19] ggplot2_3.3.6 scuttle_1.6.3
## [21] SingleCellExperiment_1.18.0 SummarizedExperiment_1.26.1
## [23] Biobase_2.56.0 GenomicRanges_1.48.0
## [25] GenomeInfoDb_1.32.4 IRanges_2.30.1
## [27] S4Vectors_0.34.0 BiocGenerics_0.42.0
## [29] MatrixGenerics_1.8.1 matrixStats_0.62.0
##
## loaded via a namespace (and not attached):
## [1] googledrive_2.0.0 ggbeeswarm_0.6.0
## [3] colorspace_2.0-3 ellipsis_0.3.2
## [5] XVector_0.36.0 BiocNeighbors_1.14.0
## [7] fs_1.5.2 rstudioapi_0.14
## [9] farver_2.1.1 graphlayouts_0.8.1
## [11] ggrepel_0.9.1 fansi_1.0.3
## [13] lubridate_1.8.0 xml2_1.3.3
## [15] splines_4.2.1 codetools_0.2-18
## [17] sparseMatrixStats_1.8.0 cachem_1.0.6
## [19] knitr_1.40 polyclip_1.10-0
## [21] jsonlite_1.8.0 ResidualMatrix_1.6.1
## [23] broom_1.0.1 cluster_2.1.4
## [25] dbplyr_2.2.1 ggforce_0.3.4
## [27] compiler_4.2.1 httr_1.4.4
## [29] dqrng_0.3.0 backports_1.4.1
## [31] assertthat_0.2.1 Matrix_1.5-1
## [33] fastmap_1.1.0 gargle_1.2.1
## [35] cli_3.4.0 tweenr_2.0.2
## [37] BiocSingular_1.12.0 htmltools_0.5.3
## [39] tools_4.2.1 rsvd_1.0.5
## [41] igraph_1.3.4 gtable_0.3.1
## [43] glue_1.6.2 GenomeInfoDbData_1.2.8
## [45] Rcpp_1.0.9 cellranger_1.1.0
## [47] jquerylib_0.1.4 vctrs_0.4.1
## [49] DelayedMatrixStats_1.18.0 ggraph_2.0.6
## [51] xfun_0.33 beachmat_2.12.0
## [53] rvest_1.0.3 lifecycle_1.0.2
## [55] irlba_2.3.5 gtools_3.9.3
## [57] statmod_1.4.37 googlesheets4_1.0.1
## [59] MASS_7.3-58.1 zlibbioc_1.42.0
## [61] scales_1.2.1 tidygraph_1.2.2
## [63] hms_1.1.2 parallel_4.2.1
## [65] RColorBrewer_1.1-3 yaml_2.3.5
## [67] gridExtra_2.3 sass_0.4.2
## [69] stringi_1.7.8 highr_0.9
## [71] ScaledMatrix_1.4.1 rlang_1.0.5
## [73] pkgconfig_2.0.3 bitops_1.0-7
## [75] evaluate_0.16 lattice_0.20-45
## [77] labeling_0.4.2 htmlwidgets_1.5.4
## [79] cowplot_1.1.1 tidyselect_1.1.2
## [81] magrittr_2.0.3 R6_2.5.1
## [83] generics_0.1.3 metapod_1.4.0
## [85] DelayedArray_0.22.0 DBI_1.1.3
## [87] pillar_1.8.1 haven_2.5.1
## [89] withr_2.5.0 RCurl_1.98-1.8
## [91] crayon_1.5.1 modelr_0.1.9
## [93] utf8_1.2.2 tzdb_0.3.0
## [95] rmarkdown_2.16 viridis_0.6.2
## [97] locfit_1.5-9.6 grid_4.2.1
## [99] readxl_1.4.1 reprex_2.0.2
## [101] digest_0.6.29 munsell_0.5.0
## [103] beeswarm_0.4.0 viridisLite_0.4.1
## [105] vipor_0.4.5 bslib_0.4.0