Acknowledgments: much of the material in this section has been derived from the chapters on differential expression and abundance in the OSCA book and the Hemberg Group course materials. Additional material concerning miloR has been based on the demonstration from the Marioni Lab.

1 Differential expression analysis

When we have data for multiple conditions (e.g. control vs treatment, healthy vs diseased, wild-type vs mutant, etc.) and multiple biological replicates for each of them, we can investigate which genes are differentially expressed between conditions. This is similar to what is done in more conventional bulk RNA-seq analysis, but in this case we have the added benefit of being able to do our analysis focusing on particular cell clusters/types (which would not be possible to do in a bulk experiment, unless we had done cell-sorting before the sequencing).

Differential analyses of multi-condition scRNA-seq experiments can be broadly split into two categories:

  • differential expression (DE) tests for changes in expression between conditions for cells of the same ‘type’ that are present in both conditions.
  • differential abundance (DA) tests for changes in the composition of cell types (or states, etc.) between conditions.

In this section we will cover the first of these, i.e. how to do a differential expression analysis for a simple design with a single factor (variable) with two levels (conditions). In our case this corresponds to the comparison between cells from healthy (PBMMC) versus leukemia (ETV6-RUNX1) donors.

There are three main steps for a bulk analysis:

  • Creating the pseudo-bulk samples
  • Filter low-count cells/samples
  • Run the differential analysis

1.1 Setup

We start by loading our packages and the single-cell object that will be used. We will focus our analysis on the set of PBMMC (healthy) and ETV6-RUNX1 (leukemia) samples. These data have gone through QC, normalised, batch-corrected and clustered as shown in the previous sections. We have clustered the cells using the Leiden algorithm with k=60, and did some basic cell type annotation based on known markers of blood/immune cells.

# load packages
library(scater)
library(scran)
library(batchelor)
library(bluster)
library(edgeR)
library(tidyverse)

# load the SCE object
sce <- readRDS("R_objects/Caron_clustered.PBMMCandETV6RUNX1.rds")

# plot UMAP done on the batch-corrected data
plotReducedDim(sce, dimred = "UMAP_corrected", 
               colour_by = "label", 
               text_by = "label")

1.2 Creating pseudo-bulk samples

Pseudo-bulk samples consist of summing the counts across all the cells for a given sample and cell label combination. It is up to you to decide what the cell labels should consist of. In our case, we’re using the Leiden clusters, which were further manually annotated according to cell types. But you may have defined these labels, for example, based on a more systematic cell type annotation using public atlases or reference gene panels (see Chapter 7 - Cell type annotation in the OSCA book).

To revise, here is how our sample + label combinations look like:

# tabulate number of cells per label + sample combination
table(sce$label, sce$SampleName)
##                     
##                      ETV6-RUNX1_1 ETV6-RUNX1_2 ETV6-RUNX1_3 ETV6-RUNX1_4
##   B (c1)                     1761         4939         1791         2331
##   B (c2)                      650         1023          251         1274
##   B (c3)                      244           97           17           27
##   T (c4)                       15          193         1897          283
##   Erythrocytes (c5)             1            0           71          178
##   CD20+ B (c6)                  6          115          545          139
##   B (c7)                       83          389           91           95
##   NK T (c8)                     9           44           96           59
##   Erythrocytes (c9)             3           13           95          159
##   Mono (c10)                    1            6           19           43
##   B (c11)                       0           17            9            5
##   CD20+ B (c12)                 0            2            0            0
##   CD20+ B (c13)                 0            1            0            0
##   T (c14)                       0            5           23            4
##   Erythrocytes (c15)            0            1           89         1019
##   Erythrocytes (c16)            0            0           26           79
##   Mono (c17)                    0            0            0            0
##                     
##                      PBMMC_1 PBMMC_2 PBMMC_3
##   B (c1)                  40      69     187
##   B (c2)                  16      39     113
##   B (c3)                  31     110     144
##   T (c4)                 129    1213    1143
##   Erythrocytes (c5)        7     483      18
##   CD20+ B (c6)            33     418     584
##   B (c7)                   6      21       5
##   NK T (c8)               26     388     280
##   Erythrocytes (c9)       26     541     119
##   Mono (c10)             140     405     690
##   B (c11)                  4      32      28
##   CD20+ B (c12)          124      80     253
##   CD20+ B (c13)          251     211     780
##   T (c14)                  0       1       5
##   Erythrocytes (c15)       0     788      39
##   Erythrocytes (c16)       0      11       5
##   Mono (c17)              13      42      37

We can see that the contribution of each sample to the different labels (clusters) is quite unbalanced for some of them. This may be biologically correct, or it may prompt us to revise our batch correction step. For now, we will proceed with these data as they are.

We use label and SampleName as the variables to aggregate by. We therefore will have a maximum of 7x17 (119) pseudo-samples. We will have less than this because, as we saw, there are some clusters with no cells from a particular sample.

# sum counts across cells - by label and sample
summed <- aggregateAcrossCells(sce, 
                               ids = colData(sce)[, c("label", "SampleName")])

# the output is a new SCE object with aggregated counts matrix
summed
## class: SingleCellExperiment 
## dim: 33102 100 
## metadata(1): Samples
## assays(1): counts
## rownames(33102): MIR1302-2HG ENSG00000238009 ... ENSG00000276017
##   ENSG00000278817
## rowData names(4): ID Symbol Type Chromosome
## colnames: NULL
## colData names(16): Sample Barcode ... SampleName ncells
## reducedDimNames(3): corrected TSNE_corrected UMAP_corrected
## mainExpName: NULL
## altExpNames(0):

1.3 Run differential expression

Now that we have pseudo-bulk samples, we can proceed to our differential expression analysis. This can be done using standard bulk RNA-seq R/Bioconductor packages, such as edgeR, DESeq2 or limma. In our case, we will use helper functions from scran, which, behind the scenes, use the edgeR package. If you are experienced in this sort of analysis, you could certainly use one of the other packages if you are more familiar with them. Going through the details of bulk RNA-seq differential expression analysis is out of the scope of these materials, but if you want to learn more we recommend that you attend our bulk RNA-seq course and read the documentation for edgeR or DESeq2 packages.

Before we run the actual differential analysis, we will filter pseudo-samples with low number of cells. Conveniently, the aggregateAcrossCells() function creates a new variable in our colData() slot with this information, so we can use it to do our filtering.

# "ncells" is added to our colData
colData(summed)[, c("SampleName", "ncells")]
## DataFrame with 100 rows and 2 columns
##       SampleName    ncells
##      <character> <integer>
## 1   ETV6-RUNX1_1      1761
## 2   ETV6-RUNX1_2      4939
## 3   ETV6-RUNX1_3      1791
## 4   ETV6-RUNX1_4      2331
## 5        PBMMC_1        40
## ...          ...       ...
## 96       PBMMC_2        11
## 97       PBMMC_3         5
## 98       PBMMC_1        13
## 99       PBMMC_2        42
## 100      PBMMC_3        37
# use this to filter our pseudo-bulk object
summed <- summed[, summed$ncells >= 20]

We are now ready for our differential analysis. The scran package includes the function pseudoBulkDGE(), which conveniently loops through each label and performs the differential analysis between the two conditions.

Here is the syntax for this step, with an explanation of the options to follow:

# perform differential analysis for each label
# this uses edgeR package behind the scenes.
de_results <- pseudoBulkDGE(summed, 
                            label = summed$label,
                            design = ~ SampleGroup, 
                            coef = "SampleGroupPBMMC",
                            condition = summed$SampleName)

There’s a few options here we need to explain.

First, the design = ~ SampleGroup option. This is R’s formula syntax for defining linear models. In this case, we are indicating that we want to model the gene expression based on the “SampleGroup” variable of our colData(). This formula is converted to a so-called design matrix, which is internally used by the statistical machinery to fit the model to the data and perform a statistical test against a null hypothesis.

It is useful to see what this model matrix looks like, to understand the next argument in this function (coef = "SampleGroupPBMMC").

# the model matrix being used by edgeR
head(model.matrix(~ SampleGroup, data = colData(summed)))
##   (Intercept) SampleGroupPBMMC
## 1           1                0
## 2           1                0
## 3           1                0
## 4           1                0
## 5           1                1
## 6           1                1

This matrix contains an indicator variable (with 0’s and 1’s) indicating which sample belongs to the PBMMC group. If we want to test for differential expression between our “ETV6-RUNX1” group and “PBMMC”, we need to specify the name of this indicator variable to the pseudoBulkDGE() function, as we did above.

It may seem strange to have to do this, in this case, as we only have two groups. But if you had more than two groups (e.g. if we also had the “HDD” samples), then there would be more possible comparisons to do.

Finally, we specified the option condition = summed$SampleName, which is used to automatically filter low-count genes.

Looking at our output object, we can see that this is a list of DataFrame tables, containing the results of the analysis:

# the results come as a list for each cell label
de_results
## List of length 11
## names(11): B (c1) B (c2) B (c3) B (c7) ... Mono (c10) NK T (c8) T (c4)
# extract one of the tables from the list
b_c1_res <- de_results[[1]]

These output tables contain metadata with the edgeR object that was used to fit the model, and we can use them for further quality-control analysis. For example, looking at the first of these comparisons for B cells present in our original cluster 1:

# extract the edgeR object used for differential expression
b_c1_dgelist <- metadata(b_c1_res)$y

# plot mean-CV relationship across genes
plotBCV(b_c1_dgelist)

# plot mean-difference plot to assess library size normalisation
# we expect simmetric distribution centered around zero
par(mfrow=c(2,4))
for (i in seq_len(ncol(b_c1_dgelist))) {
  plotMD(b_c1_dgelist, column=i)
  abline(h = 0, col = "salmon", lwd = 2)
}

# plot MDS 
plotMDS(cpm(b_c1_dgelist, log = TRUE), 
        col = ifelse(b_c1_dgelist$samples$SampleGroup == "PBMMC", "tomato", "steelblue"))

The latter plot represents a projection of the data on a 2-dimensional plot using a dimensionality reduction method similar to PCA. This is useful to check if our samples separate in the way we may expect, given their different conditions. In this case we can see that the first dimension explains ~40% of the variance in the data and separates our samples according to their sample group, which would make sense biologically if their states are distinct between leukemia and healthy samples.

We could further loop through all the results to produce a matrix of plots (this code is a little more advanced, but gives you an idea of how you could do this):

# loop through all results
lapply(de_results, function(current_result){
  # extract the edgeR object from the metadata
  current_result <- metadata(current_result)$y
  
  # make a colour scale
  mds_cols <- ifelse(current_result$samples$SampleGroup == "PBMMC", "tomato", "steelblue")
  
  # identify the cell label this corresponds to
  mds_title <- unique(current_result$samples$label)
  
  # make the plot
  # note the `col` and `main` options are standard options in the base R plot()
  plotMDS(cpm(current_result, log = TRUE), 
          col = mds_cols, main = mds_title)
})

1.4 Obtaining DE genes

# Explore DEGs ----

After running the differential expression analysis, we can look at the differentially expressed genes for each cell type label. We can use the decideTestsPerLabel() function to achieve this, which allows us to define a false-discovery rate threshold (adjusted p-value), depending on our desired level of stringency.

# identify DEGs based on FDR threshold
is_de <- decideTestsPerLabel(de_results, threshold = 0.05)

# this outputs a large matrix
is_de[1:10, 1:5]
##                 B (c1) B (c2) B (c3) B (c7) CD20+ B (c6)
## MIR1302-2HG         NA     NA     NA     NA           NA
## ENSG00000238009     NA     NA     NA     NA           NA
## ENSG00000239945     NA     NA     NA     NA           NA
## ENSG00000241860      0     NA     NA     NA           NA
## ENSG00000286448     NA     NA     NA     NA           NA
## ENSG00000235146     NA     NA     NA     NA           NA
## ENSG00000229905     NA     NA     NA     NA           NA
## LINC01409            0      0     NA     NA            0
## FAM87B              NA     NA     NA     NA           NA
## LINC00115            0      0     NA     NA           NA

The output is a large matrix (genes x label) containing the following values:

  • -1 and 1 indicating the gene was down- or up-regulated in PBMMC samples, respectively
  • 0 indicating the gene was not differentially expressed
  • NA indicating there wasn’t enough data to carry the test (e.g. all the replicates had zero counts in one of the groups)

This matrix can be hard to interpret by itself, but we can summarise it as follows:

# summarise the results
summarizeTestsPerLabel(is_de)
##                     -1     0    1    NA
## B (c1)             933 16292  409 15468
## B (c2)             378 14414  195 18115
## B (c3)             971  9081 1082 21968
## B (c7)               7  4175   14 28906
## CD20+ B (c6)        77  9134  111 23780
## Erythrocytes (c15)   1  5505    6 27590
## Erythrocytes (c5)  127  4479  412 28084
## Erythrocytes (c9)  138  8653  101 24210
## Mono (c10)          74 11213   32 21783
## NK T (c8)           71  4909  110 28012
## T (c4)             211 10876  376 21639

We can see that there is a very large number of NA values in the results. This reflects the fact that many genes may not be sufficiently expressed in a given cell type to perform the statistical analysis, which should not surprise us given the sparsity often seen in single-cell data.

However, we still get hundreds of genes differentially expressed in some of these clusters, suggesting differences between leukemia and healthy samples.

Let’s look at one of the top genes for one the B cell (cluster 1) label that we extracted out of our results list earlier.

# filter the results table for DEGs
b_c1_res %>% 
  as.data.frame() %>% 
  arrange(FDR) %>%
  head()
##               logFC   logCPM        F       PValue          FDR
## HTR1F      9.832270 6.751203 915.8780 8.014308e-13 1.413243e-08
## FXYD2      7.240332 5.259194 464.1688 4.654515e-11 4.103886e-07
## L3MBTL4    6.733745 4.923588 314.9357 4.600992e-10 2.704463e-06
## MIR4432HG  5.510355 6.313928 219.4639 3.779479e-09 1.666183e-05
## ADGRE2     5.961772 5.241688 174.0209 1.434813e-08 5.060299e-05
## MSR1      -6.734404 7.991053 140.0387 4.920405e-08 1.098080e-04
# remove size factors from the object (otherwise plotting function complains)
# and add normalised log counts to the object
sizeFactors(summed) <- NULL
summed <- logNormCounts(summed)

# visualise from our summed single cell object
plotExpression(summed, 
               features = "HTR1F",
               x = "SampleName", colour_by="SampleGroup", 
               other_fields="label") + 
  facet_wrap(~ label) +
  scale_x_discrete(guide = guide_axis(angle = 45))

In this case, we can see that the gene seems differentially expressed not only on B cell cluster 1, but in other B cell clusters as well.

1.5 Exercise

We want to achieve the following:

  • Rerun the differential expression analysis using the PRE-T and HHD samples.
  • Determine which cluster has the most DEGs.
  • Visualise the expression of one of the top genes for that cluster.
# First load in the other two sample groups
sce_PRET_HHD <- readRDS("R_objects/Caron_clustered.PRETandHHD.rds")

# check your sce object
sce_PRET_HHD

# Part A
# Run the analysis using these new samples

FIXME

# Part B
# Determine which cluster has the most DEGs

FIXME

# Part C
# Visualise the expression of one of the top genes for that cluster

FIXME
Hint A

You can copy and paste from the previous code and just change which object is used.

Hint B

You will need to run pseudoBulkDGE() to get the results for all the clusters at once.

Hint C

You can copy and paste this code from earlier too, but remember to replace the number of the cluster with the one you found in Part B.

Answer

Here is the complete script:

# First load in the other two sample groups
sce_PRET_HHD <- readRDS("R_objects/Caron_clustered.PRETandHHD.rds")

# check the sce object
sce_PRET_HHD

# Part A
# Run the analysis using these new samples

# create the pseudo-bulk samples
summed_PRET_HHD <- aggregateAcrossCells(sce_PRET_HHD, 
                                        id = colData(sce_PRET_HHD)[,c("label", "SampleName")])
summed_PRET_HHD

# filter to retain pseudo-bulk samples with at least 20 cells
summed_PRET_HHD <- summed_PRET_HHD[, summed_PRET_HHD$ncells >= 20]

# check the column names in the design matrix
# to determine the coefficient name for testing
model.matrix(~ SampleGroup, data = colData(summed_PRET_HHD))

# run the differential analysis
de_results_PRET_HHD <- pseudoBulkDGE(summed_PRET_HHD, 
                                     label = summed_PRET_HHD$label,
                                     design = ~ SampleGroup,
                                     coef = "SampleGroupPRE-T",
                                     condition = summed_PRET_HHD$SampleName)

# Part B
# Determine which cluster has the most DEGs
is_de_PRET_HHD <- decideTestsPerLabel(de_results_PRET_HHD, threshold=0.05)
summarizeTestsPerLabel(is_de_PRET_HHD)

# Part C
# Visualise the expression of one of the top genes for that cluster
# arrange results table by FDR
de_results_PRET_HHD[["B (c4)"]] %>%
  as.data.frame() %>% 
  arrange(FDR) %>%
  head()

# reset the size factors in the summed object
# and add normalised logcounts
sizeFactors(summed_PRET_HHD) <- NULL
summed_PRET_HHD <- logNormCounts(summed_PRET_HHD)

# visualise expression
plotExpression(summed_PRET_HHD, 
               features = "S100A16",
               x = "SampleName", colour_by="SampleGroup", 
               other_fields="label") + 
  facet_wrap(~label) +
  scale_x_discrete(guide = guide_axis(angle = 45))

1.6 Differences between celltypes

The same methodology we applied above can also be used to compare expression between cell types/labels. However, this is a little redundant with the marker gene analysis, which we have done previously. We leave the demonstration here for completeness, and you can read more about it in the respective section of the OSCA book.

In our case, let’s say we were interested in finding differentially expressed genes between T cells in cluster 4 and T cells in cluster 8 (presumed natural killer cells).

# subset the object to keep only the labels of interest
summed_subset <- summed[, summed$label %in% c("T (c4)", "NK T (c8)")]

# run the DE analysis
# notice the dummy variable addition - TODO: explain what this does
between_label_results <- pseudoBulkDGE(summed_subset,
                                      label = rep("dummy", ncol(summed_subset)),
                                      design = ~ factor(SampleName) + factor(label),
                                      coef = "factor(label)NK T (c8)")[[1]]

# tabulate the results
table(Sig = between_label_results$FDR <= 0.05, Sign = sign(between_label_results$logFC))
##        Sign
## Sig       -1    1
##   FALSE 3237 2767
##   TRUE  1205 1176
# check the top genes
between_label_results %>% 
  as.data.frame() %>% 
  arrange(FDR) %>% 
  head()
##            logFC    logCPM         F       PValue          FDR
## CST7    5.249409  8.189040 1078.8257 9.250290e-36 7.756368e-32
## RPL34  -1.440152 13.218584  667.3893 8.774124e-31 3.678552e-27
## FCGR3A  7.369647  5.103200  547.8855 8.796101e-29 1.475106e-25
## RPL32  -1.418528 12.360069  554.6349 6.622233e-29 1.475106e-25
## CCL4    6.500536  8.825846 1193.6222 8.635574e-29 1.475106e-25
## ZEB2    3.786263  7.498872  511.8107 4.244364e-28 5.415413e-25

We can see some genes here that are associated with NK cells, such as CCL4. However, this finding shouldn’t necessarily surprise us, as we used known markers such as this to classify our clusters. This emphasises the dubious validity of this approach to infer statistically differentially expressed genes between cell types - if we used these genes to define the cell types to start with, then it’s no surprise we find them as differentially expressed (along with other genes that are correlated with them).

Still, this approach can be useful as a complement to marker gene identification, to find candidate genes for further exploration.

1.7 Session information

sessionInfo()
## R version 4.3.2 (2023-10-31)
## Platform: x86_64-apple-darwin20 (64-bit)
## Running under: macOS Monterey 12.7.3
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Europe/London
## tzcode source: internal
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] lubridate_1.9.3             forcats_1.0.0              
##  [3] stringr_1.5.1               dplyr_1.1.4                
##  [5] purrr_1.0.2                 readr_2.1.5                
##  [7] tidyr_1.3.1                 tibble_3.2.1               
##  [9] tidyverse_2.0.0             edgeR_4.0.16               
## [11] limma_3.58.1                bluster_1.12.0             
## [13] batchelor_1.18.1            scran_1.30.2               
## [15] scater_1.30.1               ggplot2_3.5.0              
## [17] scuttle_1.12.0              SingleCellExperiment_1.24.0
## [19] SummarizedExperiment_1.32.0 Biobase_2.62.0             
## [21] GenomicRanges_1.54.1        GenomeInfoDb_1.38.8        
## [23] IRanges_2.36.0              S4Vectors_0.40.2           
## [25] BiocGenerics_0.48.1         MatrixGenerics_1.14.0      
## [27] matrixStats_1.3.0          
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-7              gridExtra_2.3            
##  [3] rlang_1.1.3               magrittr_2.0.3           
##  [5] compiler_4.3.2            DelayedMatrixStats_1.24.0
##  [7] vctrs_0.6.5               pkgconfig_2.0.3          
##  [9] crayon_1.5.2              fastmap_1.1.1            
## [11] XVector_0.42.0            labeling_0.4.3           
## [13] utf8_1.2.4                rmarkdown_2.26           
## [15] tzdb_0.4.0                ggbeeswarm_0.7.2         
## [17] xfun_0.43                 zlibbioc_1.48.2          
## [19] cachem_1.0.8              beachmat_2.18.1          
## [21] jsonlite_1.8.8            highr_0.10               
## [23] DelayedArray_0.28.0       BiocParallel_1.36.0      
## [25] irlba_2.3.5.1             parallel_4.3.2           
## [27] cluster_2.1.6             R6_2.5.1                 
## [29] bslib_0.7.0               stringi_1.8.3            
## [31] jquerylib_0.1.4           Rcpp_1.0.12              
## [33] knitr_1.45                splines_4.3.2            
## [35] timechange_0.3.0          Matrix_1.6-5             
## [37] igraph_2.0.3              tidyselect_1.2.1         
## [39] rstudioapi_0.16.0         abind_1.4-5              
## [41] yaml_2.3.8                viridis_0.6.5            
## [43] codetools_0.2-20          lattice_0.22-6           
## [45] withr_3.0.0               evaluate_0.23            
## [47] pillar_1.9.0              generics_0.1.3           
## [49] RCurl_1.98-1.14           hms_1.1.3                
## [51] sparseMatrixStats_1.14.0  munsell_0.5.1            
## [53] scales_1.3.0              glue_1.7.0               
## [55] metapod_1.10.1            tools_4.3.2              
## [57] BiocNeighbors_1.20.2      ScaledMatrix_1.10.0      
## [59] locfit_1.5-9.9            grid_4.3.2               
## [61] colorspace_2.1-0          GenomeInfoDbData_1.2.11  
## [63] beeswarm_0.4.0            BiocSingular_1.18.0      
## [65] vipor_0.4.7               cli_3.6.2                
## [67] rsvd_1.0.5                fansi_1.0.6              
## [69] S4Arrays_1.2.1            viridisLite_0.4.2        
## [71] ResidualMatrix_1.12.0     gtable_0.3.4             
## [73] sass_0.4.9                digest_0.6.35            
## [75] SparseArray_1.2.4         ggrepel_0.9.5            
## [77] dqrng_0.3.2               farver_2.1.1             
## [79] htmltools_0.5.8.1         lifecycle_1.0.4          
## [81] statmod_1.5.0