1 Introduction

In order to aid the interpretation of the clustering results that we covered in the previous section, it is helpful to identify genes that contribute to the separation of cells into those clusters.

The main approach to achieve this, is to identify genes that are differently expressed between clusters. These may be, for example, exclusively expressed in a single cluster or perhaps differentiate between a few different clusters. There are different methods to identify expression differences between clusters: using mean expression level, or the ranking of the gene by its expression, or the proportions of cells that express the gene.

Our main objective in this section is to cover some of the methods that can be used to achieve this goal, and obtain a summary table of results. As always, the OSCA chapter on marker detection contains additional detail and advice.

2 Load data

# load packages
library(scater)
library(scran)
library(tidyverse)
library(patchwork)

We will use the data set generated in the clustering session. This contains 7 samples from the Caron data set. For the purposes of these materials, in the interests of time, each sample has been downsampled to only contain 500 cells.

# read single cell object
sce <- readRDS("R_objects/Caron_clustered.500.rds")

Note that we have also replaced the gene names (rownames) of our objects to use common gene names instead of Ensembl IDs. This was done using the uniquifyFeatureNames() function, which makes this safely by avoiding duplicate gene names.

rownames(sce)[11:20]
##  [1] "LINC01128"       "ENSG00000288531" "FAM41C"          "ENSG00000272438"
##  [5] "ENSG00000230699" "ENSG00000241180" "SAMD11"          "NOC2L"          
##  [9] "KLHL17"          "PLEKHN1"

For this demonstration we will investigate the clustering generated using the Leiden clustering algorithm with k set to 25. The results of this clustering have been added to a column in the colData called “label” using the colLabels() accessor. The advantage of this is that should we later decide to use a different clustering or labels (such as cell types), we can simply change the contents of this column and there would be no need to modify any subsequent code.

Double check that the “label” column contains the clustering that we are interested in:

# check labels are set to our clusters
all(sce$k.25_cluster.fun.leiden == sce$label)
## [1] TRUE

To remind ourselves, we can visualise the clusters on a UMAP:

# visualise UMAP of our clusters
plotReducedDim(sce, 
               dimred = "UMAP_corrected",
               colour_by = "label", 
               text_by = "label")

Our objective is to identify genes that distinguish these clusters from one another - “cluster marker genes”. Intuitively we hope that the clusters relate to specific cell populations, and therefore we are trying to find genes that will allow us to identify the cell types for each cluster.

For example genes such as the “CST3” gene, which is a known monocyte marker:

# visualise monocyte-specific marker
plotReducedDim(sce, 
               dimred = "UMAP_corrected",
               colour_by = "CST3", 
               text_by = "label", 
               by_exprs_values = "reconstructed",
               add_legend = FALSE)

3 Identifying cluster marker genes

Although we have defined our clusters based on the batch-corrected expression values, these should not be used for for gene-based analyses like marker gene detection. Instead, we should use the uncorrected (normalised) expression values for differential expression between clusters. This is because data integration algorithms bring cells together based on their overall gene expression, but for each gene individually the data transformation may introduce artificial agreement between batches, which is not ideal for gene-level differential analysis. Furthermore, the severity of these biases is dependent on the parameters used for data integration (such as the number of nearest neighbours in the fastMNN() method).

Valid assays to use in gene based differential analysis tests are the normalised counts obtained from the deconvolution method (using scuttle::computePooledFactors() + scuttle::logNormCounts()) or from the variance stabilising transformation method (using sctransform::vst()). In our SCE object, we have the normalised counts in the “logcounts” assay, which we can access with assay(sce, "logcounts") (or using the shortcut logcounts(sce)).

3.1 Pairwise Cluster Comparisons

The basic approach for marker gene identification across clusters is to perform statistical tests for each gene between every pair of clusters. The scoreMarkers() function can do this for us, while accounting for known factors (aka “blocking factors” or “blocks”), such as sample batch.

The scoreMarkers() function outputs a list of DataFrame objects, one for each cluster compared to all others. However, note that the blocking assumes that each pair of clusters is present in at least one of the blocks. If there are two clusters which are not both present in at least one block (in our case Samples), then that pairwise comparison will by necessity be omitted.

By default the scoreMarkers() function will use the log-normalised counts as stored in the “logcounts” assay slot of the single cell object, so there is no need for us to specify it.

# calculate pairwise marker gene statistics
markers <- scoreMarkers(sce, 
                        groups = sce$label, 
                        block = sce$SampleName)

The returned object is a list of the same length as the number of clusters. We can access the results for a particular cluster thus:

# extract results for cluster 11
c11_markers <- as.data.frame(markers[["11"]])
head(c11_markers)
##                  self.average other.average self.detected other.detected
## MIR1302-2HG     -9.525318e-05  0.0006218571  0.0000000000   0.0003457815
## ENSG00000238009 -7.055166e-04  0.0006147748  0.0006904368   0.0017581093
## ENSG00000239945  0.000000e+00  0.0000000000  0.0000000000   0.0000000000
## ENSG00000241860  1.467431e-02  0.0037458036  0.0123830204   0.0021588602
## ENSG00000286448  0.000000e+00  0.0000000000  0.0000000000   0.0000000000
## ENSG00000235146  0.000000e+00  0.0000000000  0.0000000000   0.0000000000
##                 mean.logFC.cohen min.logFC.cohen median.logFC.cohen
## MIR1302-2HG         -0.006551891     -0.07207080          0.0000000
## ENSG00000238009     -0.017921089     -0.09532710          0.0000000
## ENSG00000239945      0.000000000      0.00000000          0.0000000
## ENSG00000241860      0.138798000      0.06154317          0.1480302
## ENSG00000286448      0.000000000      0.00000000          0.0000000
## ENSG00000235146      0.000000000      0.00000000          0.0000000
##                 max.logFC.cohen rank.logFC.cohen  mean.AUC   min.AUC median.AUC
## MIR1302-2HG           0.0000000             7484 0.4998127 0.4979400  0.5000000
## ENSG00000238009       0.0000000             8233 0.4988433 0.4920561  0.5000000
## ENSG00000239945       0.0000000             7484 0.5000000 0.5000000  0.5000000
## ENSG00000241860       0.2003104             2621 0.5073458 0.5039262  0.5080327
## ENSG00000286448       0.0000000             7484 0.5000000 0.5000000  0.5000000
## ENSG00000235146       0.0000000             7484 0.5000000 0.5000000  0.5000000
##                   max.AUC rank.AUC mean.logFC.detected min.logFC.detected
## MIR1302-2HG     0.5000000     5251       -3.652833e-02      -4.018116e-01
## ENSG00000238009 0.5000000     5733       -6.465313e-02      -2.859813e-01
## ENSG00000239945 0.5000000     5251        1.144782e-18      -2.530189e-16
## ENSG00000241860 0.5106707     2339        5.795616e-01       1.378355e-01
## ENSG00000286448 0.5000000     5251        1.144782e-18      -2.530189e-16
## ENSG00000235146 0.5000000     5251        1.144782e-18      -2.530189e-16
##                 median.logFC.detected max.logFC.detected rank.logFC.detected
## MIR1302-2HG              0.000000e+00       2.237006e-16                4715
## ENSG00000238009         -8.706192e-17       1.321497e-16                5339
## ENSG00000239945          1.337348e-18       2.237006e-16                4715
## ENSG00000241860          4.888574e-01       1.011211e+00                1003
## ENSG00000286448          1.337348e-18       2.237006e-16                4715
## ENSG00000235146          1.337348e-18       2.237006e-16                4715

This DataFrame contains the results for cluster 10. The first four columns contain summary statistics:

  • self.average - the mean log-expression in the cluster of interest
  • other.average - the grand mean across all other clusters
  • self.detected - the proportion of cells with detected expression in the cluster of interest
  • other.detected - the mean detected proportion across all other clusters.

The remaining columns contain summaries of three scores from the pairwise comparisons. The three scores are:

  • logFC.cohen - “Cohen’s d” - this is the log fold change of mean gene expression standardized by the average standard deviation across the groups. This can be interpreted in a similar way to log fold change in that a positive value indicates upregulation in the cluster of interest.
  • AUC - “Area Under the Curve” - this quantifies the ability to distinguish between two gene expression distributions. It can be interpreted as the likelihood that any random cell in the cluster of interest will have a higher expression of the gene than any random cell in the other cluster. It ranges from 0 to 1, where 1 can be interpreted as upregulation, 0 downregulation, and 0.5 as no difference.
  • logFC.detected - this is the log fold change in the proportion of cells in which the gene is detected in the cluster of interest, versus the proportion of cells in which the gene is detected in the other cluster. Positive values indicate that the gene is detected in more cells in the cluster of interest than the other cluster. Note, this takes no account of the magnitude of the gene expression, instead this metric helps to identify presence/absence differences in gene expression between clusters.

More detail on the differences between these effect size scores can be found in the “Advanced” Marker detection chapter of the OSCA book.

Whilst all the pairwise scores can be retrieved by adding the argument full.stats=TRUE to scoreMarkers(), by default this function returns 5 summary statistics for each score:

  • mean.X - this is the mean of the score across all pairwise comparisons. It gives the relative expression of the gene versus the average of the other clusters.
  • min.X - this is the minimum score obtained across all pairwise comparisons. This is the most stringent summary statistic for detecting upregulated genes, if the score is high, then the gene is upregulated in the cluster of interest relative to all other clusters. Conversely, if the minimum is low, then the gene is downregulated relative to at least one other cluster.
  • median.X - this is the median of the score across all pairwise comparisons. It is more robust to outliers than the mean. Whilst this is beneficial in avoiding the situation where the effect size is very large in only a small number of comparisons, it may conversely be detrimental to marker gene selection if there are small number of clusters with very similar effect sizes, as these will be effectively ignored.
  • max.X - this is the maximum score obtained across all pairwise comparisons. This is the least stringent summary statistic for detecting upregulated genes as a high score only indicates that the gene is upregulated in the cluster of interest relative to at least one other clusters. Conversely, if the maximum is low, then the gene is downregulated relative to all other clusters.
  • rank.X - This is the minimum ranking (“min-rank”) of that gene by that score across all clusters. For each pairwise comparison the genes are ranked according to the score, this summary provides the lowest rank for that gene across all comparisons. Essentially, a gene with a low “min-rank” will be one of the best genes according to that score in at least one comparison.

The choice of the summary used for ranking will effect the stringency of the selection. See the the OSCA books “Basic” chapter on Marker gene detection for further discussion of these different summaries. In general the mean and median make reasonable defaults for most applications. In practice, the minimum and maximum are most helpful for diagnosing discrepancies between the mean and median, rather than being used directly for ranking.

Selecting genes based on a given min-rank, say 5, is useful as it will generate a list of genes that is the union of genes with a rank of 5 or less for any pairwise comparison. This will ensure we get at least 5 genes that distinguish the cluster of interest from all other clusters.

For example using the min-rank for Cohen’s d on cluster 11 yields 20 marker genes:

# filter markers based on rank statistics
c11_markers %>% 
  select(contains("cohen")) %>%
  filter(rank.logFC.cohen <= 5) %>%
  arrange(rank.logFC.cohen)
##                 mean.logFC.cohen min.logFC.cohen median.logFC.cohen
## S100A6                 4.2162224       1.7358982          3.6480872
## MNDA                   2.8918797       2.5041625          2.8789406
## FCER1G                 3.8658764       1.9474100          3.5173938
## LYZ                    4.5844176       2.1232942          4.9133791
## TYROBP                 4.4552307       2.8329535          4.5891267
## CST3                   3.9360885       1.9529362          4.4408743
## TMSB4X                 3.0228225       0.9019119          2.6328778
## LYN                    1.7960518      -0.1399217          1.9418177
## SRGN                   3.7751416       1.3830417          3.4922891
## COTL1                  3.3777957       1.4159509          2.8295088
## FTL                    2.8241147      -0.3085039          2.7148624
## S100A9                 2.7449945       1.2226671          2.9775643
## S100A4                 4.0452145       1.8627076          3.6829962
## TMSB10                 2.0462797       0.7608530          1.7812587
## ENSG00000257764        2.5798135       1.7016110          2.5734698
## LGALS1                 4.0107075       2.4023077          4.1281149
## AIF1                   2.8189536       1.4759873          2.8941228
## RPS3A                  0.8403469      -0.8621093          0.3442681
## LST1                   3.2648717       2.1604681          3.3296765
## SNX10                  2.0219680       0.3206237          1.5637960
##                 max.logFC.cohen rank.logFC.cohen
## S100A6                 8.158575                1
## MNDA                   3.469624                1
## FCER1G                 9.013874                1
## LYZ                    6.132738                1
## TYROBP                 7.441742                1
## CST3                   5.000914                1
## TMSB4X                 6.768085                1
## LYN                    3.446690                2
## SRGN                   5.799152                2
## COTL1                  8.758345                2
## FTL                    7.048414                2
## S100A9                 3.259106                3
## S100A4                 6.825441                3
## TMSB10                 4.735269                3
## ENSG00000257764        3.569238                3
## LGALS1                 4.839062                3
## AIF1                   3.580618                4
## RPS3A                  6.585066                5
## LST1                   4.500301                5
## SNX10                  7.032216                5

As we suspected, cluster 11 is likely to contain monocytes, based on the expression of CST3 and TYROBP, for example.

We can visualise the expression of some of the other top-ranked genes:

# visualise one of the markers
p1 <- plotReducedDim(sce, 
                     dimred = "UMAP_corrected",
                     colour_by = "LYZ", 
                     text_by = "label")

p2 <- plotExpression(sce, features = "LYZ", x = "label")

p1 + p2

Based on the expression of CD3D, which is a known marker for T cells, it seems likely that cells in clusters 6 and 7 are T cells.

# CD3D suggests cluster 6 and 7 are T cells
plotReducedDim(sce, 
               dimred = "UMAP_corrected",
               colour_by = "CD3D", 
               text_by = "label")

plotExpression(sce, 
               features = "CD3D", 
               x = "label")

We would like to confirm this by identifying other genes that differentiate these two clusters from the rest of the cells.

  1. Extract results for cluster 6 and convert it to a data.frame.
  2. Filter the data.frame using your choice of ranking statistic - rank.logFC.detected or rank.logFC.cohen or a combination of both.
  3. Visualise the expression of genes that seem interesting from your filters.

Hint

Fix the code below (where the word “FIXME” appears) to extract the data frame with metrics for this cluster (from our markers object), filter it and then plot the expression of a few of the genes.

# extract results from cluster 6 and convert to data.frame
c6_markers <- FIXME

# filter the data.frame using your choice of ranking statistic
# `rank.logFC.detected` or `rank.logFC.cohen`
# or a combination of both!
c6_markers %>% 
  filter(FIXME)

# visualise the expression of genes that seem interesting from your filters
plotExpression(sce, 
               features = FIXME, 
               x = "label")

Answer

Start by extracting the marker results for cluster 6 from our list:

c6_markers <- as.data.frame(markers[["6"]])

In our case, we chose to then filter our table with two criteria to find genes that:

  • are in the top 2 ranking for being mostly detected in this cluster (i.e. a presence/absence criteria)

AND

  • have a Cohen’s d in the top 10 (i.e. the average expression difference should be high, relative to the gene’s variance)
c6_top_genes <- c6_markers %>% 
  filter(rank.logFC.detected <= 2 & rank.logFC.cohen <= 10) %>% 
  rownames()
c6_top_genes
## [1] "CD3E"   "CD3D"   "PRKCH"  "BCL11B" "IL32"

We obtain 5 genes with these criteria. Finally, we visualize the expression of these genes:

plotExpression(sce, 
               features = c6_top_genes, 
               x = "label")

We can see that all of these genes behave in a similar way to our known marker CD3D, suggesting they are good markers for these cell types. This makes sense as CD3D and CD3E encode T cell surface proteins. IL32 encodes a cytokine, which is often associated with cancer and BCL11B is a transcription factor that has been linked with T-Cell malignancy. This result could open an avenue for further investigation in this study.

3.2 Heatmap of marker genes

We have already seen how we can use the plotExpression() function to visualise the distribution of expression in our data between clusters. We have also seen how to use plotReducedDim() to visualise a gene’s expression on the projected reduced dimensionality space.

Another useful type of visualisation is to use heatmaps to show the expression of these genes of interest. We will demonstrate this using the top marker genes for cluster 11.

# get top-ranked markers for cluster 11
c11_top_genes <- c11_markers %>% 
  filter(rank.logFC.cohen <= 5) %>% 
  rownames()

We can make generate a heatmap showing the expression in individual cells using the function plotHeatmap(). We will arrange the columns (cells) so that cells from the same cluster and cells from the same samplegroup are grouped together.

# visualise their expression as a heatmap
plotHeatmap(sce, 
            features = c11_top_genes,
            order_columns_by = c("label", "SampleGroup"))

Alternatively, we can summarise the expression across sample goups and generate a heatmap showing the average expression across cells within each group using the function plotGroupedHeatmap(). We can specify any factors causing batch effects using the block arguments and the batch effects will be regressed out of the averages.

# heatmap average per group (cluster)
plotGroupedHeatmap(sce, 
                   features = c11_top_genes,
                   group = "label",
                   block = "SampleGroup")

In both cases, the colour scale of expression is showing the logcounts in their original scale. However, for this kind of visualisation, it may sometimes be useful to scale the data (aka Z-score), which brings all the genes to the same relative scale.

# scaled heatmap (z-scores)
plotHeatmap(sce, 
            features = c11_top_genes,
            order_columns_by = c("label", "SampleGroup"),
            scale = TRUE, 
            center = TRUE,
            zlim = c(-3, 3))

plotGroupedHeatmap(sce, 
                   features = c11_top_genes,
                   group = "label",
                   block = "SampleGroup",
                   scale = TRUE, 
                   center = TRUE,
                   zlim = c(-3, 3))

In this case, the colour scale can be interpreted as the number of standard deviations above/below the mean of that gene across all cells.

Another useful visualisation is to use dot plots of expression that show both the average gene expression (as a colour scale) and the number of cells in which the gene is detected (as the size of the points).
We can generate such a plot using the plotDots() function:

# dot plot of expression showing average expression and detection rate
plotDots(sce, 
         features = c11_top_genes,
         group = "label", 
         block = "SampleGroup",
         scale = TRUE, center = TRUE, zlim = c(-3, 3))

3.3 Adjusting the log-fold change

The AUC and Cohen’s d scores incorporate both the gene expression differences between the clusters and the variance in gene expression scores within each cluster. If a gene has low variance, it is possible that it will be ranked highly even if the magnitude of the difference between the clusters is low. These genes will not necessarily make good marker genes. It may therefore be desirable to favour the detection of genes with larger log-fold changes.

For example, in the results from cluster 11, the gene SNX10 had a min-rank for Cohen’s d of 5:

c11_markers["SNX10", ] %>% 
  select(min.logFC.cohen, max.logFC.cohen, rank.logFC.cohen)
##       min.logFC.cohen max.logFC.cohen rank.logFC.cohen
## SNX10       0.3206237        7.032216                5

However, we can also see that its LFC goes from 0.3 to 7, which is a large range. Looking at its expression, we can see what might be going on:

plotExpression(sce,
               features = "SNX10",
               x = "label")

This gene has very low variation in expression in some clusters (because it’s lowly detected), and because Cohen’s d measures average differences scaled by variance, the gene comes up as having a high value for that metric in some comparisons.

To make our analysis more restrictive, we can instead indicate to the scoreMarkers() function what is the minimum LFC threshold we want to use to consider a gene for ranking. For example, a LFC > 2:

# score markers with LFC threshold of 2
markers_lfc <- scoreMarkers(sce,
                           groups = sce$label,
                           block = sce$SampleName,
                           lfc = 2)

# extract new results for cluster 11
c11_markers_lfc2 <- as.data.frame(markers_lfc[["11"]])

Now, SNX10’s rank dropped substantially:

c11_markers_lfc2["SNX10",  c("rank.logFC.cohen")]
## [1] 70

In conclusion, using an LFC threshold will change the ranking of the genes, to favour those genes that have highest LFC, even if they have higher variance in expression.

Obtaining p-values from marker gene analysis

You will notice that we did not make use of p-values in this analysis. The rationale for this is explained in the OSCA book:

Given that scoreMarkers() already reports effect sizes, it is tempting to take the next step and obtain p-values for the pairwise comparisons. Unfortunately, the p-values from the relevant tests cannot be reliably used to reject the null hypothesis. This is because DE analysis is performed on the same data used to obtain the clusters, which represents “data dredging” (also known as fishing or data snooping). The hypothesis of interest - are there differences between clusters? - is formulated from the data, so we are more likely to get a positive result when we re-use the data set to test that hypothesis.

The main thing to remember is that, in practice, this is a valid approach to help us annotate groups of cells based on the expression of genes with known cell-specificity and to find new interesting genes for further experiments and validation (e.g. using microscopy or qPCR). In other words, identifying cluster marker genes should be taken as a way to generate new hypothesis from our data, rather than a valid statistical model to test for differential expression between cell types.

4 Cell Type Labelling

One of the main tasks we often want to perform is annotating our cells as known types of cells ocurring in our sampled tissue. This requires prior knowledge of cell transcriptomic states and therefore becomes easier if there are well-curated resources available. However, for less well-studied tissues these may not be available and so cell type annotation may rely on “manual annotation” using a small set of genes with known cell-specific expression (e.g. from microscopy data, qPCR on cell-sorted samples, etc.).

In this section we will do a very simple manual labelling of our clusters, based on known genes expressed in different blood cell types. However, there are more systematic methods for cell type annotation, in particular when prior information is available for those cells:

  • The SingleR package uses previously labelled bulk or single-cell datasets to annotate a new dataset.
  • The AUCcell package classifies cells into types based on user-provided lists of “signature” genes for each type. These lists can be generated from literature, or also be based on prior RNA-seq datasets.
  • Another strategy is to perform a standard gene set enrichment analysis on the top marker genes for each cluster.

Any of these approaches is suitable, and they can often be done in conjunction. To learn more about these approaches, read the chapter on cell type annotation in the OSCA book.

4.1 Manual annotation

A lot is known about immune cell markers, in particular as many surface markers have been identified as useful for immunophenotyping.

To help us in our annotation, we start by retrieving the top-ranked genes from each cluster into a list. We do this by looping through the list using the lapply() function and in each case picking the genes with rank < 10 for Cohen’s D statistic:

# loop through list of marker genes and extract top-ranked gene names
top_markers_all <- lapply(markers, function(x){
  x %>% 
    as.data.frame() %>% 
    filter(rank.logFC.cohen < 10) %>% 
    rownames()
})

# examining this list reveals several known markers of immune cells
top_markers_all
## $`1`
##  [1] "STMN1"           "LAPTM5"          "RPS27"           "RCSD1"          
##  [5] "ZFP36L2"         "ARPP21"          "FHIT"            "MME"            
##  [9] "LEF1"            "RPL34"           "RPS3A"           "PDE4D"          
## [13] "SSBP2"           "MEF2C"           "IRF1-AS1"        "MZB1"           
## [17] "CD74"            "EBF1"            "SOX4"            "HLA-E"          
## [21] "HLA-DRA"         "HLA-DRB1"        "HLA-DPA1"        "HLA-DPB1"       
## [25] "BACH2"           "CD24"            "AKAP12"          "ENSG00000287092"
## [29] "AUTS2"           "KMT2E"           "PSD3"            "RPS6"           
## [33] "APBB1IP"         "LINC01374"       "DNTT"            "MALAT1"         
## [37] "RPS3"            "GAPDH"           "SLC35E3"         "SOCS2"          
## [41] "RPL21"           "FOXO1"           "FOXN3"           "RPS2"           
## [45] "RPS15A"          "PRKCB"           "TERF2"           "RPL23A"         
## [49] "RPL17"           "PLCB1"           "ERG"             "VPREB1"         
## [53] "VPREB3"          "XBP1"            "CD99"            "TMSB4X"         
## 
## $`2`
##  [1] "RCSD1"      "ARHGAP15"   "FOXP1"      "CBLB"       "ZBTB20"    
##  [6] "MME"        "PDS5A"      "LEF1"       "GAB1"       "SSBP2"     
## [11] "MEF2C"      "CD74"       "EBF1"       "STK10"      "CASC15"    
## [16] "CD2AP"      "AKAP12"     "AUTS2"      "CALN1"      "KMT2E"     
## [21] "LINC-PINT"  "PSD3"       "PAG1"       "ST3GAL1"    "DOCK8"     
## [26] "CEMIP2"     "MALAT1"     "FCHSD2"     "GAB2"       "MAML2"     
## [31] "MED13L"     "PAN3"       "FOXO1"      "ARHGEF7"    "INO80"     
## [36] "ARIH1"      "CYTH1"      "RNF125"     "TCF4"       "GNG7"      
## [41] "INSR"       "ZNF350-AS1" "LARGE1"     "MT-ND2"     "MT-CO1"    
## [46] "MT-CO3"     "MT-ND3"     "MT-ND6"     "MT-CYB"    
## 
## $`3`
##  [1] "STMN1"   "HMGN2"   "RCSD1"   "NUCKS1"  "YWHAQ"   "TMSB10"  "PTMA"   
##  [8] "SMC4"    "H2AZ1"   "RPS3A"   "HMGB2"   "SSBP2"   "CD74"    "EBF1"   
## [15] "TUBB"    "HLA-DRA" "CD24"    "AUTS2"   "MCM7"    "STRBP"   "SET"    
## [22] "VIM"     "CFL1"    "RPS3"    "GAPDH"   "TUBA1B"  "NAP1L1"  "HMGB1"  
## [29] "IGHM"    "PCLAF"   "PRKCB"   "KPNB1"   "ACTG1"   "TYMS"    "UHRF1"  
## [36] "HMGN1"   "VPREB1"  "IGLL1"   "TMSB4X" 
## 
## $`4`
##  [1] "STMN1"      "KHDRBS1"    "MACF1"      "RPS27"      "BCL11A"    
##  [6] "TMSB10"     "VAMP8"      "ARHGAP15"   "CERS6"      "ITM2C"     
## [11] "CYTL1"      "RPS3A"      "RPS23"      "MEF2C"      "CD74"      
## [16] "RNF130"     "SOX4"       "HLA-DRA"    "HLA-DPA1"   "RPS18"     
## [21] "HMGA1"      "RPS12"      "PHF14"      "AUTS2"      "CDK6"      
## [26] "LYPLA1"     "RPL30"      "PTP4A3"     "RPS6"       "GNAQ"      
## [31] "PALM2AKAP2" "TXN"        "STRBP"      "RPL12"      "SET"       
## [36] "SNHG7"      "NHLRC2"     "GSTP1"      "GAPDH"      "ETV6"      
## [41] "LDHB"       "ATP5MC2"    "HNRNPA1"    "RPL41"      "TPT1"      
## [46] "PCDH9"      "RPLP1"      "RPS17"      "RPS2"       "RPS15A"    
## [51] "RPL13"      "SNHG29"     "RPL23A"     "RPL23"      "SUMO2"     
## [56] "ACTG1"      "RPL17"      "TCF4"       "RPS19"      "FTL"       
## [61] "RPL28"      "RPS21"      "SMIM11"     "ERG"        "APOO"      
## [66] "SEPTIN6"    "RPL10"     
## 
## $`5`
##  [1] "CD52"     "RPS27"    "RALGPS2"  "ANGPTL1"  "AFF3"     "CXCR4"   
##  [7] "ARHGAP15" "BANK1"    "RPL34"    "RPS3A"    "CD74"     "HLA-DRA" 
## [13] "HLA-DRB1" "HLA-DQB1" "HLA-DPA1" "HLA-DPB1" "RPS18"    "BACH2"   
## [19] "LYN"      "ADK"      "MS4A1"    "MALAT1"   "FCHSD2"   "RPL41"   
## [25] "BTG1"     "RPL21"    "RPS29"    "IGHD"     "IGHM"     "RPS2"    
## [31] "PRKCB"    "RPL13"    "RPL26"    "RPL18A"   "CD79A"    "FTL"     
## [37] "CD37"     "RPS5"     "TMSB4X"   "RPL39"    "MT-CO1"  
## 
## $`6`
##  [1] "CD52"      "RPS27"     "PTPRC"     "TMSB10"    "ARHGAP15"  "ANKRD44"  
##  [7] "RPL32"     "FOXP1"     "LEF1"      "RPL34"     "RPS3A"     "FYB1"     
## [13] "CAMK4"     "RPS14"     "LTB"       "RPS18"     "BACH2"     "LINC-PINT"
## [19] "RPL7"      "RPS6"      "IFITM1"    "MALAT1"    "MAML2"     "CD3E"     
## [25] "CD3D"      "RPL41"     "BTG1"      "RPL21"     "KLF12"     "TRAC"     
## [31] "RPS29"     "PRKCH"     "BCL11B"    "B2M"       "RPS2"      "IL32"     
## [37] "RPL13"     "RPL26"     "SKAP1"     "RPS15"     "RPS28"     "RPS5"     
## [43] "TMSB4X"    "RPL10"    
## 
## $`7`
##  [1] "S100A4"    "CD247"     "PTPRC"     "DUSP2"     "RPL31"     "ARHGAP15" 
##  [7] "CNOT6L"    "RPL34"     "RPS3A"     "GZMA"      "RPS14"     "HLA-A"    
## [13] "HLA-B"     "FYN"       "TNFAIP3"   "KMT2E"     "LINC-PINT" "TRBC1"    
## [19] "SRGN"      "IFITM1"    "CD3E"      "CD3D"      "CD69"      "KLRD1"    
## [25] "PRKCH"     "B2M"       "RPS2"      "IL32"      "CCL5"      "SKAP1"    
## [31] "PITPNC1"   "RPS28"     "MYO1F"     "HCST"      "NKG7"      "CST7"     
## [37] "ITGB2"     "TMSB4X"    "RPL10"    
## 
## $`8`
##  [1] "MACO1"    "TRIM58"   "SEC62"    "RAPGEF2"  "RHAG"     "MAN1A1"  
##  [7] "ARL4A"    "BPGM"     "SLC25A37" "BNIP3L"   "CA1"      "IFIT1B"  
## [13] "HBB"      "HBD"      "MALAT1"   "HMBS"     "HECTD4"   "HERC1"   
## [19] "DENND4A"  "HBM"      "HBA2"     "HBA1"     "AHSP"     "TERF2IP" 
## [25] "SPECC1"   "SLC4A1"   "SLC25A39" "BLVRB"    "ALAS2"   
## 
## $`9`
##  [1] "UROD"     "NFIA"     "NUCKS1"   "LBR"      "GYPC"     "SMC4"    
##  [7] "HMGB2"    "DEK"      "TUBB"     "CD36"     "ANK1"     "CA1"     
## [13] "CA2"      "SMC2"     "HBB"      "HBD"      "TUBA1B"   "HMGB1"   
## [19] "SLC25A21" "NUSAP1"   "PRC1"     "HBA2"     "HBA1"     "AHSP"    
## [25] "TOP2A"    "SLC4A1"   "PRDX2"    "UBA52"    "BLVRB"    "FTL"     
## [31] "ATP5F1E"  "ALAS2"   
## 
## $`10`
##  [1] "SEC62"    "SNCA"     "RPS12"    "SLC25A37" "BNIP3L"   "NCOA4"   
##  [7] "HBB"      "HBD"      "IFI27"    "HBA2"     "HBA1"     "AHSP"    
## [13] "UBB"      "OAZ1"     "UBA52"    "ATP5F1E"  "ALAS2"   
## 
## $`11`
##  [1] "S100A9"          "S100A8"          "S100A6"          "S100A4"         
##  [5] "MNDA"            "FCER1G"          "TMSB10"          "LRRFIP1"        
##  [9] "CSTA"            "RPS3A"           "ARHGAP26"        "LST1"           
## [13] "AIF1"            "SNX10"           "LYN"             "RPL7"           
## [17] "VIM"             "SRGN"            "GAPDH"           "LYZ"            
## [21] "ENSG00000257764" "RPS2"            "COTL1"           "CYBA"           
## [25] "RPS28"           "TYROBP"          "FTL"             "CST3"           
## [29] "LGALS1"          "TMSB4X"         
## 
## $`12`
##  [1] "STMN1"   "RPS27"   "PRKCE"   "TMSB10"  "AFF3"    "RPL31"   "ZEB2"   
##  [8] "PTMA"    "MME"     "CD38"    "ATP8A1"  "RPL34"   "RPS3A"   "PDE4D"  
## [15] "CD74"    "CYFIP2"  "EBF1"    "SOX4"    "HLA-DRA" "BACH2"   "LYN"    
## [22] "PAX5"    "SYK"     "RPS24"   "MALAT1"  "BCL7A"   "RUBCNL"  "PCDH9"  
## [29] "TCL1A"   "IGHM"    "B2M"     "ARIH1"   "ACSM3"   "PRKCB"   "CD79B"  
## [36] "RPS28"   "NIBAN3"  "CD37"    "RPL13A"  "IGLL5"   "VPREB3"  "RPL10"

If we took some time to examine this list (which would benefit from knowing about the underlying biology of the immune system), we can start to see some genes known to differentiate different types of cells:

  • HBA1 and HBA2 → expressed in red blood cells (erythrocytes)
  • CST3 → specific to monocytes
  • CD3E and CD3D → specific to T cells
  • NKG7 → specific to natural killer (NK) T cells
  • CD79A and CD24 → specific to B cells
  • MS4A1 (CD20) → a clinically-important antigen used as a target to treat B cell-related diseases, such as leukemia. Cells with this antigen are referred to as CD20+ B cells.

Let’s visualise these markers’ expression in our clusters:

# cell type specific genes
known_genes <- c(
  "HBA1", # erythrocytes
  "CST3", # monocytes
  "CD3E", # T cells
  "NKG7", # NK T cells
  "CD79A",  # B cells
  "MS4A1" # CD20 B cells
  )

# violin plot
plotExpression(sce, x = "label", features = known_genes)

# scaled heatmap of expression
plotGroupedHeatmap(sce, 
                   features = known_genes,
                   group = "label",
                   block = "SampleGroup", 
                   scale = TRUE, center = TRUE, 
                   zlim = c(-3, 3))

Based on the expression patterns for these cells, we classify these clusters as follows:

Cluster Labelled Cell Type Evidence
1 B cells CD79A
2 B cells CD79A
3 B cells CD79A
4 B cells CD79A
5 CD20+ B cells CD79A + MS4A1
6 T cells CD3D
7 NK T cells CD3D + NKG7
8 Erythrocytes HBA1
9 Erythrocytes HBA1
10 Erythrocytes HBA1
11 Monocytes CST3
12 B cells CD79A

Now that we have a more meaningful annotation for our clusters, let’s add this to our SingleCellExperiment object. We will also add the original cluster ID in parenthesis to remind ourselves that this annotation was done based on the clusters.

The cell labels are stored in the SingleCellExperiment object as a factor (a type of object in R to store categorical data), and so we can change the labels using the levels() function, like so:

# re-label the cells - original cluster in parenthesis
levels(colLabels(sce)) <- c("B (c1)", "B (c2)", 
                            "B (c3)", "B (c4)",
                            "CD20+ B (c5)", 
                            "T (c6)", "NK T (c7)", 
                            "Erythrocytes (c8)", "Erythrocytes (c9)", 
                            "Erythrocytes c(10)",
                            "Monocytes (c11)", "B (c12)")

Now, when we label our UMAP, we can see the new labels, which are more intutitive to interpret:

# visualise UMAP with new labels
plotReducedDim(sce, dimred = "UMAP_corrected", 
               colour_by = "label", text_by = "label")

4.2 (Bonus) Further exploration

4.2.1 Erythrocytes

Looking at the expression of HBA1 and HBA2:

# note we rotate the axis labels for convenience
# this uses standard ggplot2 syntax, see:
# https://stackoverflow.com/a/60650595/5023162
plotExpression(sce, x = "label", features = c("HBA1", "HBA2")) +
  scale_x_discrete(guide = guide_axis(angle = 45))

You may notice that there is a lot of background expression across several cell clusters. HBA1 and HBA2 are common components of the “soup” or ambient RNA in scRNA-seq experiments involving blood cells. Hemoglobin chains, such as HBA1 and HBA2, are very highly expressed in red blood cells and burst red blood cells will contribute their RNA to the medium from which the cells are loaded into the 10X Chromium machine. Some of this medium, or “soup”, containing ambient RNA is captured with each cell. There are methods available to correct for soup in your analysis such as SoupX.

4.2.2 B Cells

Our simple manual annotation wasn’t very conclusive about cluster 12. While it expresses the CD79A B-cell marker, it also appears to be quite separate from other B-cell clusters on our UMAP.

plotReducedDim(sce, dimred = "UMAP_corrected", 
               colour_by = "CD79A", text_by = "label")

This is also the case for cluster 5, however we could see that it likely represents CD20+ B-cells, which explains its separate clustering. However, we didn’t have any specific gene in our small list that distinguishes cluster 12 from the other B-cell clusters.

We could look at our list of top markers and identify any that are unique to this cluster:

# get markers in cluster 12 which are not in the other markers' lists
setdiff(top_markers_all[[12]], unlist(top_markers_all[1:11]))
##  [1] "PRKCE"  "ZEB2"   "CD38"   "ATP8A1" "CYFIP2" "PAX5"   "SYK"    "RPS24" 
##  [9] "BCL7A"  "RUBCNL" "TCL1A"  "ACSM3"  "CD79B"  "NIBAN3" "RPL13A" "IGLL5"

And use this list to further investigate the specific properties of this cluster. For example, we can see that CD38 (another “cluster of differentiation” gene) is expressed highly in this cluster:

plotExpression(sce, x = "label", 
               features = "CD38") +
  scale_x_discrete(guide = guide_axis(angle = 45))

One thing to note is that this cluster only seems to contain cells from PBMMC samples:

table(sce$SampleGroup, sce$label)
##             
##              B (c1) B (c2) B (c3) B (c4) CD20+ B (c5) T (c6) NK T (c7)
##   ETV6-RUNX1   1133     49    292     47           71    218        38
##   PBMMC          64      9    132     61          138    296       103
##             
##              Erythrocytes (c8) Erythrocytes (c9) Erythrocytes c(10)
##   ETV6-RUNX1                56                26                 65
##   PBMMC                     62                87                 77
##             
##              Monocytes (c11) B (c12)
##   ETV6-RUNX1               5       0
##   PBMMC                  208     263

Therefore, it suggests there are absent from our cancer blood samples.

The difficulty in annotating cluster 12 exemplifies the limitation of manual cell annotation, which can become quite laborious and limited in its ability to classify cells.

Also, we should keep in mind that perhaps our clustering wasn’t ideal, there may have been technical issues during normalisation and dataset integration, which we should investigate (e.g. by producing UMAP and clustering from different normalisation methods).

5 Session information

sessionInfo()

sessionInfo()
## R version 4.2.2 Patched (2022-11-10 r83330)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
##  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
##  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
## [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] patchwork_1.1.2             forcats_0.5.2              
##  [3] stringr_1.4.1               dplyr_1.0.10               
##  [5] purrr_0.3.4                 readr_2.1.3                
##  [7] tidyr_1.2.1                 tibble_3.1.8               
##  [9] tidyverse_1.3.2             pheatmap_1.0.12            
## [11] scran_1.24.1                scater_1.24.0              
## [13] ggplot2_3.3.6               scuttle_1.6.3              
## [15] SingleCellExperiment_1.18.1 SummarizedExperiment_1.26.1
## [17] Biobase_2.56.0              GenomicRanges_1.48.0       
## [19] GenomeInfoDb_1.32.4         IRanges_2.30.1             
## [21] S4Vectors_0.34.0            BiocGenerics_0.42.0        
## [23] MatrixGenerics_1.8.1        matrixStats_0.62.0         
## 
## loaded via a namespace (and not attached):
##  [1] googledrive_2.0.0         ggbeeswarm_0.6.0         
##  [3] colorspace_2.0-3          ellipsis_0.3.2           
##  [5] bluster_1.6.0             XVector_0.36.0           
##  [7] BiocNeighbors_1.14.0      fs_1.5.2                 
##  [9] farver_2.1.1              ggrepel_0.9.1            
## [11] fansi_1.0.3               lubridate_1.8.0          
## [13] xml2_1.3.3                codetools_0.2-18         
## [15] sparseMatrixStats_1.8.0   cachem_1.0.6             
## [17] knitr_1.40                jsonlite_1.8.0           
## [19] broom_1.0.1               cluster_2.1.4            
## [21] dbplyr_2.2.1              compiler_4.2.2           
## [23] httr_1.4.4                dqrng_0.3.0              
## [25] backports_1.4.1           assertthat_0.2.1         
## [27] Matrix_1.5-3              fastmap_1.1.0            
## [29] gargle_1.2.1              limma_3.52.3             
## [31] cli_3.4.1                 BiocSingular_1.12.0      
## [33] htmltools_0.5.4           tools_4.2.2              
## [35] rsvd_1.0.5                igraph_1.3.5             
## [37] gtable_0.3.1              glue_1.6.2               
## [39] GenomeInfoDbData_1.2.8    Rcpp_1.0.9               
## [41] cellranger_1.1.0          jquerylib_0.1.4          
## [43] vctrs_0.4.1               DelayedMatrixStats_1.18.0
## [45] xfun_0.33                 beachmat_2.12.0          
## [47] rvest_1.0.3               lifecycle_1.0.2          
## [49] irlba_2.3.5.1             statmod_1.5.0            
## [51] googlesheets4_1.0.1       edgeR_3.38.4             
## [53] zlibbioc_1.42.0           scales_1.2.1             
## [55] hms_1.1.2                 parallel_4.2.2           
## [57] RColorBrewer_1.1-3        yaml_2.3.5               
## [59] gridExtra_2.3             sass_0.4.2               
## [61] stringi_1.7.8             highr_0.9                
## [63] ScaledMatrix_1.4.1        BiocParallel_1.30.4      
## [65] rlang_1.0.6               pkgconfig_2.0.3          
## [67] bitops_1.0-7              evaluate_0.16            
## [69] lattice_0.20-45           labeling_0.4.2           
## [71] cowplot_1.1.1             tidyselect_1.1.2         
## [73] magrittr_2.0.3            R6_2.5.1                 
## [75] generics_0.1.3            metapod_1.4.0            
## [77] DelayedArray_0.22.0       DBI_1.1.3                
## [79] pillar_1.8.1              haven_2.5.1              
## [81] withr_2.5.0               RCurl_1.98-1.9           
## [83] crayon_1.5.1              modelr_0.1.9             
## [85] utf8_1.2.2                tzdb_0.3.0               
## [87] rmarkdown_2.16            viridis_0.6.2            
## [89] locfit_1.5-9.6            grid_4.2.2               
## [91] readxl_1.4.1              reprex_2.0.2             
## [93] digest_0.6.29             munsell_0.5.0            
## [95] beeswarm_0.4.0            viridisLite_0.4.1        
## [97] vipor_0.4.5               bslib_0.4.0