1 Introduction
2 Setup
3 Identifying cluster marker genes
- 3.1 Differential expression analysis
  - 3.1.1 Tweaking differential expression analysis
- 3.2 Visualization of marker genes
4 Alternative Testing Strategies
5 Session information

1 Introduction

In order to aid the interpretation of the clustering results we covered in the previous section, it is helpful to identify genes that are contributing to the separation of cells into those clusters.

The main approach to achieve this, is to identify genes that are differently expressed between clusters. These may be, for example, exclusively expressed in a single cluster or perhaps differentiate between a few different clusters. There are different methods to identify differentially expressed genes, focusing on differences between the mean expression level, or the probability of expression between clusters or take the whole distribution of expression into account.

Our main objective in this section is to cover some of the methods that can be used to achieve this goal, and obtain a summary table of results.

2 Setup

Before we start, let’s load our packages and read our data in.

# Load packages ----
library(scater)
library(scran)
library(pheatmap)
library(tidyverse) # always load tidyverse after other packages

We will load two SingleCellExperiment objects generated in previous sections:

The SCE with logcounts normalised within each batch generated in the Batch Correction section.
The SCE with batch-corrected data integration using the batchelor::fastMNN() algorithm, including clusters covered in the Clustering section.

# Read data ----

# normalised within batches without batch-integration correction
uncorrected <- readRDS("Robjects/DataIntegration_uncorrected.Rds")
rownames(uncorrected) <- uniquifyFeatureNames(rownames(uncorrected), rowData(uncorrected)$Symbol)

# data corrected using batch integration with mutual nearest neighbours
corrected <- readRDS("Robjects/caron_postDeconv_5hCellPerSpl_dsi_PBMMC_ETV6-RUNX1_clust.Rds")
rownames(corrected) <- uniquifyFeatureNames(rownames(corrected), rowData(corrected)$Symbol)

Note that we also replace the gene names (rownames) of our objects to use common gene names instead of Ensembl IDs. Using the uniquifyFeatureNames() function makes this safely, avoiding duplicate gene names.

As a reminder, the corrected object contains the cell cluster assignments obtained, for example, with the “louvain” algorithm:

# visualise cluster assignments on the corrected data
plotUMAP(corrected, 
         colour_by = "louvain", 
         text_by = "louvain")

It will also be helpful for data visualisation purposes to copy the cluster assignment in the corrected data to our uncorrected (normalised) SCE object:

# copy cluster assignments to the uncorrected object
# first make sure that the cell names are in the same order
all(colnames(uncorrected) == colnames(corrected))

## [1] TRUE

colData(uncorrected)$louvain <- factor(colData(corrected)$louvain)

Our objective is to identify genes that distinguish between these clusters. For example genes such as the “CST3” gene, which is a known monocyte marker:

# visualise a previously known marker gene (for monocytes)
plotTSNE(corrected, 
         colour_by = "CST3", 
         text_by = "louvain", 
         by_exprs_values = "reconstructed")

3 Identifying cluster marker genes

Although we have defined our clusters based on the batch-corrected expression values, these should not be used for for gene-based analyses like marker gene detection. Instead, we should use the uncorrected (normalised) expression values for differential expression between clusters.

The reason for this is that batch correction algorithms do not necessarily preserve the magnitude (or even direction) of the differences in gene expression when attempting to integrate multiple batches. For example, the cosine normalization used by the “Mutual Nearest Neighbours” algorithm (fastMNN()) shrinks the magnitude of the expression values so that the computed log-fold changes have no obvious interpretation. Of greater concern is the possibility that the correction introduces artificial agreement across batches.

In our case, valid assays to use in differential analysis tests are the logcounts obtained from the deconvolution method (using scuttle::computePooledFactors() + scuttle::logNormCounts()) or from the variance stabilising transformation method (using sctransform::vst()).

In our case, we have these normalised logcounts in the uncorrected object (see the section on data normalisation for a recap):

3.1 Differential expression analysis

The basic approach for marker gene identification across clusters is to perform a statistical test for the expression of a gene between every pair of clusters. The findMarkers() function can do this for us, while accounting for known factors (such as sample batch). We will start by using this function with default values, which compares the mean expression between each pair of clusters:

using a Welch t-test
testing for the null hypothesis of a log-fold change of 0
for genes that either have significantly higher or lower expression between the reference cluster and the cluster it is being compared against
ranking genes based on the outcome of any of the pairwise comparisons

# Marker gene identification ----

# identify marker genes based on mean expression differences
# default options do not need to be specified, but shown here for illustration
markers_default <- findMarkers(
  uncorrected, 
  groups = factor(corrected$louvain), # clusters to compare
  block = uncorrected$SampleGroup,    # covariates in statistical model
  test.type = "t",   # t-test (default)
  direction = "any", # test for either higher or lower expression (default)
  lfc = 0, # null hypothesis log-fold-change = 0 (default)
  pval.type = "any" # ranking of p-values based on any comparison (default)
)

# returns a list of length equal to the number of clusters
markers_default

## List of length 11
## names(11): 1 2 3 4 5 6 7 8 9 10 11

# check the result of a particular cluster
markers_default[[8]]

## DataFrame with 17700 rows and 14 columns
##                  Top      p.value          FDR summary.logFC   logFC.1
##            <integer>    <numeric>    <numeric>     <numeric> <numeric>
## LYZ                1 4.78916e-105 1.21097e-101       4.50487  4.508769
## SRGN               1  3.25193e-73  2.30237e-70       2.53638  1.954566
## RPS27              1  7.21116e-97  1.41819e-93      -1.74740 -0.387626
## HBA1               1 7.37087e-189 1.30464e-184      -9.84440  0.764925
## TYROBP             2  3.51942e-50  8.53340e-48       2.22153  1.944560
## ...              ...          ...          ...           ...       ...
## AC124248.2     17594            1            1             0         0
## AC061958.1     17595            1            1             0         0
## AL139351.3     17618            1            1             0         0
## AC092939.1     17646            1            1             0         0
## Z85996.3       17678            1            1             0         0
##              logFC.2   logFC.3   logFC.4   logFC.5   logFC.6   logFC.7
##            <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## LYZ         4.495563  4.030680  4.638569  4.639435  4.605950  4.473449
## SRGN        2.536376  2.271287  2.503678  2.546259  2.293513  2.493355
## RPS27       0.146651 -0.233198  0.424884 -0.206971 -1.747398  0.475726
## HBA1        0.590209  0.748297  1.043335  0.194616  0.305193 -4.833203
## TYROBP      2.092456  2.221534  2.093691  2.082816  2.090002  2.088786
## ...              ...       ...       ...       ...       ...       ...
## AC124248.2         0         0         0         0         0         0
## AC061958.1         0         0         0         0         0         0
## AL139351.3         0         0         0         0         0         0
## AC092939.1         0         0         0         0         0         0
## Z85996.3           0         0         0         0         0         0
##              logFC.9  logFC.10  logFC.11
##            <numeric> <numeric> <numeric>
## LYZ         3.753146  4.504874  4.481884
## SRGN        2.378514  1.146504  2.446328
## RPS27       0.134966 -0.601973 -1.140086
## HBA1       -9.844399 -0.226184 -0.414213
## TYROBP      1.980865  1.669684  2.076899
## ...              ...       ...       ...
## AC124248.2         0         0         0
## AC061958.1         0         0         0
## AL139351.3         0         0         0
## AC092939.1         0         0         0
## Z85996.3           0         0         0

The output of this function allows us to identify a combination of marker genes that, together, distinguish a particular cluster against the rest (i.e. they are “marker genes” for the cluster). For example, the table of results for cluster 8 shown above contains log2-fold changes of expression between cluster 8 against each of the other clusters, along with several statistics obtained by combining p-values across these pairwise comparisons.

Of particular interest is the Top column. This is the column that summarises how a particular gene ranked when compared against all the other genes in any of the pairwise tests done between clusters. For example, gene “LYZ” has Top = 1, meaning that it had the lowest p-value out of all genes when its expression was compared between cluster 8 and at least one of the other clusters. On the other hand, the gene “TYROBP” had Top = 2 meaning that it came second when compared with all other genes (over all pairwise comparisons).

Therefore, the set of genes with Top ≤ X is the union of the top X genes (ranked by p-value) from each pairwise comparison involving cluster 8. For example, the set of genes with Top = 1 contains genes with the lowest p-value from each pairwise comparison. Similarly, the set of genes with Top ≤ 10 contains the top 10 genes from each pairwise comparison.

The Top field represents findMarkers()’s approach to consolidate multiple pairwise comparisons into a single ranking for each cluster; each DataFrame produced by findMarkers() will order genes based on the Top value by default.

# extract results for one of the clusters
c8_markers_default <- markers_default[[8]]
c8_markers_default[1:10, c(1, 5:14)]

## DataFrame with 10 rows and 11 columns
##              Top   logFC.1   logFC.2   logFC.3   logFC.4   logFC.5   logFC.6
##        <integer> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## LYZ            1  4.508769  4.495563  4.030680  4.638569  4.639435  4.605950
## SRGN           1  1.954566  2.536376  2.271287  2.503678  2.546259  2.293513
## RPS27          1 -0.387626  0.146651 -0.233198  0.424884 -0.206971 -1.747398
## HBA1           1  0.764925  0.590209  0.748297  1.043335  0.194616  0.305193
## TYROBP         2  1.944560  2.092456  2.221534  2.093691  2.082816  2.090002
## CYBA           2  1.378884  1.890397  1.588554  1.758181  1.233403  1.732687
## LGALS1         2  2.107687  2.641486  2.457002  2.605708  2.579991  2.635851
## CST3           2  1.876773  2.301150  2.320934  2.381252  2.291110  2.397640
## GAPDH          2  0.352374  1.403788  0.611897  0.573315  2.152074  1.867548
## STMN1          2 -0.539725 -0.156797 -0.920125 -2.150276 -0.223848  0.531039
##          logFC.7   logFC.9  logFC.10  logFC.11
##        <numeric> <numeric> <numeric> <numeric>
## LYZ     4.473449  3.753146  4.504874  4.481884
## SRGN    2.493355  2.378514  1.146504  2.446328
## RPS27   0.475726  0.134966 -0.601973 -1.140086
## HBA1   -4.833203 -9.844399 -0.226184 -0.414213
## TYROBP  2.088786  1.980865  1.669684  2.076899
## CYBA    2.232252  1.759892  1.295630  0.759375
## LGALS1  2.634804  2.320965  2.479767  2.638351
## CST3    2.291425  2.146728  2.355764  2.406259
## GAPDH   1.224128  1.837913  1.635339  2.110766
## STMN1  -0.133436  0.136129  0.545440  0.461136

We can then use the Top field to identify a set of genes that is guaranteed to distinguish cluster 8 from any other cluster.

# identify set of genes in the top 3 p-value ranking
c8_markers_default[c8_markers_default$Top <= 3, ]

## DataFrame with 18 rows and 14 columns
##              Top      p.value          FDR summary.logFC   logFC.1   logFC.2
##        <integer>    <numeric>    <numeric>     <numeric> <numeric> <numeric>
## LYZ            1 4.78916e-105 1.21097e-101       4.50487  4.508769  4.495563
## SRGN           1  3.25193e-73  2.30237e-70       2.53638  1.954566  2.536376
## RPS27          1  7.21116e-97  1.41819e-93      -1.74740 -0.387626  0.146651
## HBA1           1 7.37087e-189 1.30464e-184      -9.84440  0.764925  0.590209
## TYROBP         2  3.51942e-50  8.53340e-48       2.22153  1.944560  2.092456
## ...          ...          ...          ...           ...       ...       ...
## FTL            3  9.56506e-74  7.05423e-71       3.08109  1.850987  2.680720
## TSPO           3  1.12719e-43  2.05682e-41       1.21579  0.973690  1.160254
## HBA2           3 8.18076e-185 4.82665e-181      -9.96456  1.008697  0.397922
## S100A4         3  8.36739e-73  5.61038e-70       2.99005  2.420962  2.991515
## MALAT1         3  1.51323e-86  2.06032e-83      -3.04268 -0.599402 -1.737540
##          logFC.3   logFC.4   logFC.5   logFC.6   logFC.7   logFC.9  logFC.10
##        <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## LYZ     4.030680  4.638569  4.639435  4.605950  4.473449  3.753146  4.504874
## SRGN    2.271287  2.503678  2.546259  2.293513  2.493355  2.378514  1.146504
## RPS27  -0.233198  0.424884 -0.206971 -1.747398  0.475726  0.134966 -0.601973
## HBA1    0.748297  1.043335  0.194616  0.305193 -4.833203 -9.844399 -0.226184
## TYROBP  2.221534  2.093691  2.082816  2.090002  2.088786  1.980865  1.669684
## ...          ...       ...       ...       ...       ...       ...       ...
## FTL     2.465550  3.081093  2.262029  1.845478  1.848227  0.210337  2.266881
## TSPO    1.215786  1.070974  1.055939  1.071418  0.914459  1.029131  0.986104
## HBA2    0.740943  1.273823  0.335585  0.482073 -4.516484 -9.964555 -0.310510
## S100A4  3.108044  2.994276  2.990050  2.690913  1.750877  2.579884  1.688207
## MALAT1 -1.193104 -0.433391 -2.031141 -3.042685  0.459732 -0.516974 -2.501476
##         logFC.11
##        <numeric>
## LYZ     4.481884
## SRGN    2.446328
## RPS27  -1.140086
## HBA1   -0.414213
## TYROBP  2.076899
## ...          ...
## FTL     1.704198
## TSPO    1.108172
## HBA2   -0.357407
## S100A4  2.816049
## MALAT1 -2.018310

As we suspected, cluster 8 is likely to contain monocytes, based on the expression of CST3 and TYROBP, for example. Each DataFrame also contains several other statistics that may be of interest. The summary.logFC field provides a convenient summary of the direction and effect size for each gene, and is defined here as the log-fold change from the comparison with the lowest p-value. The p.value field contains the combined p-value that is obtained by applying Simes’ method to the pairwise p-values for each gene and represents the evidence against the joint null hypothesis, i.e., that the gene is not DE between cluster 8 and any other cluster. Examination of these statistics permits a quick evaluation of the suitability of a candidate marker; if both of these metrics are poor (small log-fold change, large p-value), the gene can most likely be dismissed.

We can visualise the expression of some of these genes as we have done before (for the purposes of visualisation only we can use the MNN-corrected values on the t-SNE plot):

# visualise one of the top genes using MNN-corrected values
plotTSNE(corrected, 
         colour_by = "LYZ", 
         text_by = "louvain", 
         by_exprs_values = "reconstructed")

# visualise the expression of the gene on the uncorrected values
plotExpression(uncorrected, 
               features = "LYZ", 
               x = "louvain")

3.1.1 Tweaking differential expression analysis

3.1.1.1 Focusing on upregulated genes

Our previous findMarkers() call considered both up- and down-regulated genes to be potential markers. However, down-regulated genes are less appealing as markers as it is more difficult to interpret and experimentally validate an absence of expression. If, for example, we wanted a positive marker for monocytes, the “HBA1” gene we obtained in our previous analysis would not be a great candidate:

Modify the previous call to the findMarkers() function to include only those genes that are up-regulated in each cluster when compared to the others.

# Exercise: Test for up-regulated genes ----

# HBA1 gene was one of the top genes in previous analysis
# but it is NOT expressed in cluster 8
plotExpression(uncorrected, 
               features = "HBA1", 
               x = "louvain")

# modify the previous call to findMarkers to focus on genes that are up-regulated
markers_up <- findMarkers(FIXME)

# extract the results for cluster 8 and check that this gene is no longer on top
c8_markers_up <- FIXME

# can you find out what the rank of HBA1 is now?

Answer

To focus on up-regulated markers, we can instead perform a one-sided t-test to identify genes that are up-regulated in each cluster compared to the others.

# modify the previous call to findMarkers to focus on genes that are up-regulated
markers_up <- findMarkers(
  uncorrected, 
  groups = factor(corrected$louvain), # clusters to compare
  block = uncorrected$SampleGroup,    # covariates in statistical model
  test.type = "t",   # t-test (default)
  direction = "up", # test for up-regulated genes only
  lfc = 0, # null hypothesis log-fold-change = 0 (default)
  pval.type = "any" # ranking of p-values based on any comparison (default)
)

The result of the function is a list, and we can extract the 8th element of that list, which contains the comparisons between cluster 8 and all other clusters:

# extract the results for cluster 8 and check that this gene is no longer on top
c8_markers_up <- markers_up[[8]]
rownames(c8_markers_up)[c8_markers_up$Top <= 3]

##  [1] "LYZ"    "SRGN"   "TYROBP" "CYBA"   "FTL"    "LGALS1" "CST3"   "GAPDH" 
##  [9] "S100A6" "TMSB10" "TSPO"   "S100A4"

We can see that the “HBA1” gene is no longer on the top of the marker list. In fact, it’s rank in the data is now 72nd:

c8_markers_up["HBA1", ]

## DataFrame with 1 row and 14 columns
##            Top    p.value        FDR summary.logFC   logFC.1   logFC.2
##      <integer>  <numeric>  <numeric>     <numeric> <numeric> <numeric>
## HBA1        72 4.2558e-20 4.8287e-18       1.04333  0.764925  0.590209
##        logFC.3   logFC.4   logFC.5   logFC.6   logFC.7   logFC.9  logFC.10
##      <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## HBA1  0.748297   1.04333  0.194616  0.305193   -4.8332   -9.8444 -0.226184
##       logFC.11
##      <numeric>
## HBA1 -0.414213

3.1.1.2 Using log-fold changes

The t-test also allows us to specify a non-zero log-fold change as the null hypothesis. This allows us to consider the magnitude of the log-fold change in our p-value calculations, in a manner that is more rigorous than simply filtering directly on the log-fold changes (McCarthy and Smyth 2009). Specifically, a simple threshold does not consider the variance and can enrich for genes that have both large log-fold changes and large variances.

For example, take these two genes, which originally came on top of our analysis:

While TSPO is more highly expressed in cluster 8 compared to other clusters, the difference between its expression to other clusters is not nearly as impressive as that of LYZ. It also seems to have quite a bit of variation in its expression, even within cluster 8.

Therefore, we may want to test for the null hypothesis that a gene’s LFC is greater than a particular threshold. We perform this by setting lfc= in our findMarkers() call - when combined with direction=, this tests for genes with log-fold changes that are significantly greater than 1:

# testing for the alternative hypothesis that LFC > 1
markers_up_lfc1 <- findMarkers(
  uncorrected, 
  groups = factor(corrected$louvain), # clusters to compare
  block = uncorrected$SampleGroup,    # covariates in statistical model
  test.type = "t",   # t-test (default)
  direction = "up", # test for up-regulated genes only
  lfc = 1, # null hypothesis log-fold-change = 1
  pval.type = "any" # ranking of p-values based on any comparison (default)
)

These two settings yield a more focused set of candidate marker genes that are up-regulated in cluster 8.

# fetching top markers for cluster 8
c8_markers_up_lfc1 <- markers_up_lfc1[[8]]
c8_markers_up_lfc1[c8_markers_up_lfc1$Top <= 3, ]

## DataFrame with 11 rows and 14 columns
##              Top     p.value         FDR summary.logFC   logFC.1   logFC.2
##        <integer>   <numeric>   <numeric>     <numeric> <numeric> <numeric>
## LYZ            1 4.96624e-83 8.79025e-79       4.50487   4.50877   4.49556
## SRGN           1 4.44830e-43 9.84187e-40       2.53638   1.95457   2.53638
## TMSB10         2 1.65475e-66 1.46445e-62       3.45637   1.71638   1.24162
## LGALS1         2 1.86424e-43 4.71386e-40       2.64149   2.10769   2.64149
## CST3           2 1.98630e-38 3.51575e-35       2.40626   1.87677   2.30115
## S100A9         2 7.20947e-28 1.06340e-24       3.51985   3.48385   3.37937
## S100A4         2 2.58728e-47 9.15898e-44       2.99005   2.42096   2.99151
## TYROBP         3 1.94507e-22 2.15173e-19       2.22153   1.94456   2.09246
## FTL            3 4.00049e-46 1.18014e-42       3.08109   1.85099   2.68072
## S100A6         3 3.69288e-40 7.26267e-37       2.76863   2.37368   2.79871
## TMSB4X         3 2.11151e-61 1.24579e-57       4.16022   2.05849   2.04097
##          logFC.3   logFC.4   logFC.5   logFC.6   logFC.7   logFC.9  logFC.10
##        <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## LYZ      4.03068   4.63857  4.639435  4.605950   4.47345  3.753146 4.5048742
## SRGN     2.27129   2.50368  2.546259  2.293513   2.49336  2.378514 1.1465040
## TMSB10   1.06105   1.24539  0.984864  0.203907   3.45637  2.222570 1.0449568
## LGALS1   2.45700   2.60571  2.579991  2.635851   2.63480  2.320965 2.4797673
## CST3     2.32093   2.38125  2.291110  2.397640   2.29142  2.146728 2.3557640
## S100A9   2.94559   3.51985  3.323654  3.456419   3.40292  2.508729 3.3859858
## S100A4   3.10804   2.99428  2.990050  2.690913   1.75088  2.579884 1.6882065
## TYROBP   2.22153   2.09369  2.082816  2.090002   2.08879  1.980865 1.6696839
## FTL      2.46555   3.08109  2.262029  1.845478   1.84823  0.210337 2.2668814
## S100A6   2.69569   2.76863  2.698411  2.073187   1.22674  2.250100 1.7855722
## TMSB4X   1.51301   1.85123  2.075043  0.180026   4.16022  2.373738 0.0546425
##         logFC.11
##        <numeric>
## LYZ     4.481884
## SRGN    2.446328
## TMSB10  0.560577
## LGALS1  2.638351
## CST3    2.406259
## S100A9  3.348772
## S100A4  2.816049
## TYROBP  2.076899
## FTL     1.704198
## S100A6  2.404407
## TMSB4X  0.676823

Of course, this increased stringency is not without cost. If only up-regulated genes are requested from findMarkers(), any cluster defined by down-regulation of a marker gene will not contain that gene among the top set of features in its DataFrame. This is occasionally relevant for sub-types or other states that are distinguished by high versus low expression of particular genes. Similarly, setting an excessively high log-fold change threshold may discard otherwise useful genes. For example, a gene up-regulated in a small proportion of cells of a cluster will have a small log-fold change but can still be an effective marker if the focus is on specificity rather than sensitivity.

3.1.1.3 Finding cluster-specific markers

By default, findMarkers() will give a high ranking to genes that are differentially expressed in any pairwise comparison. This means that a gene only needs a very low p-value in a single pairwise comparison to achieve a low Top value.

Take the TMSB10 gene, which does seem indeed more highly expressed in cluster 8, but only compared to a couple of other clusters (7 and maybe 9).

While this gene is partially contributing to the distinction between clusters, it is not the most diagnostic gene for cluster 8 (if that is what we were interested in).

A more stringent approach would only consider genes that are differentially expressed in all pairwise comparisons involving the cluster of interest. To achieve this, we set pval.type="all" in findMarkers() to use an intersection-union test (Berger and Hsu 1996) where the combined p-value for each gene is the maximum of the p-values from all pairwise comparisons. A gene will only achieve a low combined p-value if it is strongly DE in all comparisons to other clusters.

# ranking based on the maximum p-value across all pairwise comparisons
markers_up_all <- findMarkers(
  uncorrected, 
  groups = factor(corrected$louvain), # clusters to compare
  block = uncorrected$SampleGroup,    # covariates in statistical model
  test.type = "t",   # t-test (default)
  direction = "up", # test for up-regulated genes only
  lfc = 0, # null hypothesis log-fold-change = 1
  pval.type = "all" # ranking of p-values based on all comparisons
)

In this case, the resulting tables do not include a Top column any more, as the ranking is simply based on the maximum p-value observed across all comparisons. The table is now simply ranked from low-to-high p-value.

# fetching top markers for cluster 8
c8_markers_up_all <- markers_up_all[[8]]
c8_markers_up_all[1:10, ]

## DataFrame with 10 rows and 13 columns
##            p.value         FDR summary.logFC   logFC.1   logFC.2   logFC.3
##          <numeric>   <numeric>     <numeric> <numeric> <numeric> <numeric>
## LYZ    4.58297e-38 8.11186e-34      4.030680  4.508769  4.495563  4.030680
## CFD    8.35781e-31 7.39667e-27      1.145244  1.145244  1.267494  1.491151
## TYROBP 6.64174e-29 3.91863e-25      1.669684  1.944560  2.092456  2.221534
## CSTA   7.13344e-28 3.15655e-24      0.856159  0.868055  0.873356  0.756795
## FCN1   1.86524e-26 6.60296e-23      1.332588  1.473127  1.459802  1.305396
## IFI30  3.70744e-26 1.09370e-22      1.213788  1.299619  1.213788  1.320515
## S100A8 4.24827e-25 1.07421e-21      2.668966  3.449839  3.139597  3.039399
## TSPO   7.06501e-24 1.56313e-20      1.029131  0.973690  1.160254  1.215786
## LGALS1 4.70889e-23 9.26081e-20      2.457002  2.107687  2.641486  2.457002
## S100A9 7.74160e-22 1.37026e-18      2.945589  3.483847  3.379367  2.945589
##          logFC.4   logFC.5   logFC.6   logFC.7   logFC.9  logFC.10  logFC.11
##        <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## LYZ     4.638569  4.639435  4.605950  4.473449  3.753146  4.504874   4.48188
## CFD     1.211279  1.197487  1.189063  1.207690  1.229574  1.201536   1.21046
## TYROBP  2.093691  2.082816  2.090002  2.088786  1.980865  1.669684   2.07690
## CSTA    0.892841  0.900836  0.898464  0.876403  0.856159  0.881672   0.89001
## FCN1    1.470012  1.477605  1.486224  1.443313  1.332588  1.431616   1.44512
## IFI30   1.290975  1.219932  1.288449  1.268958  1.270440  1.269647   1.16200
## S100A8  3.553553  3.421135  3.466660  3.353011  2.668966  3.347252   3.27646
## TSPO    1.070974  1.055939  1.071418  0.914459  1.029131  0.986104   1.10817
## LGALS1  2.605708  2.579991  2.635851  2.634804  2.320965  2.479767   2.63835
## S100A9  3.519849  3.323654  3.456419  3.402917  2.508729  3.385986   3.34877

We can confirm that the TMSB10 gene, for example, is now quite far down the list:

which(rownames(c8_markers_up_all) == "TMSB10")

## [1] 172

3.2 Visualization of marker genes

We have already seen how we can use the plotExpression() function to visualise the distribution of expression in our data between clusters.

plotExpression(uncorrected, 
               features = "CST3",
               x = "louvain")

We have also seen how to use plotReducedDim() (and the companion functions plotTSNE() and plotUMAP()) to visualise a gene’s expression on the projected reduced dimensionality space. In this case we used corrected sce object for plotting expression of a gene on tSNE and UMAP plots. This can be more aesthetically pleasing than uncorrected expression values that may contain large shifts on the colour scale between cells in different batches. Use of the corrected values in any quantitative procedure should be treated with caution, and should be backed up by similar results from an analysis on the uncorrected values.

plotTSNE(corrected, colour_by = "CST3", by_exprs_values = "reconstructed")

Another useful type of visualisation is to use heatmaps to show the expression of these genes of interest. We can make two types of heatmap, either based on fold-change differences or on the actual expression of the genes.

# Heatmaps ----

# select some top genes for cluster 8
c8_top10 <- c8_markers_up_lfc1[c8_markers_up_lfc1$Top <= 10, ]

# heatmap of expression values
plotHeatmap(uncorrected, 
            features = rownames(c8_top10),
            order_columns_by = c("louvain", "SampleGroup"))

# heatmap of log-fold-changes
pheatmap(c8_top10[, 5:14], 
         breaks=seq(-5, 5, length.out=101))

4 Alternative Testing Strategies

4.1 Wilcoxon rank-sum test

The Wilcoxon rank-sum test (also known as the Wilcoxon-Mann-Whitney test, or WMW test) is another widely-used method for pairwise comparisons between groups of observations. Its strength lies in the fact that it directly assesses separation between the expression distributions of different clusters. The WMW test statistic is proportional to the area-under-the-curve (AUC), a metric which reflects the probability of a randomly-picked cell from one cluster having higher expression than another randomly-picked cell of a different cluster (also known as the concordance probability). In a pairwise comparison, AUCs of 1 or 0 indicate that the two clusters have perfectly separated expression distributions. Thus, the WMW test directly addresses the most desirable property of a candidate marker gene, while the t-test only does so indirectly via the difference in the means and the intra-group variance.

We perform WMW tests by again using the findMarkers() function, this time with test="wilcox". This returns a list of DataFrames containing ranked candidate markers for each cluster. The direction=, lfc= and pval.type= arguments can be specified and have the same interpretation as described for t-tests. We demonstrate below by detecting upregulated genes in each cluster with direction="up".

# Alternative testing strategies ----

# Wilcoxon rank-sum test
markers_wilcox_up <- findMarkers(
  uncorrected, 
  groups = uncorrected$louvain, # clusters to compare
  block = uncorrected$SampleGroup,    # covariates in statistical model
  test.type = "wilcox",   # t-test (default)
  direction = "up"
)

To explore the results in more detail, we focus on the DataFrame for cluster 8. The interpretation of Top is the same as described for t-tests, and Simes’ method is again used to combine p-values across pairwise comparisons. If we want more focused sets, we can also change pval.type= as previously described.

c8_markers_wilcox_up <- markers_wilcox_up[[8]]
head(c8_markers_wilcox_up)

## DataFrame with 6 rows and 14 columns
##               Top      p.value         FDR summary.AUC     AUC.1     AUC.2
##         <integer>    <numeric>   <numeric>   <numeric> <numeric> <numeric>
## LYZ             1 2.04715e-102 3.62346e-98    0.992062  0.991700  0.992331
## S100A11         1  4.12903e-62 3.48018e-59    0.831896  0.818706  0.855317
## TMSB4X          1  9.76508e-49 3.67749e-46    0.966239  0.950197  0.943456
## TMSB10          2  3.85789e-39 9.10461e-37    0.958903  0.958903  0.825562
## CST3            2 3.09093e-100 2.73548e-96    0.957929  0.904469  0.947904
## SRGN            2  8.92254e-74 1.57929e-70    0.941599  0.910036  0.950000
##             AUC.3     AUC.4     AUC.5     AUC.6     AUC.7     AUC.9    AUC.10
##         <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## LYZ      0.995435  0.993208  0.992206  0.992062  0.991065  0.965150  0.987452
## S100A11  0.895014  0.840087  0.827618  0.831896  0.843432  0.831546  0.808736
## TMSB4X   0.925913  0.957344  0.966239  0.589537  0.999023  0.958362  0.540314
## TMSB10   0.893258  0.920626  0.820001  0.601207  0.989320  0.936008  0.855671
## CST3     0.971208  0.955383  0.946241  0.957929  0.945728  0.909439  0.953412
## SRGN     0.878862  0.958508  0.965726  0.941599  0.963807  0.930193  0.801394
##            AUC.11
##         <numeric>
## LYZ      0.990860
## S100A11  0.832479
## TMSB4X   0.794249
## TMSB10   0.771344
## CST3     0.958872
## SRGN     0.956559

The DataFrame contains the AUCs from comparing cluster 8 to every other cluster. A value greater than 0.5 indicates that the gene is up-regulated in the current cluster compared to the other cluster, while values less than 0.5 correspond to down-regulation. We would typically expect AUCs of >0.7 for a strongly up-regulated candidate marker (equivalent to a >70% chance that a cell in cluster 8 has higher expression than a cell in the other cluster).

# make a heatmap of AUC values
# we use a custom colour palette that diverges around 0.5
# we optionally do not cluster rows to keep genes in their ranking order
pheatmap(c8_markers_wilcox_up[c8_markers_wilcox_up$Top <= 6, 5:14],
         breaks = seq(0, 1, length.out = 21),
         color = viridis::cividis(21), 
         cluster_rows = FALSE)

4.2 Using a binomial test

The binomial test identifies genes that differ in the proportion of cells expressing a gene between clusters. (For the purposes of this section, a cell is considered to express a gene simply if it has non-zero expression for that gene.) This represents a much more stringent definition of marker genes compared to the other methods, as differences in expression between clusters are effectively ignored if both distributions of expression values are not near zero. The premise is that genes are more likely to contribute to important biological decisions if they were active in one cluster and silent in another, compared to more subtle “tuning” effects from changing the expression of an active gene. From a practical perspective, a binary measure of presence/absence might be easier to validate.

We perform pairwise binomial tests between clusters using the findMarkers() function with test="binom". This returns a list of DataFrames containing marker statistics for each cluster such as the Top rank and its p-value. Here, the effect size is reported as the log-fold change in this proportion between each pair of clusters. Large positive log-fold changes indicate that the gene is more frequently expressed in one cluster compared to the other. We focus on genes that are up-regulated in each cluster compared to the others by setting direction="up".

# Binomial test of proportions
markers_binom_up <- findMarkers(
  uncorrected, 
  groups = uncorrected$louvain, # clusters to compare
  block = uncorrected$SampleGroup,    # covariates in statistical model
  test.type = "binom",   # t-test (default)
  direction = "up"
)

# make a heatmap of expression values for top genes in cluster 8
c8_markers_binom_up <- markers_binom_up[[8]]
plotExpression(uncorrected, 
               x = "louvain",
               features = rownames(c8_markers_binom_up)[1:4])

We can see from the plot above that indeed these top genes show a high proportion of cells being expressed in cluster 8 compared to other clusters, where they are only detected in very few cells.

4.3 Combining multiple marker statistics

On occasion, we might want to combine marker statistics from several testing regimes into a single DataFrame. This allows us to easily inspect multiple statistics at once to verify that a particular gene is a strong candidate marker. For example, a large AUC from the WMW test indicates that the expression distributions are well-separated between clusters, while the log-fold change reported with the t-test provides a more interpretable measure of the magnitude of the change in expression. We use the multiMarkerStats() to merge the results of separate findMarkers() calls into one DataFrame per cluster, with statistics interleaved to facilitate a direct comparison between different test regimes.

# Combining multiple tests ----

markers_combined <- multiMarkerStats(
  t = findMarkers(
    uncorrected,
    groups = uncorrected$louvain,
    direction = "up",
    block = uncorrected$SampleGroup
  ),
  wilcox = findMarkers(
    uncorrected,
    groups = uncorrected$louvain,
    test = "wilcox",
    direction = "up",
    block = uncorrected$SampleGroup
  ),
  binom = findMarkers(
    uncorrected,
    groups = uncorrected$louvain,
    test = "binom",
    direction = "up",
    block = uncorrected$SampleGroup
  )
)

In addition, multiMarkerStats() will compute a number of new statistics by combining the per-regime statistics. The combined Top value is obtained by simply taking the largest Top value across all tests for a given gene, while the reported p.value is obtained by taking the largest p-value. Ranking on either metric focuses on genes with robust differences that are highly ranked and detected by each of the individual testing regimes. Of course, this might be considered an overly conservative approach in practice, so it is entirely permissible to re-rank the DataFrame according to the Top or p.value for an individual regime (effectively limiting the use of the other regimes’ statistics to diagnostics only).

# the first few rows and columns of the combined results table
markers_combined[[8]][1:10 , 1:9]

## DataFrame with 10 rows and 9 columns
##              Top     p.value         FDR     t.Top wilcox.Top binom.Top
##        <integer>   <numeric>   <numeric> <integer>  <integer> <integer>
## CST3           2 5.57951e-64 9.87572e-60         2          2         1
## LGALS1         3 8.14657e-59 7.20971e-55         2          3         3
## LYZ            4 5.84449e-53 3.44825e-49         1          1         4
## TYROBP         4 1.75971e-50 6.22938e-47         2          4         1
## SRGN           5 5.29881e-34 6.69921e-31         1          2         5
## AIF1           5 2.58616e-35 3.81459e-32         5          4         3
## ARPC1B         6 1.25106e-30 9.62775e-28         6          6         3
## FCER1G         6 1.17964e-51 5.21989e-48         6          2         1
## LST1           6 5.55503e-49 1.63873e-45         6          5         1
## CFD           10 1.85873e-34 2.53073e-31        10          9         1
##           t.p.value wilcox.p.value binom.p.value
##           <numeric>      <numeric>     <numeric>
## CST3    1.83702e-70   3.09093e-100   5.57951e-64
## LGALS1  7.37835e-73    2.71831e-97   8.14657e-59
## LYZ    2.39458e-105   2.04715e-102   5.84449e-53
## TYROBP  1.75971e-50    6.99064e-95   5.28540e-68
## SRGN    1.62597e-73    8.92254e-74   5.29881e-34
## AIF1    2.56540e-50    1.01667e-56   2.58616e-35
## ARPC1B  5.58892e-50    2.27130e-38   1.25106e-30
## FCER1G  1.17964e-51    3.92193e-89   1.02973e-59
## LST1    5.55503e-49    3.38337e-79   1.02075e-51
## CFD     1.85873e-34    1.83591e-71   1.31818e-55

It is worth noting that the procedure used here for identifying cluster marker genes, although useful in practice, is in fact statistically flawed. The reason is that the cell clusters are themselves defined based on gene expression, and so asking the question of whether there are differentially expressed genes between those clusters is somewhat circular. This is known as “data dredging”, “fishing” or “data snooping”. To learn more about this issue in the context of single-cell analysis, see the advanced chapter on the OSCA book on \(p\)-value invalidity.

The main thing to remember is that, in practice, this is a valid approach to help us annotate groups of cells based on the expression of genes with known cell-specificity and to find new interesting genes for further experiments and validation (e.g. using microscopy or qPCR). In other words, identifying cluster marker genes should be taken as a way to generate new hypothesis from our data, rather than a valid statistical model to test for differential expression between cell types.

5 Session information

sessionInfo()

sessionInfo()

## R version 4.1.1 (2021-08-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
## 
## locale:
##  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
##  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
##  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] forcats_0.5.1               stringr_1.4.0              
##  [3] dplyr_1.0.7                 purrr_0.3.4                
##  [5] readr_2.0.2                 tidyr_1.1.4                
##  [7] tibble_3.1.6                tidyverse_1.3.1            
##  [9] pheatmap_1.0.12             scran_1.20.1               
## [11] scater_1.20.1               ggplot2_3.3.5              
## [13] scuttle_1.2.1               SingleCellExperiment_1.14.1
## [15] SummarizedExperiment_1.22.0 Biobase_2.52.0             
## [17] GenomicRanges_1.44.0        GenomeInfoDb_1.28.4        
## [19] IRanges_2.26.0              S4Vectors_0.30.2           
## [21] BiocGenerics_0.38.0         MatrixGenerics_1.4.3       
## [23] matrixStats_0.61.0         
## 
## loaded via a namespace (and not attached):
##  [1] ggbeeswarm_0.6.0          colorspace_2.0-2         
##  [3] ellipsis_0.3.2            rprojroot_2.0.2          
##  [5] bluster_1.2.1             XVector_0.32.0           
##  [7] BiocNeighbors_1.10.0      fs_1.5.0                 
##  [9] rstudioapi_0.13           fansi_0.5.0              
## [11] lubridate_1.8.0           xml2_1.3.2               
## [13] codetools_0.2-18          sparseMatrixStats_1.4.2  
## [15] knitr_1.36                jsonlite_1.7.2           
## [17] broom_0.7.10              cluster_2.1.2            
## [19] dbplyr_2.1.1              compiler_4.1.1           
## [21] httr_1.4.2                dqrng_0.3.0              
## [23] backports_1.3.0           assertthat_0.2.1         
## [25] Matrix_1.3-4              fastmap_1.1.0            
## [27] cli_3.1.0                 limma_3.48.3             
## [29] BiocSingular_1.8.1        htmltools_0.5.2          
## [31] tools_4.1.1               rsvd_1.0.5               
## [33] igraph_1.2.8              gtable_0.3.0             
## [35] glue_1.5.0                GenomeInfoDbData_1.2.6   
## [37] Rcpp_1.0.7                cellranger_1.1.0         
## [39] jquerylib_0.1.4           vctrs_0.3.8              
## [41] DelayedMatrixStats_1.14.3 xfun_0.28                
## [43] beachmat_2.8.1            rvest_1.0.2              
## [45] lifecycle_1.0.1           irlba_2.3.3              
## [47] statmod_1.4.36            edgeR_3.34.1             
## [49] zlibbioc_1.38.0           scales_1.1.1             
## [51] hms_1.1.1                 RColorBrewer_1.1-2       
## [53] yaml_2.2.1                gridExtra_2.3            
## [55] sass_0.4.0                stringi_1.7.5            
## [57] highr_0.9                 ScaledMatrix_1.0.0       
## [59] BiocParallel_1.26.2       rlang_0.4.12             
## [61] pkgconfig_2.0.3           bitops_1.0-7             
## [63] evaluate_0.14             lattice_0.20-44          
## [65] tidyselect_1.1.1          here_1.0.1               
## [67] magrittr_2.0.1            R6_2.5.1                 
## [69] generics_0.1.1            metapod_1.0.0            
## [71] DelayedArray_0.18.0       DBI_1.1.1                
## [73] pillar_1.6.4              haven_2.4.3              
## [75] withr_2.4.2               RCurl_1.98-1.5           
## [77] modelr_0.1.8              crayon_1.4.2             
## [79] utf8_1.2.2                tzdb_0.2.0               
## [81] rmarkdown_2.11            viridis_0.6.2            
## [83] locfit_1.5-9.4            grid_4.1.1               
## [85] readxl_1.3.1              reprex_2.0.1             
## [87] digest_0.6.28             munsell_0.5.0            
## [89] beeswarm_0.4.0            viridisLite_0.4.0        
## [91] vipor_0.4.5               bslib_0.3.1

Introduction to single-cell RNA-seq analysis

Cluster marker genes

Stephane Ballereau, Zeynep Kalender Atak, Hugo Tavares

Nov 2021