1 Cluster marker genes
2 Exercise
3 Extra Material
- 3.1 Invalidity of p-values
  - 3.1.1 From data snooping
  - 3.1.2 Nature of replication
- 3.2 Session information

1 Cluster marker genes

Source: we will follow the OSCA chapter on marker detection (with some of its text copied here with little modification). See also the Hemberg group chapter on differential analysis section.

To interpret our clustering results, we identify the genes that drive separation between clusters. These marker genes allow us to assign biological meaning to each cluster based on their functional annotation. In the most obvious case, the marker genes for each cluster are a priori associated with particular cell types, allowing us to treat the clustering as a proxy for cell type identity. The same principle can be applied to discover more subtle differences between clusters (e.g., changes in activation or differentiation state) based on the behavior of genes in the affected pathways.

Identification of marker genes is usually based around the retrospective detection of differential expression between clusters. Genes that are more strongly DE are more likely to have caused separate clustering of cells in the first place. Several different statistical tests are available to quantify the differences in expression profiles, and different approaches can be used to consolidate test results into a single ranking of genes for each cluster.

1.1 Learning objectives

identify genes that differentially expressed between clusters,
exclusively or not,
using different methods that test:
- the mean expression level,
- the whole distribution,
- or the proportion of cells expressing the gene
compile a summary table.

1.2 Load data

Always go back to RNA assay (or similar) for doing differential expression.

Altough it’s tempting to use the corrected expression values for gene-based analyses like DE-based marker gene detection, his is not generally recommended. Batch correction algorithms are not obliged to preserve the magnitude (or even direction) of differences in per-gene expression when attempting to align multiple batches. For example, cosine normalization in fastMNN() shrinks the magnitude of the expression values so that the computed log-fold changes have no obvious interpretation. Of greater concern is the possibility that the correction introduces artificial agreement across batches.

For this reason, we will use the uncorrected (normalized) expression values for marker gene identification between clusters obtained after data integration.

library(ggplot2)
library(scater)
library(scran)
library(dplyr)
library(RColorBrewer)
library(pheatmap)
library(glue)

# load both sce objects
uncorrected <- readRDS("~/Course_Materials/scRNAseq/Robjects/DataIntegration_uncorrected.rds")
corrected <- readRDS("~/Course_Materials/scRNAseq/Robjects/caron_postDeconv_5hCellPerSpl_dsi_PBMMC_ETV6-RUNX1_clust.Rds")

plotTSNE(corrected, colour_by="louvain", text_by="louvain")

1.3 Detecting genes differentially expressed between clusters

1.3.1 Differential expression analysis

For each cluster, we will identify genes whose expression differ to that of other clusters, for each pair of cluster, using scran::findMarkers(). The function fits a linear model to the log-expression values for each gene using limma [@doi:10.1093/nar/gkv007] and allows testing for differential expression in each cluster compared to the others while accounting for known, uninteresting factors.

We will first identify genes whose average expression differ between clusters, using the Welch t-test (default) with a log-fold change threshold of 0 (default) and ranking genes based on the outcome of any of the pairwise comparisons (default).

# We add a 'c' prefix (for-cluster) to avoid any confusion: these values are labels, not numerical values

clusters.mnn <- factor(paste0("c",corrected$louvain))

markers.out <- findMarkers(uncorrected, 
                           groups=clusters.mnn, 
                           block=uncorrected$SampleGroup,
                           row.data=rowData(uncorrected))

The default philosophy of findMarkers() is to identify a combination of marker genes that - together - uniquely define one cluster against the rest. To this end, we collect the top DE genes from each pairwise comparison involving a particular cluster to assemble a set of candidate markers for that cluster. We will demonstrate on cluster 8; the relevant DataFrame contains log2-fold changes of expression in cluster 8 over each other cluster, along with several statistics obtained by combining p-values across the pairwise comparisons involving 8.

c8_markers <- markers.out[["c8"]]
colnames(c8_markers)

##  [1] "ID"            "Symbol"        "Type"          "Chromosome"   
##  [5] "mean"          "detected"      "mean"          "detected"     
##  [9] "gene_sparsity" "mean"          "total"         "tech"         
## [13] "bio"           "p.value"       "FDR"           "per.block"    
## [17] "chosenHvg"     "Top"           "p.value"       "FDR"          
## [21] "summary.logFC" "logFC.c1"      "logFC.c10"     "logFC.c11"    
## [25] "logFC.c2"      "logFC.c3"      "logFC.c4"      "logFC.c5"     
## [29] "logFC.c6"      "logFC.c7"      "logFC.c9"

Of particular interest is the Top field. The set of genes with Top ≤ X is the union of the top X genes (ranked by p-value) from each pairwise comparison involving cluster 8. For example, the set of all genes with Top values of 1 contains the gene with the lowest p-value from each comparison. Similarly, the set of genes with Top values less than or equal to 10 contains the top 10 genes from each comparison. The Top field represents findMarkers()’s approach to consolidating multiple pairwise comparisons into a single ranking for each cluster; each DataFrame produced by findMarkers() will order genes based on the Top value by default.

c8_markers[1:10,c(2,7:14)]

## DataFrame with 10 rows and 9 columns
##                      Symbol      mean  detected gene_sparsity    mean.1
##                 <character> <numeric> <numeric>     <numeric> <numeric>
## ENSG00000090382         LYZ  1.767455   9.05455    0.90945455  0.350429
## ENSG00000122862        SRGN  0.854727  24.45455    0.75545455  0.357208
## ENSG00000177954       RPS27 36.327273  99.32727    0.00672727  4.831956
## ENSG00000206172        HBA1 51.327273  45.29091    0.54709091  2.212123
## ENSG00000011600      TYROBP  0.340000   7.58182    0.92418182  0.158158
## ENSG00000051523        CYBA  2.613091  71.20000    0.28800000  1.353937
## ENSG00000100097      LGALS1  0.772364  14.00000    0.86000000  0.249622
## ENSG00000101439        CST3  0.510182  15.09091    0.84909091  0.288454
## ENSG00000111640       GAPDH 12.066545  81.29091    0.18709091  1.962487
## ENSG00000117632       STMN1  6.385636  69.52727    0.30472727  1.724736
##                     total      tech        bio      p.value
##                 <numeric> <numeric>  <numeric>    <numeric>
## ENSG00000090382  1.244370  0.402706 0.84166311 1.63770e-168
## ENSG00000122862  0.664105  0.416949 0.24715619  1.20285e-29
## ENSG00000177954  0.962502  0.956980 0.00552162  5.74505e-01
## ENSG00000206172  6.616669  1.177939 5.43873049  0.00000e+00
## ENSG00000011600  0.379195  0.187756 0.19143864  9.90534e-50
## ENSG00000051523  1.062865  1.008867 0.05399817  2.36426e-01
## ENSG00000100097  0.605734  0.301261 0.30447303  3.05336e-57
## ENSG00000101439  0.534936  0.333955 0.20098151  6.07451e-16
## ENSG00000111640  1.310790  1.080708 0.23008278  4.58288e-07
## ENSG00000117632  1.371312  0.909026 0.46228597  4.85326e-21

We use the Top field to identify a set of genes that is guaranteed to distinguish cluster 8 from any other cluster. Here, we examine the top 6 genes from each pairwise comparison. Some inspection of the most upregulated genes suggest that cluster 8 contains monocytes, based on the expression of CST3 and TYROBP.

c8_markers[c8_markers$Top <= 6,c(2,7:13)]

## DataFrame with 32 rows and 8 columns
##                      Symbol      mean  detected gene_sparsity    mean.1
##                 <character> <numeric> <numeric>     <numeric> <numeric>
## ENSG00000090382         LYZ  1.767455   9.05455    0.90945455  0.350429
## ENSG00000122862        SRGN  0.854727  24.45455    0.75545455  0.357208
## ENSG00000177954       RPS27 36.327273  99.32727    0.00672727  4.831956
## ENSG00000206172        HBA1 51.327273  45.29091    0.54709091  2.212123
## ENSG00000011600      TYROBP  0.340000   7.58182    0.92418182  0.158158
## ...                     ...       ...       ...           ...       ...
## ENSG00000223609         HBD  6.700182   23.1091      0.768909  1.105405
## ENSG00000147454    SLC25A37  1.304000   26.2364      0.737636  0.702754
## ENSG00000158869      FCER1G  0.234545    6.0000      0.940000  0.140600
## ENSG00000197747     S100A10  0.418909   16.2000      0.838000  0.218013
## ENSG00000204472        AIF1  1.486364   49.6000      0.504000  0.810801
##                     total      tech        bio
##                 <numeric> <numeric>  <numeric>
## ENSG00000090382  1.244370  0.402706 0.84166311
## ENSG00000122862  0.664105  0.416949 0.24715619
## ENSG00000177954  0.962502  0.956980 0.00552162
## ENSG00000206172  6.616669  1.177939 5.43873049
## ENSG00000011600  0.379195  0.187756 0.19143864
## ...                   ...       ...        ...
## ENSG00000223609  4.107386  0.696659  3.4107267
## ENSG00000147454  2.497676  0.643589  1.8540871
## ENSG00000158869  0.285596  0.168038  0.1175578
## ENSG00000197747  0.356233  0.262364  0.0938687
## ENSG00000204472  0.644693  0.570455  0.0742385

Each DataFrame also contains several other statistics that may be of interest. The summary.logFC field provides a convenient summary of the direction and effect size for each gene, and is defined here as the log-fold change from the comparison with the lowest p-value. The p.value field contains the combined p-value that is obtained by applying Simes’ method to the pairwise p-values for each gene and represents the evidence against the joint null hypothesis, i.e., that the gene is not DE between cluster 8 and any other cluster. Examination of these statistics permits a quick evaluation of the suitability of a candidate marker; if both of these metrics are poor (small log-fold change, large p-value), the gene can most likely be dismissed.

1.3.2 Tweaking differential expression analysis

1.3.2.1 Focusing on upregulated genes

Our previous findMarkers() call considers both up- and downregulated genes to be potential markers. However, downregulated genes are less appealing as markers as it is more difficult to interpret and experimentally validate an absence of expression. To focus on up-regulated markers, we can instead perform a one-sided t-test to identify genes that are upregulated in each cluster compared to the others. This is achieved by setting direction="up" in the findMarkers() call.

markers_up.out <- findMarkers(uncorrected, 
                           groups=clusters.mnn, 
                           block=uncorrected$SampleGroup,
                           row.data=rowData(uncorrected),
                           direction="up")

c8_up_markers <- markers_up.out[["c8"]]
c8_up_markers[c8_up_markers$Top <= 6,c(2,7:13)]

## DataFrame with 21 rows and 8 columns
##                      Symbol      mean  detected gene_sparsity    mean.1
##                 <character> <numeric> <numeric>     <numeric> <numeric>
## ENSG00000090382         LYZ  1.767455   9.05455     0.9094545  0.350429
## ENSG00000122862        SRGN  0.854727  24.45455     0.7554545  0.357208
## ENSG00000011600      TYROBP  0.340000   7.58182     0.9241818  0.158158
## ENSG00000051523        CYBA  2.613091  71.20000     0.2880000  1.353937
## ENSG00000087086         FTL  6.753273  92.03636     0.0796364  2.572797
## ...                     ...       ...       ...           ...       ...
## ENSG00000204472        AIF1  1.486364   49.6000      0.504000  0.810801
## ENSG00000130429      ARPC1B  1.016545   50.2364      0.497636  0.608580
## ENSG00000158869      FCER1G  0.234545    6.0000      0.940000  0.140600
## ENSG00000197747     S100A10  0.418909   16.2000      0.838000  0.218013
## ENSG00000204482        LST1  1.209091   42.0000      0.580000  0.662297
##                     total      tech        bio
##                 <numeric> <numeric>  <numeric>
## ENSG00000090382  1.244370  0.402706  0.8416631
## ENSG00000122862  0.664105  0.416949  0.2471562
## ENSG00000011600  0.379195  0.187756  0.1914386
## ENSG00000051523  1.062865  1.008867  0.0539982
## ENSG00000087086  1.779432  1.184492  0.5949399
## ...                   ...       ...        ...
## ENSG00000204472  0.644693  0.570455  0.0742385
## ENSG00000130429  0.557629  0.584380 -0.0267515
## ENSG00000158869  0.285596  0.168038  0.1175578
## ENSG00000197747  0.356233  0.262364  0.0938687
## ENSG00000204482  0.607357  0.467348  0.1400087

1.3.2.2 Using log-fold changes

The t-test also allows us to specify a non-zero log-fold change as the null hypothesis. This allows us to consider the magnitude of the log-fold change in our p-value calculations, in a manner that is more rigorous than simply filtering directly on the log-fold changes (McCarthy and Smyth 2009). (Specifically, a simple threshold does not consider the variance and can enrich for genes that have both large log-fold changes and large variances.) We perform this by setting lfc= in our findMarkers() call - when combined with direction=, this tests for genes with log-fold changes that are significantly greater than 1:

markers_up_lfc1.out <- findMarkers(uncorrected, 
                           groups=clusters.mnn, 
                           block=uncorrected$SampleGroup,
                           row.data=rowData(uncorrected),
                           direction="up",
                           lfc=1)

These two settings yield a more focused set of candidate marker genes that are upregulated in cluster 8.

c8_up_lfc1_markers <- markers_up_lfc1.out[["c8"]]
c8_up_lfc1_markers[c8_up_lfc1_markers$Top <= 6,c(2,7:13)]

## DataFrame with 14 rows and 8 columns
##                      Symbol      mean  detected gene_sparsity    mean.1
##                 <character> <numeric> <numeric>     <numeric> <numeric>
## ENSG00000090382         LYZ  1.767455   9.05455     0.9094545  0.350429
## ENSG00000122862        SRGN  0.854727  24.45455     0.7554545  0.357208
## ENSG00000034510      TMSB10 15.103636  97.01818     0.0298182  3.486114
## ENSG00000100097      LGALS1  0.772364  14.00000     0.8600000  0.249622
## ENSG00000101439        CST3  0.510182  15.09091     0.8490909  0.288454
## ...                     ...       ...       ...           ...       ...
## ENSG00000197956      S100A6   1.53164   38.4182     0.6158182  0.634139
## ENSG00000205542      TMSB4X  42.35255   98.3455     0.0165455  4.520266
## ENSG00000051523        CYBA   2.61309   71.2000     0.2880000  1.353937
## ENSG00000143546      S100A8   1.94164   11.2000     0.8880000  0.313754
## ENSG00000111640       GAPDH  12.06655   81.2909     0.1870909  1.962487
##                     total      tech       bio
##                 <numeric> <numeric> <numeric>
## ENSG00000090382  1.244370  0.402706  0.841663
## ENSG00000122862  0.664105  0.416949  0.247156
## ENSG00000034510  1.280803  1.107729  0.173074
## ENSG00000100097  0.605734  0.301261  0.304473
## ENSG00000101439  0.534936  0.333955  0.200982
## ...                   ...       ...       ...
## ENSG00000197956  0.985385  0.667978 0.3174070
## ENSG00000205542  1.467908  0.988859 0.4790488
## ENSG00000051523  1.062865  1.008867 0.0539982
## ENSG00000143546  1.085889  0.345872 0.7400168
## ENSG00000111640  1.310790  1.080708 0.2300828

Of course, this increased stringency is not without cost. If only upregulated genes are requested from findMarkers(), any cluster defined by downregulation of a marker gene will not contain that gene among the top set of features in its DataFrame. This is occasionally relevant for subtypes or other states that are distinguished by high versus low expression of particular genes. Similarly, setting an excessively high log-fold change threshold may discard otherwise useful genes. For example, a gene upregulated in a small proportion of cells of a cluster will have a small log-fold change but can still be an effective marker if the focus is on specificity rather than sensitivity.

1.3.2.3 Finding cluster-specific markers

By default, findMarkers() will give a high ranking to genes that are differentially expressed in any pairwise comparison. This is because a gene only needs a very low p-value in a single pairwise comparison to achieve a low Top value. A more stringent approach would only consider genes that are differentially expressed in all pairwise comparisons involving the cluster of interest. To achieve this, we set pval.type="all" in findMarkers() to use an intersection-union test (Berger and Hsu 1996) where the combined p-value for each gene is the maximum of the p-values from all pairwise comparisons. A gene will only achieve a low combined p-value if it is strongly DE in all comparisons to other clusters.

markers_up_pval_all.out <- findMarkers(uncorrected, 
                           groups=clusters.mnn, 
                           block=uncorrected$SampleGroup,
                           row.data=rowData(uncorrected),
                           direction="up",
                           pval.type="all")

c8_up_pval_all_markers <- markers_up_pval_all.out[["c8"]]
c8_up_pval_all_markers [1:10,c(2,7:13)]

## DataFrame with 10 rows and 8 columns
##                      Symbol      mean  detected gene_sparsity    mean.1
##                 <character> <numeric> <numeric>     <numeric> <numeric>
## ENSG00000090382         LYZ 1.7674545   9.05455      0.909455 0.3504292
## ENSG00000197766         CFD 0.0925455   3.47273      0.965273 0.0732890
## ENSG00000011600      TYROBP 0.3400000   7.58182      0.924182 0.1581577
## ENSG00000121552        CSTA 0.0730909   2.87273      0.971273 0.0535266
## ENSG00000085265        FCN1 0.1690909   3.81818      0.961818 0.0973019
## ENSG00000216490       IFI30 0.3390909  15.45455      0.845455 0.1498867
## ENSG00000143546      S100A8 1.9416364  11.20000      0.888000 0.3137540
## ENSG00000100300        TSPO 0.7583636  38.69091      0.613091 0.4674537
## ENSG00000100097      LGALS1 0.7723636  14.00000      0.860000 0.2496221
## ENSG00000163220      S100A9 1.7034545  11.69091      0.883091 0.3092163
##                     total      tech       bio
##                 <numeric> <numeric> <numeric>
## ENSG00000090382 1.2443695 0.4027064 0.8416631
## ENSG00000197766 0.1297696 0.0881955 0.0415741
## ENSG00000011600 0.3791949 0.1877563 0.1914386
## ENSG00000121552 0.0905776 0.0681162 0.0224614
## ENSG00000085265 0.2327031 0.1195794 0.1131237
## ENSG00000216490 0.2456569 0.1767180 0.0689389
## ENSG00000143546 1.0858886 0.3458718 0.7400168
## ENSG00000100300 0.5160654 0.4899852 0.0260803
## ENSG00000100097 0.6057340 0.3012609 0.3044730
## ENSG00000163220 1.0336349 0.3464199 0.6872150

1.3.3 Visualization of marker genes

1.3.3.1 Violin plot

plotExpression(uncorrected,
               x=I(factor(corrected$louvain)),
               features="ENSG00000101439", # "CST3",
               colour_by="SampleGroup") +
  facet_wrap(~colour_by)

1.3.3.2 Heatmap

Using log-fold changes As for bulk RNA, differences in expression profiles of the top genes can be visualised with a heatmap.

# select some top genes:
#c2_up_lfc1_markers <- markers_up_lfc1.out[["c2"]]

c8_top10 <- c8_up_lfc1_markers[c8_up_lfc1_markers$Top <= 10,]

c8_top10_logFCs <- getMarkerEffects(c8_top10)

c8_top10_logFCs.ens <- rownames(c8_top10_logFCs)
rownames(c8_top10_logFCs) <- rowData(uncorrected)[rownames(c8_top10_logFCs), "Symbol"]

pheatmap(c8_top10_logFCs, breaks=seq(-5, 5, length.out=101))

Using normalized counts

This requires some data wrangling and is not straightforward.

Click to expand!

# have matrix to annotate sample with cluster and sample:
tmpData <- logcounts(uncorrected)[rownames(c8_top10),]

# concat sample and barcode names to make unique name across the whole data set
tmpCellNames <- paste(colData(uncorrected)$SampleName, colData(uncorrected)$Barcode, sep="_")
# use these to namecolumn of matrix the show as heatmap:
colnames(tmpData) <- tmpCellNames # colData(sce)$Barcode         

# columns annotation with cell name:
mat_col <- data.frame(cluster = corrected$louvain,
              sample = uncorrected$SampleName,
              type = uncorrected$SampleGroup
        )
rownames(mat_col) <- colnames(tmpData)
rownames(mat_col) <- tmpCellNames # colData(sce)$Barcode

# Prepare colours for clusters:
colourCount = length(unique(corrected$louvain))
getPalette = colorRampPalette(brewer.pal(10, "Set3"))

mat_colors <- list(group = getPalette(colourCount))
names(mat_colors$group) <- unique(corrected$louvain)

breaksVec = seq(-5, 5, by = 0.1)

# order normalized expression matrix based on clustering 
tmpData<-tmpData[,rownames(mat_col[order(mat_col$cluster),])]

# plot heatmap:
pheatmap(tmpData,
           border_color      = NA,
           show_colnames     = FALSE,
           #show_rownames     = FALSE,
           show_rownames     = TRUE,
           drop_levels       = TRUE,
           labels_row        = rowData(uncorrected)[rownames(tmpData),"Symbol"],
           annotation_col    = mat_col,
           cluster_cols = FALSE, 
           annotation_colors = mat_colors,
           color             = colorRampPalette(
             rev(brewer.pal(n = 7,
                            name = "RdYlBu")))(length(breaksVec)),
           breaks            = breaksVec,
       fontsize_row      = 7
           )

1.3.3.3 tSNE and UMAP

We can use the corrected sce object for plotting expression of a gene on tSNE and UMAP plots. This can be more aesthetically pleasing than uncorrected expression values that may contain large shifts on the colour scale between cells in different batches. Use of the corrected values in any quantitative procedure should be treated with caution, and should be backed up by similar results from an analysis on the uncorrected values.

plotTSNE(corrected, colour_by="CST3", by_exprs_values = "reconstructed")

corrected<-runUMAP(corrected, dimred="corrected")
plotUMAP(corrected, colour_by="CST3", by_exprs_values = "reconstructed")

1.4 Alternative Testing Strategies

1.4.1 Using the Wilcoxon rank sum test

The Wilcoxon rank sum test (also known as the Wilcoxon-Mann-Whitney test, or WMW test) is another widely used method for pairwise comparisons between groups of observations. Its strength lies in the fact that it directly assesses separation between the expression distributions of different clusters. The WMW test statistic is proportional to the area-under-the-curve (AUC), i.e., the concordance probability, which is the probability of a random cell from one cluster having higher expression than a random cell from another cluster. In a pairwise comparison, AUCs of 1 or 0 indicate that the two clusters have perfectly separated expression distributions. Thus, the WMW test directly addresses the most desirable property of a candidate marker gene, while the t-test only does so indirectly via the difference in the means and the intra-group variance.

We perform WMW tests by again using the findMarkers() function, this time with test="wilcox". This returns a list of DataFrames containing ranked candidate markers for each cluster. The direction=, lfc= and pval.type= arguments can be specified and have the same interpretation as described for t-tests. We demonstrate below by detecting upregulated genes in each cluster with direction="up".

markers_wilcox.out <- findMarkers(uncorrected, 
                           groups=clusters.mnn, 
                           block=uncorrected$SampleGroup,
                           test="wilcox",
                           direction="up",
                           row.data=rowData(uncorrected))

To explore the results in more detail, we focus on the DataFrame for cluster 2. The interpretation of Top is the same as described for t-tests, and Simes’ method is again used to combine p-values across pairwise comparisons. If we want more focused sets, we can also change pval.type= as previously described.

c8_wilcox_markers <- markers_wilcox.out[["c8"]]
head(c8_wilcox_markers)

## DataFrame with 6 rows and 31 columns
##                              ID      Symbol            Type  Chromosome
##                     <character> <character>     <character> <character>
## ENSG00000090382 ENSG00000090382         LYZ Gene Expression          12
## ENSG00000163191 ENSG00000163191     S100A11 Gene Expression           1
## ENSG00000205542 ENSG00000205542      TMSB4X Gene Expression           X
## ENSG00000034510 ENSG00000034510      TMSB10 Gene Expression           2
## ENSG00000101439 ENSG00000101439        CST3 Gene Expression          20
## ENSG00000122862 ENSG00000122862        SRGN Gene Expression          10
##                      mean  detected      mean  detected gene_sparsity      mean
##                 <numeric> <numeric> <numeric> <numeric>     <numeric> <numeric>
## ENSG00000090382  1.767455   9.05455  1.767455   9.05455     0.9094545  0.350429
## ENSG00000163191  0.176182   7.09091  0.176182   7.09091     0.9290909  0.105951
## ENSG00000205542 42.352545  98.34545 42.352545  98.34545     0.0165455  4.520266
## ENSG00000034510 15.103636  97.01818 15.103636  97.01818     0.0298182  3.486114
## ENSG00000101439  0.510182  15.09091  0.510182  15.09091     0.8490909  0.288454
## ENSG00000122862  0.854727  24.45455  0.854727  24.45455     0.7554545  0.357208
##                     total      tech       bio      p.value          FDR
##                 <numeric> <numeric> <numeric>    <numeric>    <numeric>
## ENSG00000090382  1.244370  0.402706 0.8416631 1.63770e-168 8.67781e-166
## ENSG00000163191  0.184987  0.131218 0.0537686  4.18293e-87  8.91984e-85
## ENSG00000205542  1.467908  0.988859 0.4790488  4.73419e-18  1.34169e-16
## ENSG00000034510  1.280803  1.107729 0.1730743  4.67116e-02  1.61077e-01
## ENSG00000101439  0.534936  0.333955 0.2009815  6.07451e-16  1.51309e-14
## ENSG00000122862  0.664105  0.416949 0.2471562  1.20285e-29  5.94155e-28
##                                                                                                                      per.block
##                                                                                                                    <DataFrame>
## ENSG00000090382 0.00000000:0.0000000:0.00000000:...:0.0097064:0.0127698:0.0100244:...:0.00479604:0.00604511:0.00529751:...:...
## ENSG00000163191 0.00739892:0.0273720:0.00731773:...:0.0122463:0.0189306:0.0126474:...:0.07461745:0.09929373:0.08229103:...:...
## ENSG00000205542 5.32559825:0.3912276:0.40630953:...:4.9987165:0.3858118:0.4883540:...:4.44435249:1.44087374:1.01068640:...:...
## ENSG00000034510 3.98808164:0.5197594:0.44046639:...:3.6152638:0.5880477:0.6416103:...:3.30570708:1.15424752:1.00834955:...:...
## ENSG00000101439 0.20711905:0.1952032:0.19919373:...:0.2335597:0.2495610:0.2317871:...:0.07322534:0.07317892:0.08076006:...:...
## ENSG00000122862 0.02570220:0.0428098:0.02541181:...:0.0505055:0.0942981:0.0521411:...:0.22162743:0.29398924:0.23750108:...:...
##                 chosenHvg       Top      p.value         FDR summary.AUC
##                 <logical> <integer>    <numeric>   <numeric>   <numeric>
## ENSG00000090382      TRUE         1 2.04715e-102 3.62346e-98    0.992062
## ENSG00000163191      TRUE         1  4.12903e-62 3.48018e-59    0.831896
## ENSG00000205542      TRUE         1  9.76508e-49 3.67749e-46    0.966239
## ENSG00000034510      TRUE         2  3.85789e-39 9.10461e-37    0.958903
## ENSG00000101439      TRUE         2 3.09093e-100 2.73548e-96    0.957929
## ENSG00000122862      TRUE         2  8.92254e-74 1.57929e-70    0.941599
##                    AUC.c1   AUC.c10   AUC.c11    AUC.c2    AUC.c3    AUC.c4
##                 <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSG00000090382  0.991700  0.987452  0.990860  0.992331  0.995435  0.993208
## ENSG00000163191  0.818706  0.808736  0.832479  0.855317  0.895014  0.840087
## ENSG00000205542  0.950197  0.540314  0.794249  0.943456  0.925913  0.957344
## ENSG00000034510  0.958903  0.855671  0.771344  0.825562  0.893258  0.920626
## ENSG00000101439  0.904469  0.953412  0.958872  0.947904  0.971208  0.955383
## ENSG00000122862  0.910036  0.801394  0.956559  0.950000  0.878862  0.958508
##                    AUC.c5    AUC.c6    AUC.c7    AUC.c9
##                 <numeric> <numeric> <numeric> <numeric>
## ENSG00000090382  0.992206  0.992062  0.991065  0.965150
## ENSG00000163191  0.827618  0.831896  0.843432  0.831546
## ENSG00000205542  0.966239  0.589537  0.999023  0.958362
## ENSG00000034510  0.820001  0.601207  0.989320  0.936008
## ENSG00000101439  0.946241  0.957929  0.945728  0.909439
## ENSG00000122862  0.965726  0.941599  0.963807  0.930193

The DataFrame contains the AUCs from comparing cluster 2 to every other cluster. A value greater than 0.5 indicates that the gene is upregulated in the current cluster compared to the other cluster, while values less than 0.5 correspond to downregulation. We would typically expect AUCs of 0.7-0.8 for a strongly upregulated candidate marker.

c8_AUCs <- getMarkerEffects(c8_wilcox_markers[c8_wilcox_markers$Top<=5,], prefix="AUC")

rownames(c8_AUCs) <- rowData(uncorrected)[rownames(c8_AUCs), "Symbol"]

pheatmap(c8_AUCs,
     breaks = seq(0, 1, length.out=21),
     color = viridis::viridis(21))

1.5 Using a binomial test

The binomial test identifies genes that differ in the proportion of expressing cells between clusters. (For the purposes of this section, a cell is considered to express a gene simply if it has non-zero expression for that gene.) This represents a much more stringent definition of marker genes compared to the other methods, as differences in expression between clusters are effectively ignored if both distributions of expression values are not near zero. The premise is that genes are more likely to contribute to important biological decisions if they were active in one cluster and silent in another, compared to more subtle “tuning” effects from changing the expression of an active gene. From a practical perspective, a binary measure of presence/absence is easier to validate.

We perform pairwise binomial tests between clusters using the findMarkers() function with test="binom". This returns a list of DataFrames containing marker statistics for each cluster such as the Top rank and its p-value. Here, the effect size is reported as the log-fold change in this proportion between each pair of clusters. Large positive log-fold changes indicate that the gene is more frequently expressed in one cluster compared to the other. We focus on genes that are upregulated in each cluster compared to the others by setting direction="up".

markers_binom.out <- findMarkers(uncorrected, 
                           groups=clusters.mnn, 
                           block=uncorrected$SampleGroup,
                           test="binom",
                           direction="up",
                           row.data=rowData(uncorrected))

c8_binom_markers <- markers_binom.out[["c8"]]

The plot below confirms that the top genes exhibit strong differences in the proportion of expressing cells in cluster 8 compared to the others.

top.genes <- head(rownames(c8_binom_markers))
#plotExpression(sce, x="clusterStg", features=top.genes)

plotExpression(uncorrected, x=I(factor(corrected$louvain)),
               colour_by=I(factor(corrected$louvain)),
               features=top.genes[1:4] )

1.6 Combining multiple marker statistics

On occasion, we might want to combine marker statistics from several testing regimes into a single DataFrame. This allows us to easily inspect multiple statistics at once to verify that a particular gene is a strong candidate marker. For example, a large AUC from the WMW test indicates that the expression distributions are well-separated between clusters, while the log-fold change reported with the t-test provides a more interpretable measure of the magnitude of the change in expression. We use the multiMarkerStats() to merge the results of separate findMarkers() calls into one DataFrame per cluster, with statistics interleaved to facilitate a direct comparison between different test regimes.

combined <- multiMarkerStats(
    t=findMarkers(uncorrected, groups=clusters.mnn, direction="up", block=uncorrected$SampleGroup),
    wilcox=findMarkers(uncorrected, groups=clusters.mnn, test="wilcox", direction="up", block=uncorrected$SampleGroup),
    binom=findMarkers(uncorrected, groups=clusters.mnn, test="binom", direction="up", block=uncorrected$SampleGroup)
)

In addition, multiMarkerStats() will compute a number of new statistics by combining the per-regime statistics. The combined Top value is obtained by simply taking the largest Top value across all tests for a given gene, while the reported p.value is obtained by taking the largest p-value. Ranking on either metric focuses on genes with robust differences that are highly ranked and detected by each of the individual testing regimes. Of course, this might be considered an overly conservative approach in practice, so it is entirely permissible to re-rank the DataFrame according to the Top or p.value for an individual regime (effectively limiting the use of the other regimes’ statistics to diagnostics only).

head(combined[["c8"]])[,1:9]

## DataFrame with 6 rows and 9 columns
##                       Top     p.value         FDR     t.Top wilcox.Top
##                 <integer>   <numeric>   <numeric> <integer>  <integer>
## ENSG00000101439         2 5.57951e-64 9.87572e-60         2          2
## ENSG00000100097         3 8.14657e-59 7.20971e-55         2          3
## ENSG00000090382         4 5.84449e-53 3.44825e-49         1          1
## ENSG00000011600         4 1.75971e-50 6.22938e-47         2          4
## ENSG00000122862         5 5.29881e-34 6.69921e-31         1          2
## ENSG00000204472         5 2.58616e-35 3.81459e-32         5          4
##                 binom.Top    t.p.value wilcox.p.value binom.p.value
##                 <integer>    <numeric>      <numeric>     <numeric>
## ENSG00000101439         1  1.83702e-70   3.09093e-100   5.57951e-64
## ENSG00000100097         3  7.37835e-73    2.71831e-97   8.14657e-59
## ENSG00000090382         4 2.39458e-105   2.04715e-102   5.84449e-53
## ENSG00000011600         1  1.75971e-50    6.99064e-95   5.28540e-68
## ENSG00000122862         5  1.62597e-73    8.92254e-74   5.29881e-34
## ENSG00000204472         3  2.56540e-50    1.01667e-56   2.58616e-35

2 Exercise

Find marker genes for cluster 11. Find the challenge markdown for this section in the course materials folder.

Visualize one of the top marker genes in a violin and tSNE plot.
Take top 5 genes from each pairwise comparison and create a heatmap (hint: use the Top field)

3 Extra Material

Click to reveal

3.1 Invalidity of p-values

3.1.1 From data snooping

All of our DE strategies for detecting marker genes between clusters are statistically flawed to some extent. The DE analysis is performed on the same data used to obtain the clusters, which represents “data dredging” (also known as fishing or data snooping). The hypothesis of interest - are there differences between clusters? - is formulated from the data, so we are more likely to get a positive result when we re-use the data set to test that hypothesis.

The practical effect of data dredging is best illustrated with a simple simulation. We simulate i.i.d. normal values, perform k-means clustering and test for DE between clusters of cells with findMarkers(). The resulting distribution of p-values is heavily skewed towards low values. Thus, we can detect “significant” differences between clusters even in the absence of any real substructure in the data. This effect arises from the fact that clustering, by definition, yields groups of cells that are separated in expression space. Testing for DE genes between clusters will inevitably yield some significant results as that is how the clusters were defined.

Distribution of \(p\)-values from a DE analysis between two clusters in a simulation with no true subpopulation structure:

library(scran)
set.seed(0)
y <- matrix(rnorm(100000), ncol=200)
clusters <- kmeans(t(y), centers=2)$cluster
out <- findMarkers(y, clusters)
hist(out[[1]]$p.value, col="grey80", xlab="p-value")

For marker gene detection, this effect is largely harmless as the p-values are used only for ranking. However, it becomes an issue when the p-values are used to define “significant differences” between clusters with respect to an error rate threshold. Meaningful interpretation of error rates require consideration of the long-run behavior, i.e., the rate of incorrect rejections if the experiment were repeated many times. The concept of statistical significance for differences between clusters is not applicable if clusters and their interpretations are not stably reproducible across (hypothetical) replicate experiments.

3.1.2 Nature of replication

The naive application of DE analysis methods will treat counts from the same cluster of cells as replicate observations. This is not the most relevant level of replication when cells are derived from the same biological sample (i.e., cell culture, animal or patient). DE analyses that treat cells as replicates fail to properly model the sample-to-sample variability (Lun and Marioni 2017). The latter is arguably the more important level of replication as different samples will necessarily be generated if the experiment is to be replicated. Indeed, the use of cells as replicates only masks the fact that the sample size is actually one in an experiment involving a single biological sample. This reinforces the inappropriateness of using the marker gene p-values to perform statistical inference.

“We strongly recommend selecting some markers for use in validation studies with an independent replicate population of cells. A typical strategy is to identify a corresponding subset of cells that express the upregulated markers and do not express the downregulated markers. Ideally, a different technique for quantifying expression would also be used during validation, e.g., fluorescent in situ hybridisation or quantitative PCR. This confirms that the subpopulation genuinely exists and is not an artifact of the scRNA-seq protocol or the computational analysis.”

See the OSCA chapter on Marker gene detection

3.2 Session information

sessionInfo()

## R version 4.1.0 (2021-05-18)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C         LC_TIME=C           
##  [4] LC_COLLATE=C         LC_MONETARY=C        LC_MESSAGES=C       
##  [7] LC_PAPER=C           LC_NAME=C            LC_ADDRESS=C        
## [10] LC_TELEPHONE=C       LC_MEASUREMENT=C     LC_IDENTIFICATION=C 
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] glue_1.4.2                  pheatmap_1.0.12            
##  [3] RColorBrewer_1.1-2          dplyr_1.0.7                
##  [5] scran_1.20.1                scater_1.20.1              
##  [7] scuttle_1.2.0               SingleCellExperiment_1.14.1
##  [9] SummarizedExperiment_1.22.0 Biobase_2.52.0             
## [11] GenomicRanges_1.44.0        GenomeInfoDb_1.28.1        
## [13] IRanges_2.26.0              S4Vectors_0.30.0           
## [15] BiocGenerics_0.38.0         MatrixGenerics_1.4.0       
## [17] matrixStats_0.59.0          ggplot2_3.3.5              
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-7              tools_4.1.0              
##  [3] bslib_0.2.5.1             utf8_1.2.1               
##  [5] R6_2.5.0                  irlba_2.3.3              
##  [7] vipor_0.4.5               uwot_0.1.10              
##  [9] DBI_1.1.1                 colorspace_2.0-2         
## [11] withr_2.4.2               tidyselect_1.1.1         
## [13] gridExtra_2.3             compiler_4.1.0           
## [15] BiocNeighbors_1.10.0      DelayedArray_0.18.0      
## [17] labeling_0.4.2            sass_0.4.0               
## [19] scales_1.1.1              stringr_1.4.0            
## [21] digest_0.6.27             rmarkdown_2.9            
## [23] XVector_0.32.0            pkgconfig_2.0.3          
## [25] htmltools_0.5.1.1         sparseMatrixStats_1.4.0  
## [27] highr_0.9                 limma_3.48.1             
## [29] rlang_0.4.11              FNN_1.1.3                
## [31] DelayedMatrixStats_1.14.0 jquerylib_0.1.4          
## [33] generics_0.1.0            farver_2.1.0             
## [35] jsonlite_1.7.2            BiocParallel_1.26.1      
## [37] RCurl_1.98-1.3            magrittr_2.0.1           
## [39] BiocSingular_1.8.1        GenomeInfoDbData_1.2.6   
## [41] Matrix_1.3-4              Rcpp_1.0.7               
## [43] ggbeeswarm_0.6.0          munsell_0.5.0            
## [45] fansi_0.5.0               viridis_0.6.1            
## [47] lifecycle_1.0.0           stringi_1.7.3            
## [49] yaml_2.2.1                edgeR_3.34.0             
## [51] zlibbioc_1.38.0           grid_4.1.0               
## [53] dqrng_0.3.0               crayon_1.4.1             
## [55] lattice_0.20-44           cowplot_1.1.1            
## [57] beachmat_2.8.0            locfit_1.5-9.4           
## [59] metapod_1.0.0             knitr_1.33               
## [61] pillar_1.6.1              igraph_1.2.6             
## [63] ScaledMatrix_1.0.0        evaluate_0.14            
## [65] vctrs_0.3.8               gtable_0.3.0             
## [67] purrr_0.3.4               assertthat_0.2.1         
## [69] xfun_0.24                 rsvd_1.0.5               
## [71] RSpectra_0.16-0           viridisLite_0.4.0        
## [73] tibble_3.1.2              beeswarm_0.4.0           
## [75] cluster_2.1.2             bluster_1.2.1            
## [77] statmod_1.4.36            ellipsis_0.3.2

CRUK CI Summer School 2021 - Introduction to single-cell RNA-seq analysis

Cluster marker genes

Stephane Ballereau, Zeynep Kalender Atak