Load data
Assess cluster behaviour metrics
Visualise the clusters using tSNE plots
Cluster-wise silhouette widths
- Leiden k = 60
B Cell marker genes
A look at the three louvain clusterings with Cluster Tree
Comparing two sets of clusters
Cluster marker gene detection
- Session information

Load data

Cluster sweep has already been run on the QC’d, filtered, normalised and batch corrected Caron data. Please see the script ClusterSweep.R for details.

Here the clusterSweep results object, cluster behaviour metrics and the SCE object containing corrected data and the clusters are loaded.

out <- readRDS("../Robjects/clusterSweep.out.rds")
df <- read_tsv("../Robjects/clusterSweep.metrics_df.tsv")

## Rows: 36 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (1): cluster.fun
## dbl (4): k, num.clusters, silhouette, wcss
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

sce <- readRDS("../Robjects/clusterSweep.sce.rds")

Cluster sweep was run with the following combinations of parameters:

out$parameters %>%
  as.data.frame() %>% 
  select(`Cluster function`=cluster.fun, k) %>% 
  rownames_to_column("colData column") %>% 
  datatable(rownames=FALSE)

Assess cluster behaviour metrics

We will consider the number of clusters, the mean silhouette width and the sum of the within-cluster sum of squares as an initial assessment of cluster behaviour. To assess the behaviour of the clusterings we can plot these metrics against the k.

nclPlot <- ggplot(df, aes(x = k, y = num.clusters)) + 
                  geom_line(aes(colour=cluster.fun), lwd=2)

silPlot <- ggplot(df, aes(x = k, y = silhouette)) + 
                  geom_line(aes(colour=cluster.fun), lwd=2)

wcssPlot <- ggplot(df, aes(x = k, y = wcss)) + 
                  geom_line(aes(colour=cluster.fun), lwd=2)

nclPlot + silPlot + wcssPlot + plot_layout(guides = "collect")

df %>%
  filter(num.clusters>=10 & num.clusters<=15)

## # A tibble: 6 × 5
##       k cluster.fun num.clusters silhouette  wcss
##   <dbl> <chr>              <dbl>      <dbl> <dbl>
## 1    60 leiden                12      0.234 1207.
## 2    10 louvain               15      0.172 1065.
## 3    15 louvain               14      0.173 1078.
## 4    20 louvain               11      0.178 1180.
## 5    25 louvain               11      0.174 1181.
## 6    30 louvain               10      0.185 1275.

Visualise the clusters using tSNE plots

We can now visualise our selected clusterings of interest on tSNE plots.

plotTSNE(sce, colour_by = "k.60_cluster.fun.leiden")

plotTSNE(sce, colour_by = "k.15_cluster.fun.louvain")

plotTSNE(sce, colour_by = "k.20_cluster.fun.louvain")

plotTSNE(sce, colour_by = "k.25_cluster.fun.louvain")

table(sce$k.60_cluster.fun.leiden)

## 
##     1     2     3     4     5     6     7     8     9    10    11    12 
## 12570  3789  5600   407  2581  1696   988  1230     7   168   333   165

table(sce$k.15_cluster.fun.louvain)

## 
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14 
## 2028 2696 3605 2260 4079 4157  723 2570 1635 1754 1428 1265 1221  113

table(sce$k.20_cluster.fun.louvain)

## 
##    1    2    3    4    5    6    7    8    9   10   11 
## 2025 4896 4374 4205 5523 2576 1630 1387 1501 1225  192

table(sce$k.25_cluster.fun.louvain)

## 
##    1    2    3    4    5    6    7    8    9   10   11 
## 2015 4946 3793 6093 4221 1786 1613 1370 1297  111 2289

Cluster-wise silhouette widths

plotSilhouette <- function(clusters, plotTitle){
  approxSilhouette(reducedDim(sce, "corrected"), 
                                clusters=clusters) %>% 
  as.data.frame() %>% 
  mutate(closestCluster = ifelse(width > 0, cluster, other) %>% factor()) %>%
  ggplot(aes(x=cluster, y=width, colour=closestCluster)) +
    ggbeeswarm::geom_quasirandom(method="smiley") +
    ggtitle(plotTitle)
}

Leiden k = 60

plotSilhouette(sce$k.60_cluster.fun.leiden, "Leiden k=60")

plotSilhouette(sce$k.15_cluster.fun.louvain, "Louvain k=15")

plotSilhouette(sce$k.20_cluster.fun.louvain, "Louvain k=20")

plotSilhouette(sce$k.25_cluster.fun.louvain, "Louvain k=25")

B Cell marker genes

rownames(sce) <- uniquifyFeatureNames(rowData(sce)$ID, rowData(sce)$Symbol)

p1 <- plotTSNE(sce, by_exprs_values = "reconstructed",
               colour_by = "MS4A1",
               text_by = "k.60_cluster.fun.leiden")
p2 <- plotTSNE(sce, by_exprs_values = "reconstructed"
               , colour_by = "CD79A",
               text_by = "k.60_cluster.fun.leiden")
p3 <- plotTSNE(sce, by_exprs_values = "reconstructed"
               , colour_by = "TNFRSF13C",
               text_by = "k.60_cluster.fun.leiden")
p4 <- plotTSNE(sce, by_exprs_values = "reconstructed"
               , colour_by = "BCL6",
               text_by = "k.60_cluster.fun.leiden")
(p1 + p2) / (p3 + p4)

A look at the three louvain clusterings with Cluster Tree

We can look at how cells move between clusters using clustree

combined <- cbind(k.15=sce$k.15_cluster.fun.louvain,
                  k.20=sce$k.20_cluster.fun.louvain,
                  k.25=sce$k.25_cluster.fun.louvain)

library(clustree)

## Loading required package: ggraph

set.seed(1111)
clustree(combined, prefix="k.", edge_arrow=FALSE)

## Warning: The `add` argument of `group_by()` is deprecated as of dplyr 1.0.0.
## Please use the `.add` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.

Comparing two sets of clusters

jacc.mat <- linkClustersMatrix(sce$k.20_cluster.fun.louvain, sce$k.60_cluster.fun.leiden)
rownames(jacc.mat) <- paste("Louvain", rownames(jacc.mat))
colnames(jacc.mat) <- paste("Leiden", colnames(jacc.mat))
pheatmap(jacc.mat, color=viridis::viridis(100), cluster_cols=FALSE, cluster_rows=FALSE)

plotTSNE(sce, 
         colour_by = "k.50_cluster.fun.leiden",
         text_by = "k.50_cluster.fun.leiden")

plotSilhouette(sce$k.50_cluster.fun.leiden, "Leiden k=50")

out.leiden <- clusterSweep(reducedDim(sce, "corrected"),
                    BLUSPARAM = NNGraphParam(),
                    k = as.integer(51:59),
                    cluster.fun = "leiden",
                    BPPARAM=BiocParallel::MulticoreParam(5))
saveRDS(out.leiden, "../Robjects/clusterSweep.out.leiden.rds")

out.leiden <- readRDS("../Robjects/clusterSweep.out.leiden.rds")
colData(sce) <- cbind(colData(sce), DataFrame(out.leiden$clusters))

plotTSNE(sce, colour_by = "k.52_cluster.fun.leiden")

plotTSNE(sce, colour_by = "k.54_cluster.fun.leiden")

plotTSNE(sce, colour_by = "k.56_cluster.fun.leiden")

plotTSNE(sce, colour_by = "k.57_cluster.fun.leiden")

plotTSNE(sce, colour_by = "k.58_cluster.fun.leiden")

I am going to move forward with Leiden & k=57 for now.

colLabels(sce) <- sce$k.57_cluster.fun.leiden

Cluster marker gene detection

First we need to load the uncorrected data set. We’ll start from the object Caron_filtered_genes.rds. This contains all of the samples from the complete data set, so we’ll to need to subset that to the just the 7 Caron samples we’re using here. Also, there are additional genes - these were filtered out after subsetting to the 7 samples based on expression - so we’ll need to subset the genes as well.

uncorrected <- readRDS("../Robjects/Caron_filtered_genes.rds")

# subset samples and genes to match current data set
samples <- colData(uncorrected)$Sample %in% colData(sce)$SampleName
genes <- rowData(sce)$ID

uncorrected <- uncorrected[genes, samples]

# Normalisation
clust <- quickCluster(uncorrected)
uncorrected <- computePooledFactors(uncorrected, cluster=clust, min.mean=0.1)
uncorrected <- logNormCounts(uncorrected)

colnames(colData(uncorrected))[1] <- "SampleName"

saveRDS(uncorrected, "../Robjects/LogNormalised.AllCells.Rds")

uncorrected <- readRDS("../Robjects/LogNormalised.AllCells.Rds")

We’ll switch the row names to be the gene symbols.

rownames(uncorrected) <- uniquifyFeatureNames(rowData(uncorrected)$ID,
                                              rowData(uncorrected)$Symbol)

Just check everything matches.

all(colnames(sce)==colnames(uncorrected))

## [1] TRUE

all(rownames(sce)==rownames(uncorrected))

## [1] TRUE

Add the clusters and then run scoreMarkers.

colLabels(uncorrected) <- colLabels(sce)
markers <- scoreMarkers(uncorrected, 
                        groups = factor(uncorrected$label), 
                        block =uncorrected$SampleName)

I asked you to identify this cluster:

Which is made up of clusters 3 and 7.

plotTSNE(sce, colour_by = "label", text_by = "label")

Let’s have a look at the top marker genes for cluster 3:

topRanked <- markers[["3"]] %>%
  as.data.frame() %>% 
  select(contains("cohen")) %>% 
  filter(rank.logFC.cohen <= 5) %>% 
  arrange(desc(mean.logFC.cohen))
topRanked

##        mean.logFC.cohen min.logFC.cohen median.logFC.cohen max.logFC.cohen
## RPS27          2.572383      0.73811891          2.4678557        4.425421
## RPS14          2.504531      0.91800401          1.6680303       10.263202
## RPL34          2.226986      1.20387675          2.0874785        3.628712
## RPS28          2.223351      0.81957934          1.5530260       10.649329
## RPS29          2.184457      1.23844928          1.8699763        3.488855
## RPL13          2.178147      1.12524233          2.0910725        3.304402
## RPL32          2.108779      0.99367983          1.9071179        3.633007
## RPL21          2.095168      0.64562257          1.9648405        3.668052
## TMSB4X         2.017420     -0.33431776          1.9750941        5.546751
## RPL30          1.992500      0.72713360          1.4985417        7.335081
## RPL10          1.967514      0.87392734          1.9332442        3.068205
## RPL17          1.930607      0.29472097          1.4880187        7.734110
## RPS18          1.897726      0.12548850          1.9047422        3.231535
## RPS6           1.853538      0.34703458          1.9948865        2.868671
## RPS23          1.742300     -0.29760012          1.3556752        7.649132
## RPL23A         1.718174      0.62701569          1.9031250        2.672944
## RPL3           1.673382      0.28363940          1.9353661        2.588791
## CD3D           1.576817      0.08674796          1.8876481        2.029542
## TMSB10         1.546154     -0.21179328          1.3563138        4.806533
## TRAC           1.543701      0.14978159          1.6722240        2.191058
## IFITM1         1.513202      0.01348136          1.5324712        2.363521
## CD3E           1.460382      0.20038530          1.5835656        1.992863
## IL32           1.411235     -0.61452751          1.6330326        1.998481
## CD52           1.283165     -0.65492626          1.3122820        2.424752
## LDHB           1.190076     -0.18989791          1.2109401        2.080541
## MALAT1         1.182334     -2.69269597          1.2913541        2.578865
## B2M            1.170893     -0.95805287          0.9815617        3.075659
##        rank.logFC.cohen
## RPS27                 1
## RPS14                 2
## RPL34                 1
## RPS28                 1
## RPS29                 2
## RPL13                 1
## RPL32                 2
## RPL21                 1
## TMSB4X                1
## RPL30                 5
## RPL10                 4
## RPL17                 3
## RPS18                 3
## RPS6                  5
## RPS23                 4
## RPL23A                5
## RPL3                  3
## CD3D                  1
## TMSB10                2
## TRAC                  4
## IFITM1                2
## CD3E                  3
## IL32                  2
## CD52                  4
## LDHB                  5
## MALAT1                4
## B2M                   3

There’s lots of ribosomal genes at the top of this list. Let’s check the expression of a few:

plotExpression(uncorrected, 
               features = "RPS27", 
               x = "label")

plotExpression(uncorrected, 
               features = "RPL34", 
               x = "label")

plotExpression(uncorrected, 
               features = "RPL21", 
               x = "label")

These aren’t really going to be useful as marker genes.

On the other hand, further down we do have CD3D, CD3E and TRAC, which are T-cell markers, and IL32, which is an indicator of T-cell activation.

Let’s take a look at these:

p1 <- plotTSNE(sce, by_exprs_values = "reconstructed",
               colour_by = "CD3D",
               text_by = "label")
p2 <- plotTSNE(sce, by_exprs_values = "reconstructed",
               colour_by = "CD3E",
               text_by = "label")
p3 <- plotTSNE(sce, by_exprs_values = "reconstructed",
               colour_by = "TRAC",
               text_by = "label")
p4 <- plotTSNE(sce, by_exprs_values = "reconstructed",
               colour_by = "IL32",
               text_by = "label")
(p1 + p2) / (p3 + p4)

p1 <- plotExpression(uncorrected, 
               features = "CD3D", 
               x = "label")
p2 <- plotExpression(uncorrected, 
               features = "CD3E", 
               x = "label")
p3 <- plotExpression(uncorrected, 
               features = "TRAC", 
               x = "label")
p4 <- plotExpression(uncorrected, 
               features = "IL32", 
               x = "label")
(p1 + p2) / (p3 + p4)

Let’s take look at cluster 7’s marker genes.

topRanked <- markers[["7"]] %>%
  as.data.frame() %>% 
  select(contains("cohen")) %>% 
  filter(rank.logFC.cohen <= 5) %>% 
  arrange(desc(mean.logFC.cohen))
topRanked

##        mean.logFC.cohen min.logFC.cohen median.logFC.cohen max.logFC.cohen
## CCL5          2.2075687      1.44465697        2.254568824        2.642929
## IL32          2.0232574      0.61452751        2.189260723        2.654019
## TMSB4X        1.9586329     -0.37143535        1.831034175        5.495243
## B2M           1.9346700      0.26294176        1.743716089        3.823616
## NKG7          1.8215103      1.01236401        1.847630298        2.087532
## GZMA          1.6038448      1.13438673        1.643564267        1.820297
## HCST          1.5463851      0.63486712        1.748000251        2.202524
## IFITM1        1.4361641     -0.01348136        1.357749356        2.941717
## HLA-A         1.4265400      0.57382122        1.163071797        2.473989
## HLA-B         1.3934305      0.06325969        1.362796204        3.024164
## CST7          1.3691316      0.80135676        1.437293770        1.574874
## CD3D          1.3610269     -0.08674796        1.618695713        2.007717
## SRGN          1.2375019     -1.29274016        1.577943992        2.126550
## HLA-C         1.2252207      0.25916376        1.210141161        2.146057
## HLA-E         1.1453457      0.17720057        0.870915825        2.457511
## PTPRC         1.0667799     -0.82358684        1.330497134        2.243389
## CD69          1.0574469      0.08815270        1.073796927        1.906044
## MALAT1        0.7207896     -2.35712611        0.887094565        2.180593
## TMSB10        0.4218817     -1.33358545        0.394393791        2.782997
## RPS28         0.3181123     -1.57334610       -0.026046316        5.403688
## RPL30         0.2724373     -1.45985460        0.225232975        4.133590
## RPL17         0.2316713     -1.46855799        0.003035691        4.209861
## RPS14         0.2241283     -1.85091254       -0.034933131        4.694910
## RPL37A       -0.1104011     -2.04367468       -0.349667710        3.723050
##        rank.logFC.cohen
## CCL5                  1
## IL32                  2
## TMSB4X                1
## B2M                   1
## NKG7                  2
## GZMA                  3
## HCST                  4
## IFITM1                1
## HLA-A                 2
## HLA-B                 2
## CST7                  4
## CD3D                  5
## SRGN                  5
## HLA-C                 5
## HLA-E                 4
## PTPRC                 3
## CD69                  3
## MALAT1                4
## TMSB10                4
## RPS28                 1
## RPL30                 4
## RPL17                 3
## RPS14                 2
## RPL37A                5

IL32, NKG7, GZMA and CD69 might suggest that these are activated T-cells.

p1 <- plotTSNE(sce, by_exprs_values = "reconstructed",
               colour_by = "IL32",
               text_by = "label")

p2 <- plotTSNE(sce, by_exprs_values = "reconstructed",
               colour_by = "NKG7",
               text_by = "label")
p3 <- plotTSNE(sce, by_exprs_values = "reconstructed",
               colour_by = "GZMA",
               text_by = "label")
p4 <- plotTSNE(sce, by_exprs_values = "reconstructed",
               colour_by = "CD69",
               text_by = "label")
(p1 + p2) / (p3 + p4)

Session information

sessionInfo()

## R version 4.1.3 (2022-03-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.6 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
## 
## locale:
##  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
##  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
##  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] clustree_0.4.4              ggraph_2.0.5               
##  [3] DT_0.21                     forcats_0.5.1              
##  [5] stringr_1.4.0               dplyr_1.0.8                
##  [7] purrr_0.3.4                 readr_2.1.2                
##  [9] tidyr_1.2.0                 tibble_3.1.6               
## [11] tidyverse_1.3.1             patchwork_1.1.1            
## [13] pheatmap_1.0.12             igraph_1.2.11              
## [15] cluster_2.1.3               bluster_1.4.0              
## [17] scran_1.22.1                scater_1.22.0              
## [19] ggplot2_3.3.5               scuttle_1.4.0              
## [21] SingleCellExperiment_1.16.0 SummarizedExperiment_1.24.0
## [23] Biobase_2.54.0              GenomicRanges_1.46.1       
## [25] GenomeInfoDb_1.30.1         IRanges_2.28.0             
## [27] S4Vectors_0.32.3            BiocGenerics_0.40.0        
## [29] MatrixGenerics_1.6.0        matrixStats_0.61.0         
## 
## loaded via a namespace (and not attached):
##   [1] ggbeeswarm_0.6.0          colorspace_2.0-3         
##   [3] ellipsis_0.3.2            XVector_0.34.0           
##   [5] BiocNeighbors_1.12.0      fs_1.5.2                 
##   [7] rstudioapi_0.13           farver_2.1.0             
##   [9] graphlayouts_0.8.0        bit64_4.0.5              
##  [11] ggrepel_0.9.1             fansi_1.0.2              
##  [13] lubridate_1.8.0           xml2_1.3.3               
##  [15] sparseMatrixStats_1.6.0   knitr_1.39               
##  [17] polyclip_1.10-0           jsonlite_1.8.0           
##  [19] broom_0.7.12              dbplyr_2.1.1             
##  [21] ggforce_0.3.3             compiler_4.1.3           
##  [23] httr_1.4.2                dqrng_0.3.0              
##  [25] backports_1.4.1           assertthat_0.2.1         
##  [27] Matrix_1.4-0              fastmap_1.1.0            
##  [29] limma_3.50.1              cli_3.2.0                
##  [31] tweenr_1.0.2              BiocSingular_1.10.0      
##  [33] htmltools_0.5.2           tools_4.1.3              
##  [35] rsvd_1.0.5                gtable_0.3.0             
##  [37] glue_1.6.2                GenomeInfoDbData_1.2.7   
##  [39] Rcpp_1.0.8                cellranger_1.1.0         
##  [41] jquerylib_0.1.4           vctrs_0.3.8              
##  [43] crosstalk_1.2.0           DelayedMatrixStats_1.16.0
##  [45] xfun_0.29                 beachmat_2.10.0          
##  [47] rvest_1.0.2               lifecycle_1.0.1          
##  [49] irlba_2.3.5               statmod_1.4.36           
##  [51] edgeR_3.36.0              MASS_7.3-56              
##  [53] zlibbioc_1.40.0           scales_1.1.1             
##  [55] tidygraph_1.2.0           vroom_1.5.7              
##  [57] hms_1.1.1                 parallel_4.1.3           
##  [59] RColorBrewer_1.1-2        yaml_2.3.5               
##  [61] gridExtra_2.3             sass_0.4.0               
##  [63] stringi_1.7.6             highr_0.9                
##  [65] ScaledMatrix_1.2.0        checkmate_2.0.0          
##  [67] BiocParallel_1.28.3       rlang_1.0.1              
##  [69] pkgconfig_2.0.3           bitops_1.0-7             
##  [71] evaluate_0.15             lattice_0.20-45          
##  [73] labeling_0.4.2            htmlwidgets_1.5.4        
##  [75] cowplot_1.1.1             bit_4.0.4                
##  [77] tidyselect_1.1.2          magrittr_2.0.2           
##  [79] R6_2.5.1                  generics_0.1.2           
##  [81] metapod_1.2.0             DelayedArray_0.20.0      
##  [83] DBI_1.1.2                 pillar_1.7.0             
##  [85] haven_2.4.3               withr_2.4.3              
##  [87] RCurl_1.98-1.6            modelr_0.1.8             
##  [89] crayon_1.5.0              utf8_1.2.2               
##  [91] tzdb_0.2.0                rmarkdown_2.11           
##  [93] viridis_0.6.2             locfit_1.5-9.4           
##  [95] grid_4.1.3                readxl_1.3.1             
##  [97] reprex_2.0.1              digest_0.6.29            
##  [99] munsell_0.5.0             beeswarm_0.4.0           
## [101] viridisLite_0.4.0         vipor_0.4.5              
## [103] bslib_0.3.1

Introduction to single-cell RNA-seq analysis

Clustering and Marker Genes

Ashley Sawle

May 2022