Cluster sweep has already been run on the QC’d, filtered, normalised and batch corrected Caron data. Please see the script ClusterSweep.R for details.
Here the clusterSweep
results object, cluster behaviour metrics and the SCE object containing corrected data and the clusters are loaded.
out <- readRDS("../Robjects/clusterSweep.out.rds")
df <- read_tsv("../Robjects/clusterSweep.metrics_df.tsv")
## Rows: 36 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (1): cluster.fun
## dbl (4): k, num.clusters, silhouette, wcss
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sce <- readRDS("../Robjects/clusterSweep.sce.rds")
Cluster sweep was run with the following combinations of parameters:
out$parameters %>%
as.data.frame() %>%
select(`Cluster function`=cluster.fun, k) %>%
rownames_to_column("colData column") %>%
datatable(rownames=FALSE)
We will consider the number of clusters, the mean silhouette width and the sum of the within-cluster sum of squares as an initial assessment of cluster behaviour. To assess the behaviour of the clusterings we can plot these metrics against the k.
nclPlot <- ggplot(df, aes(x = k, y = num.clusters)) +
geom_line(aes(colour=cluster.fun), lwd=2)
silPlot <- ggplot(df, aes(x = k, y = silhouette)) +
geom_line(aes(colour=cluster.fun), lwd=2)
wcssPlot <- ggplot(df, aes(x = k, y = wcss)) +
geom_line(aes(colour=cluster.fun), lwd=2)
nclPlot + silPlot + wcssPlot + plot_layout(guides = "collect")
df %>%
filter(num.clusters>=10 & num.clusters<=15)
## # A tibble: 6 × 5
## k cluster.fun num.clusters silhouette wcss
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 60 leiden 12 0.234 1207.
## 2 10 louvain 15 0.172 1065.
## 3 15 louvain 14 0.173 1078.
## 4 20 louvain 11 0.178 1180.
## 5 25 louvain 11 0.174 1181.
## 6 30 louvain 10 0.185 1275.
We can now visualise our selected clusterings of interest on tSNE plots.
plotTSNE(sce, colour_by = "k.60_cluster.fun.leiden")
plotTSNE(sce, colour_by = "k.15_cluster.fun.louvain")
plotTSNE(sce, colour_by = "k.20_cluster.fun.louvain")
plotTSNE(sce, colour_by = "k.25_cluster.fun.louvain")
table(sce$k.60_cluster.fun.leiden)
##
## 1 2 3 4 5 6 7 8 9 10 11 12
## 12570 3789 5600 407 2581 1696 988 1230 7 168 333 165
table(sce$k.15_cluster.fun.louvain)
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14
## 2028 2696 3605 2260 4079 4157 723 2570 1635 1754 1428 1265 1221 113
table(sce$k.20_cluster.fun.louvain)
##
## 1 2 3 4 5 6 7 8 9 10 11
## 2025 4896 4374 4205 5523 2576 1630 1387 1501 1225 192
table(sce$k.25_cluster.fun.louvain)
##
## 1 2 3 4 5 6 7 8 9 10 11
## 2015 4946 3793 6093 4221 1786 1613 1370 1297 111 2289
plotSilhouette <- function(clusters, plotTitle){
approxSilhouette(reducedDim(sce, "corrected"),
clusters=clusters) %>%
as.data.frame() %>%
mutate(closestCluster = ifelse(width > 0, cluster, other) %>% factor()) %>%
ggplot(aes(x=cluster, y=width, colour=closestCluster)) +
ggbeeswarm::geom_quasirandom(method="smiley") +
ggtitle(plotTitle)
}
plotSilhouette(sce$k.60_cluster.fun.leiden, "Leiden k=60")
plotSilhouette(sce$k.15_cluster.fun.louvain, "Louvain k=15")
plotSilhouette(sce$k.20_cluster.fun.louvain, "Louvain k=20")
plotSilhouette(sce$k.25_cluster.fun.louvain, "Louvain k=25")
rownames(sce) <- uniquifyFeatureNames(rowData(sce)$ID, rowData(sce)$Symbol)
p1 <- plotTSNE(sce, by_exprs_values = "reconstructed",
colour_by = "MS4A1",
text_by = "k.60_cluster.fun.leiden")
p2 <- plotTSNE(sce, by_exprs_values = "reconstructed"
, colour_by = "CD79A",
text_by = "k.60_cluster.fun.leiden")
p3 <- plotTSNE(sce, by_exprs_values = "reconstructed"
, colour_by = "TNFRSF13C",
text_by = "k.60_cluster.fun.leiden")
p4 <- plotTSNE(sce, by_exprs_values = "reconstructed"
, colour_by = "BCL6",
text_by = "k.60_cluster.fun.leiden")
(p1 + p2) / (p3 + p4)
We can look at how cells move between clusters using clustree
combined <- cbind(k.15=sce$k.15_cluster.fun.louvain,
k.20=sce$k.20_cluster.fun.louvain,
k.25=sce$k.25_cluster.fun.louvain)
library(clustree)
## Loading required package: ggraph
set.seed(1111)
clustree(combined, prefix="k.", edge_arrow=FALSE)
## Warning: The `add` argument of `group_by()` is deprecated as of dplyr 1.0.0.
## Please use the `.add` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
jacc.mat <- linkClustersMatrix(sce$k.20_cluster.fun.louvain, sce$k.60_cluster.fun.leiden)
rownames(jacc.mat) <- paste("Louvain", rownames(jacc.mat))
colnames(jacc.mat) <- paste("Leiden", colnames(jacc.mat))
pheatmap(jacc.mat, color=viridis::viridis(100), cluster_cols=FALSE, cluster_rows=FALSE)
plotTSNE(sce,
colour_by = "k.50_cluster.fun.leiden",
text_by = "k.50_cluster.fun.leiden")
plotSilhouette(sce$k.50_cluster.fun.leiden, "Leiden k=50")
out.leiden <- clusterSweep(reducedDim(sce, "corrected"),
BLUSPARAM = NNGraphParam(),
k = as.integer(51:59),
cluster.fun = "leiden",
BPPARAM=BiocParallel::MulticoreParam(5))
saveRDS(out.leiden, "../Robjects/clusterSweep.out.leiden.rds")
out.leiden <- readRDS("../Robjects/clusterSweep.out.leiden.rds")
colData(sce) <- cbind(colData(sce), DataFrame(out.leiden$clusters))
plotTSNE(sce, colour_by = "k.52_cluster.fun.leiden")
plotTSNE(sce, colour_by = "k.54_cluster.fun.leiden")
plotTSNE(sce, colour_by = "k.56_cluster.fun.leiden")
plotTSNE(sce, colour_by = "k.57_cluster.fun.leiden")
plotTSNE(sce, colour_by = "k.58_cluster.fun.leiden")
I am going to move forward with Leiden & k=57 for now.
colLabels(sce) <- sce$k.57_cluster.fun.leiden
First we need to load the uncorrected data set. We’ll start from the object Caron_filtered_genes.rds
. This contains all of the samples from the complete data set, so we’ll to need to subset that to the just the 7 Caron samples we’re using here. Also, there are additional genes - these were filtered out after subsetting to the 7 samples based on expression - so we’ll need to subset the genes as well.
uncorrected <- readRDS("../Robjects/Caron_filtered_genes.rds")
# subset samples and genes to match current data set
samples <- colData(uncorrected)$Sample %in% colData(sce)$SampleName
genes <- rowData(sce)$ID
uncorrected <- uncorrected[genes, samples]
# Normalisation
clust <- quickCluster(uncorrected)
uncorrected <- computePooledFactors(uncorrected, cluster=clust, min.mean=0.1)
uncorrected <- logNormCounts(uncorrected)
colnames(colData(uncorrected))[1] <- "SampleName"
saveRDS(uncorrected, "../Robjects/LogNormalised.AllCells.Rds")
uncorrected <- readRDS("../Robjects/LogNormalised.AllCells.Rds")
We’ll switch the row names to be the gene symbols.
rownames(uncorrected) <- uniquifyFeatureNames(rowData(uncorrected)$ID,
rowData(uncorrected)$Symbol)
Just check everything matches.
all(colnames(sce)==colnames(uncorrected))
## [1] TRUE
all(rownames(sce)==rownames(uncorrected))
## [1] TRUE
Add the clusters and then run scoreMarkers
.
colLabels(uncorrected) <- colLabels(sce)
markers <- scoreMarkers(uncorrected,
groups = factor(uncorrected$label),
block =uncorrected$SampleName)
I asked you to identify this cluster:
Which is made up of clusters 3 and 7.
plotTSNE(sce, colour_by = "label", text_by = "label")
Let’s have a look at the top marker genes for cluster 3:
topRanked <- markers[["3"]] %>%
as.data.frame() %>%
select(contains("cohen")) %>%
filter(rank.logFC.cohen <= 5) %>%
arrange(desc(mean.logFC.cohen))
topRanked
## mean.logFC.cohen min.logFC.cohen median.logFC.cohen max.logFC.cohen
## RPS27 2.572383 0.73811891 2.4678557 4.425421
## RPS14 2.504531 0.91800401 1.6680303 10.263202
## RPL34 2.226986 1.20387675 2.0874785 3.628712
## RPS28 2.223351 0.81957934 1.5530260 10.649329
## RPS29 2.184457 1.23844928 1.8699763 3.488855
## RPL13 2.178147 1.12524233 2.0910725 3.304402
## RPL32 2.108779 0.99367983 1.9071179 3.633007
## RPL21 2.095168 0.64562257 1.9648405 3.668052
## TMSB4X 2.017420 -0.33431776 1.9750941 5.546751
## RPL30 1.992500 0.72713360 1.4985417 7.335081
## RPL10 1.967514 0.87392734 1.9332442 3.068205
## RPL17 1.930607 0.29472097 1.4880187 7.734110
## RPS18 1.897726 0.12548850 1.9047422 3.231535
## RPS6 1.853538 0.34703458 1.9948865 2.868671
## RPS23 1.742300 -0.29760012 1.3556752 7.649132
## RPL23A 1.718174 0.62701569 1.9031250 2.672944
## RPL3 1.673382 0.28363940 1.9353661 2.588791
## CD3D 1.576817 0.08674796 1.8876481 2.029542
## TMSB10 1.546154 -0.21179328 1.3563138 4.806533
## TRAC 1.543701 0.14978159 1.6722240 2.191058
## IFITM1 1.513202 0.01348136 1.5324712 2.363521
## CD3E 1.460382 0.20038530 1.5835656 1.992863
## IL32 1.411235 -0.61452751 1.6330326 1.998481
## CD52 1.283165 -0.65492626 1.3122820 2.424752
## LDHB 1.190076 -0.18989791 1.2109401 2.080541
## MALAT1 1.182334 -2.69269597 1.2913541 2.578865
## B2M 1.170893 -0.95805287 0.9815617 3.075659
## rank.logFC.cohen
## RPS27 1
## RPS14 2
## RPL34 1
## RPS28 1
## RPS29 2
## RPL13 1
## RPL32 2
## RPL21 1
## TMSB4X 1
## RPL30 5
## RPL10 4
## RPL17 3
## RPS18 3
## RPS6 5
## RPS23 4
## RPL23A 5
## RPL3 3
## CD3D 1
## TMSB10 2
## TRAC 4
## IFITM1 2
## CD3E 3
## IL32 2
## CD52 4
## LDHB 5
## MALAT1 4
## B2M 3
There’s lots of ribosomal genes at the top of this list. Let’s check the expression of a few:
plotExpression(uncorrected,
features = "RPS27",
x = "label")
plotExpression(uncorrected,
features = "RPL34",
x = "label")
plotExpression(uncorrected,
features = "RPL21",
x = "label")
These aren’t really going to be useful as marker genes.
On the other hand, further down we do have CD3D, CD3E and TRAC, which are T-cell markers, and IL32, which is an indicator of T-cell activation.
Let’s take a look at these:
p1 <- plotTSNE(sce, by_exprs_values = "reconstructed",
colour_by = "CD3D",
text_by = "label")
p2 <- plotTSNE(sce, by_exprs_values = "reconstructed",
colour_by = "CD3E",
text_by = "label")
p3 <- plotTSNE(sce, by_exprs_values = "reconstructed",
colour_by = "TRAC",
text_by = "label")
p4 <- plotTSNE(sce, by_exprs_values = "reconstructed",
colour_by = "IL32",
text_by = "label")
(p1 + p2) / (p3 + p4)
p1 <- plotExpression(uncorrected,
features = "CD3D",
x = "label")
p2 <- plotExpression(uncorrected,
features = "CD3E",
x = "label")
p3 <- plotExpression(uncorrected,
features = "TRAC",
x = "label")
p4 <- plotExpression(uncorrected,
features = "IL32",
x = "label")
(p1 + p2) / (p3 + p4)
Let’s take look at cluster 7’s marker genes.
topRanked <- markers[["7"]] %>%
as.data.frame() %>%
select(contains("cohen")) %>%
filter(rank.logFC.cohen <= 5) %>%
arrange(desc(mean.logFC.cohen))
topRanked
## mean.logFC.cohen min.logFC.cohen median.logFC.cohen max.logFC.cohen
## CCL5 2.2075687 1.44465697 2.254568824 2.642929
## IL32 2.0232574 0.61452751 2.189260723 2.654019
## TMSB4X 1.9586329 -0.37143535 1.831034175 5.495243
## B2M 1.9346700 0.26294176 1.743716089 3.823616
## NKG7 1.8215103 1.01236401 1.847630298 2.087532
## GZMA 1.6038448 1.13438673 1.643564267 1.820297
## HCST 1.5463851 0.63486712 1.748000251 2.202524
## IFITM1 1.4361641 -0.01348136 1.357749356 2.941717
## HLA-A 1.4265400 0.57382122 1.163071797 2.473989
## HLA-B 1.3934305 0.06325969 1.362796204 3.024164
## CST7 1.3691316 0.80135676 1.437293770 1.574874
## CD3D 1.3610269 -0.08674796 1.618695713 2.007717
## SRGN 1.2375019 -1.29274016 1.577943992 2.126550
## HLA-C 1.2252207 0.25916376 1.210141161 2.146057
## HLA-E 1.1453457 0.17720057 0.870915825 2.457511
## PTPRC 1.0667799 -0.82358684 1.330497134 2.243389
## CD69 1.0574469 0.08815270 1.073796927 1.906044
## MALAT1 0.7207896 -2.35712611 0.887094565 2.180593
## TMSB10 0.4218817 -1.33358545 0.394393791 2.782997
## RPS28 0.3181123 -1.57334610 -0.026046316 5.403688
## RPL30 0.2724373 -1.45985460 0.225232975 4.133590
## RPL17 0.2316713 -1.46855799 0.003035691 4.209861
## RPS14 0.2241283 -1.85091254 -0.034933131 4.694910
## RPL37A -0.1104011 -2.04367468 -0.349667710 3.723050
## rank.logFC.cohen
## CCL5 1
## IL32 2
## TMSB4X 1
## B2M 1
## NKG7 2
## GZMA 3
## HCST 4
## IFITM1 1
## HLA-A 2
## HLA-B 2
## CST7 4
## CD3D 5
## SRGN 5
## HLA-C 5
## HLA-E 4
## PTPRC 3
## CD69 3
## MALAT1 4
## TMSB10 4
## RPS28 1
## RPL30 4
## RPL17 3
## RPS14 2
## RPL37A 5
IL32, NKG7, GZMA and CD69 might suggest that these are activated T-cells.
p1 <- plotTSNE(sce, by_exprs_values = "reconstructed",
colour_by = "IL32",
text_by = "label")
p2 <- plotTSNE(sce, by_exprs_values = "reconstructed",
colour_by = "NKG7",
text_by = "label")
p3 <- plotTSNE(sce, by_exprs_values = "reconstructed",
colour_by = "GZMA",
text_by = "label")
p4 <- plotTSNE(sce, by_exprs_values = "reconstructed",
colour_by = "CD69",
text_by = "label")
(p1 + p2) / (p3 + p4)
sessionInfo()
## R version 4.1.3 (2022-03-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.6 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
##
## locale:
## [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
## [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
## [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] clustree_0.4.4 ggraph_2.0.5
## [3] DT_0.21 forcats_0.5.1
## [5] stringr_1.4.0 dplyr_1.0.8
## [7] purrr_0.3.4 readr_2.1.2
## [9] tidyr_1.2.0 tibble_3.1.6
## [11] tidyverse_1.3.1 patchwork_1.1.1
## [13] pheatmap_1.0.12 igraph_1.2.11
## [15] cluster_2.1.3 bluster_1.4.0
## [17] scran_1.22.1 scater_1.22.0
## [19] ggplot2_3.3.5 scuttle_1.4.0
## [21] SingleCellExperiment_1.16.0 SummarizedExperiment_1.24.0
## [23] Biobase_2.54.0 GenomicRanges_1.46.1
## [25] GenomeInfoDb_1.30.1 IRanges_2.28.0
## [27] S4Vectors_0.32.3 BiocGenerics_0.40.0
## [29] MatrixGenerics_1.6.0 matrixStats_0.61.0
##
## loaded via a namespace (and not attached):
## [1] ggbeeswarm_0.6.0 colorspace_2.0-3
## [3] ellipsis_0.3.2 XVector_0.34.0
## [5] BiocNeighbors_1.12.0 fs_1.5.2
## [7] rstudioapi_0.13 farver_2.1.0
## [9] graphlayouts_0.8.0 bit64_4.0.5
## [11] ggrepel_0.9.1 fansi_1.0.2
## [13] lubridate_1.8.0 xml2_1.3.3
## [15] sparseMatrixStats_1.6.0 knitr_1.39
## [17] polyclip_1.10-0 jsonlite_1.8.0
## [19] broom_0.7.12 dbplyr_2.1.1
## [21] ggforce_0.3.3 compiler_4.1.3
## [23] httr_1.4.2 dqrng_0.3.0
## [25] backports_1.4.1 assertthat_0.2.1
## [27] Matrix_1.4-0 fastmap_1.1.0
## [29] limma_3.50.1 cli_3.2.0
## [31] tweenr_1.0.2 BiocSingular_1.10.0
## [33] htmltools_0.5.2 tools_4.1.3
## [35] rsvd_1.0.5 gtable_0.3.0
## [37] glue_1.6.2 GenomeInfoDbData_1.2.7
## [39] Rcpp_1.0.8 cellranger_1.1.0
## [41] jquerylib_0.1.4 vctrs_0.3.8
## [43] crosstalk_1.2.0 DelayedMatrixStats_1.16.0
## [45] xfun_0.29 beachmat_2.10.0
## [47] rvest_1.0.2 lifecycle_1.0.1
## [49] irlba_2.3.5 statmod_1.4.36
## [51] edgeR_3.36.0 MASS_7.3-56
## [53] zlibbioc_1.40.0 scales_1.1.1
## [55] tidygraph_1.2.0 vroom_1.5.7
## [57] hms_1.1.1 parallel_4.1.3
## [59] RColorBrewer_1.1-2 yaml_2.3.5
## [61] gridExtra_2.3 sass_0.4.0
## [63] stringi_1.7.6 highr_0.9
## [65] ScaledMatrix_1.2.0 checkmate_2.0.0
## [67] BiocParallel_1.28.3 rlang_1.0.1
## [69] pkgconfig_2.0.3 bitops_1.0-7
## [71] evaluate_0.15 lattice_0.20-45
## [73] labeling_0.4.2 htmlwidgets_1.5.4
## [75] cowplot_1.1.1 bit_4.0.4
## [77] tidyselect_1.1.2 magrittr_2.0.2
## [79] R6_2.5.1 generics_0.1.2
## [81] metapod_1.2.0 DelayedArray_0.20.0
## [83] DBI_1.1.2 pillar_1.7.0
## [85] haven_2.4.3 withr_2.4.3
## [87] RCurl_1.98-1.6 modelr_0.1.8
## [89] crayon_1.5.0 utf8_1.2.2
## [91] tzdb_0.2.0 rmarkdown_2.11
## [93] viridis_0.6.2 locfit_1.5-9.4
## [95] grid_4.1.3 readxl_1.3.1
## [97] reprex_2.0.1 digest_0.6.29
## [99] munsell_0.5.0 beeswarm_0.4.0
## [101] viridisLite_0.4.0 vipor_0.4.5
## [103] bslib_0.3.1