Some of the materials originate in the Hemberg group course material with some of the text copied with a few edits. Also see the OSCA book’s “Basic” and “Advanced” chapters on clustering. In particular, please read the overview with regard to the comments on the “correctness” of any given clustering result.
Once we have normalized the data and removed confounders we can carry out analyses that are relevant to the biological questions at hand. The exact nature of the analysis depends on the data set. One of the most promising applications of scRNA-seq is de novo discovery and annotation of cell-types based on transcription profiles. This requires the identification of groups of cells based on the similarities of the transcriptomes without any prior knowledge of the labels, or unsupervised clustering. To avoid the challenges caused by the noise and high dimensionality of the scRNA-seq data, clustering is performed after feature selection and dimensionality reduction, for data that has not required batch correction this would usually on the PCA output. As our data has required batch correction we will use the “corrected” reducedDims data.
We will focus graph-based clustering, however, it is also possible to apply hierarchical clustering and k-means clustering on smaller data sets - see the OSCA book for details. Graph-base clustering is a more recent development and better suited for scRNA-seq, especially large data sets.
We will use the data set generated in the previous session. This contains 7 samples from the Caron data set. For the purposes of these materials, in the interests of time, each sample has been downsampled to only contain 500 cells.
sce <- readRDS("Robjects/DataIntegration_mnn.Rds")
Graph-based clustering entails building a nearest-neighbour (NN) graph using cells as nodes and their similarity as edges, then identifying ‘communities’ of cells within the network. A graph-based clustering method has three key parameters:
Two types of NN graph may be use “K nearest-neighbour” (KNN) and “shared nearest-neighbour” (SNN). In an KNN graph two nodes (cells), say A and B, are connected by an edge if the distance between them is amongst the k smallest distances from A to other cells. In an SNN graph A and B are connected if the distance is amongst the k samllest distances from A to other cells and also among the k smallest distance from B to other cells.
In the figure above, if k is 5, then A and B would be connected in a KNN graph as B is one of the 5 closest cells to A, however, they would not be connected in an SNN graph as B has 5 other cells that are closer to it than A.
The value of k can be roughly interpreted as the anticipated size of the smallest subpopulation” (see scran
’s buildSNNGraph()
manual).
The plot below shows the same data set as a network built using three different numbers of neighbours: 5, 15 and 25 (from here).
The edges between nodes (cells) can be weighted based on the similarity of the cells; edges connecting cells that are more closely related will have a higher weight. The three common methods for this weighting are (see the bluster package documentation for the makeSNNGraph
function):
Clusters are identified using an algorithm that interprets the connections of the graph to find groups of highly interconnected cells. A variety of different algorithms are available to do this, in these materials we will focus on three methods: walktrap, louvain and leiden. See the OSCA book for details of others available in scran.
The implementation of clustering in R is carried out using functions from a number of different packages, in particular the bluster and igraph packages. scran provides a handy “wrapper” function clusterCells
that allows us use a variety of different algorithms with one simple command.
By default clusterCells
just returns a vector containing the cluster number for each cell. We can also retrieve the intermediate statistics (varying according to the algorithm used) and the SNN graph by specifying the bluster argument full = TRUE
. If you are only interested in retrieving the clusters, this isn’t necessary but in this first instance we will retrieve the graph and visualise it.
clustering1 <- clusterCells(sce, use.dimred="corrected", full=TRUE)
This has defined 22 clusters with varying numbers of cells:
table(clustering1$clusters)
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 53 31 51 903 16 23 115 42 501 24 225 13 614 100 153 41 74 110 178 112
## 21 22
## 104 17
The number of cells in the data set is large and plotting all the cells would take too long, so we randomly choose 1000 nodes (cells) in the network before plotting the resulting smaller network. Adding sample data to the graph and plotting the results are done using the igraph package. Cells can be color-coded by sample type:
# extract the graph
snn.gr <- clustering1$objects$graph
# Add Sample group to vertices (nodes, ie cells)
V(snn.gr)$SampleGroup <- as.character(colData(sce)$SampleGroup)
# pick 1000 nodes randomly
set.seed(1423)
selectedNodes <- sample(3500, 1000)
# subset graph for these 1000 randomly chosen nodes
snn.gr.subset <- subgraph(snn.gr, selectedNodes)
# set colors for clusters
grps <- V(snn.gr.subset)$SampleGroup
cols <- c("dodgerblue", "lightyellow")[as.numeric(factor(grps))]
names(cols) <- grps
# plot graph
plot.igraph(snn.gr.subset,
layout = layout_with_fr(snn.gr.subset),
vertex.size = 3,
vertex.label = NA,
vertex.color = cols,
frame.color = cols,
main = "default parameters"
)
# add legend
legend('bottomright',
legend=unique(names(cols)),
pch=21,
pt.bg=unique(cols),
pt.cex=1, cex=.6, bty="n", ncol=1)
More commonly we will visualise the clusters by superimposing them on a t-SNE or UMAP plot. We can store the clusters in the sce
object colData
.
sce$Clusters1 <- clustering1$clusters
plotReducedDim(sce,
dimred = "TSNE_corrected",
colour_by="Clusters1",
text_by = "Clusters1")
Several methods to detect clusters (‘communities’) in networks rely on a metric called “modularity”. For a given partition of cells into clusters, modularity measures how separated clusters are from each other, based on the difference between the observed and expected weight of edges between nodes. For the whole graph, the closer to 1 the better.
The walktrap method relies on short random walks (a few steps) through the network. These walks tend to be ‘trapped’ in highly-connected regions of the network. Node similarity is measured based on these walks. Nodes are first each assigned their own community. Pairwise distances are computed and the two closest communities are grouped. These steps are repeated a given number of times to produce a dendrogram. Hierarchical clustering is then applied to the distance matrix. The best partition is that with the highest modularity. The original article describing the algorithm is Pons P, Latapy M (2006) Computing communities in large networks using random walks. J Graph Algorithms Appl 10(2):191–218
Walktrap is the default algorithm for clusterCells
, k is set to 10 by default and the default edge weighting is “rank”. To explicitly request a specific algorithm and to set the k to a different number of nearest neighbours, we use a SNNGraphParam
object from the bluster package (which is the package clusterCells is using under the hood).
Let’s set the k to 15 but keep the other parameters the same. This time we will just return the clusters:
sce$walktrap15 <- clusterCells(sce,
use.dimred = "corrected",
BLUSPARAM = SNNGraphParam(k = 15,
cluster.fun = "walktrap"))
This time we have defined 12 clustering. As a general rule, increasing k will tend to decrease the number of clusters (not always, but generally).
table(sce$walktrap15)
##
## 1 2 3 4 5 6 7 8 9 10 11 12
## 223 451 72 206 60 1329 53 36 237 124 87 622
We can visualise the assignment of cells from different samples to the clusters using a heatmap.
table(sce$walktrap15, sce$SampleName) %>%
pheatmap(cluster_rows = FALSE, cluster_cols = FALSE)
Most clusters comprise cells from several replicates of a same sample type, cluster 6 appears to be predominantly cells from the ETV6-RUNX samples.
We can visualise this on the TSNE:
plotReducedDim(sce,
dimred = "TSNE_corrected",
colour_by="walktrap15",
text_by = "walktrap15")
plotReducedDim(sce,
dimred = "TSNE_corrected",
colour_by="walktrap15",
text_by = "walktrap15",
other_fields = list("SampleGroup")) +
facet_wrap("SampleGroup")
The different clustering algorithms may have additional parameters, specific to the algorithm, that can be adjusted. With the walktrap algorithm we could also tweak the number of “steps” in each walk. The default is 4, but we could, for example, change this to 10 by adding the parameter cluster.args = list(steps = 10)
to the SNNGraphParam
object in the clusterCells
command.
With the Louvain method, nodes are also first assigned their own community. This hierarchical agglomerative method then progresses in two-step iterations:
These two steps are repeated until modularity stops increasing. The diagram below is copied from this article.
We now apply the Louvain approach, store its outcome in the SCE object and show cluster sizes.
sce$louvain15 <- clusterCells(sce,
use.dimred = "corrected",
BLUSPARAM = SNNGraphParam(k = 15, cluster.fun = "louvain"))
table(sce$louvain15)
##
## 1 2 3 4 5 6 7 8
## 534 785 659 618 129 222 384 169
The t-SNE plot shows cells color-coded by cluster membership:
plotReducedDim(sce,
dimred = "TSNE_corrected",
colour_by = "louvain15",
text_by = "louvain15")
If we split by sample type we can see differences in the clusters between the sample groups:
plotReducedDim(sce,
dimred = "TSNE_corrected",
colour_by="louvain15",
text_by = "louvain15",
other_fields = list("SampleGroup")) +
facet_wrap("SampleGroup")
The Leiden method improves on the Louvain method by guaranteeing that at each iteration clusters are connected and well-separated. The method includes an extra step in the iterations: after nodes are moved (step 1), the resulting partition is refined (step2) and only then the new aggregate network made, and refined (step 3). The diagram below is copied from this article.
For this exercise please run the clustering again, this time using the “leiden”, method.
Set the k to 20 and add the results of the clustering to the sce
object in a new column called “leiden20”.
How many clusters does this result in?
Visualize the clusters by plotting the t-SNE with the cells coloured according to your new clustering.
You will need to change the k
parameter and the cluster.fun
parameter in the SNNGraphParam object used in the clusterCells
function.
First run the clustering with clusterCells
:
sce$leiden20 <- clusterCells(sce,
use.dimred = "corrected",
BLUSPARAM = SNNGraphParam(k = 20,
cluster.fun = "leiden"))
We can quickly look at the results by summarising using table
.
table(sce$leiden20)
##
## 1 2 3 4 5 6 7 8 9 10 11
## 491 1256 198 22 632 1 115 236 270 114 165
There are 11 clusters, although cluster 6 contains only 1 cell.
The t-SNE plot shows cells color-coded by cluster membership:
plotReducedDim(sce,
dimred = "TSNE_corrected",
colour_by = "leiden20",
text_by = "leiden20")
A variety of metrics are available to aid us in assessing the behaviour of a particular clustering method on our data. These can help us in assessing how well defined different clusters within a single clustering are in terms of the relatedness of cells within the cluster and the how well separated that cluster is from cells in other clusters, and to compare the results of different clustering methods or parameter values (e.g. different values for k).
We will consider “Silhouette width” and “Modularity”. Further details and other metrics are described in the “Advanced” section of the OSCA book.
The silhouette width (so named after the look of the traditional graph for plotting the results) is a measure of how closely related cells within cluster are to one another versus how closely related cells in the cluster are to cells in other clusters. This allows us to assess cluster separation.
For each cell in the cluster we calculate the the average distance to all other cells in the cluster and the average distance to all cells not in the cluster. The cells silhouette width is the difference between these divided by the maximum of the two values. Cells with a large silhouette are strongly related to cells in the cluster, cells with a negative silhouette width are more closely related to other clusters.
We will use the approxSilhouette
function from the bluster package. The resulting table gives us the silhouette width for each cell, the cluster it belongs to, and which other cluster it is most closely related to.
sil.approx <- approxSilhouette(reducedDim(sce, "corrected"),
clusters=sce$leiden20)
sil.approx
## DataFrame with 3500 rows and 3 columns
## cluster other width
## <factor> <factor> <numeric>
## 1_AAACGGGCAGTTCATG-1 1 3 0.0694903
## 1_AAACGGGGTTCACCTC-1 1 2 0.2792964
## 1_AAAGATGAGCGATGAC-1 2 3 0.2005121
## 1_AAAGATGCAGCCAATT-1 2 3 -0.0565009
## 1_AAAGTAGCAATGCCAT-1 3 2 0.1841003
## ... ... ... ...
## 12_TTTCCTCGTCCAAGTT-1 5 7 0.401217
## 12_TTTGCGCCAGGCTCAC-1 2 8 0.137483
## 12_TTTGGTTTCCAAGCCG-1 8 2 0.138663
## 12_TTTGTCACAATGAAAC-1 5 3 0.416572
## 12_TTTGTCATCAGTTGAC-1 5 7 0.135120
We can view the results in as a beeswarm plot. We colour each cell according to either its current cluster or, if the cell has a negative silhouette width, the cluster that it is closest to.
plotSilBeeswarm <- function(silDat){
silTab <- silDat %>%
as.data.frame() %>%
mutate(closestCluster = ifelse(width > 0, cluster, other) %>% factor())
plt <- silTab %>%
ggplot(aes(x=cluster, y=width, colour=closestCluster)) +
ggbeeswarm::geom_quasirandom(method="smiley", alpha=0.6) +
theme_bw()
plt <- scater:::.resolve_plot_colours(plt, silTab$closestCluster, "closestCluster")
}
p1 <- plotSilBeeswarm(sil.approx)
p2 <- plotReducedDim(sce,
dimred = "TSNE_corrected",
colour_by="leiden20",
text_by = "leiden20")
p1 + p2
We could also look at the correspondence between different clusters by plotting these numbers on a grid showing for each cluster number of cells in that cluster that are closer to another cluster, colouring each tile by the proportion of the total cells in the cluster that it contains. Ideally we would like to see a strong diagonal band and only a few off-diagonal tiles containing small number of cells.
plotSilGrid <- function(silDat){
silDat %>%
as.data.frame() %>%
mutate(closestCluster = ifelse(width > 0, cluster, other) %>% factor()) %>%
count(cluster, closestCluster, name="olap") %>%
group_by(cluster) %>%
mutate(total = sum(olap)) %>%
mutate(proportion = olap / total) %>%
mutate(proportion = ifelse(cluster == closestCluster, proportion, -proportion)) %>%
ggplot(aes(x = cluster, y = closestCluster)) +
geom_tile(aes(fill = proportion)) +
geom_text(aes(label = olap), size=5) +
scale_fill_gradientn(colors = c("#fc8d59", "#ffffbf", "#91cf60"),
limits = c(-1, 1)) +
geom_vline(xintercept=seq(0.5, 30.5, by=1)) +
geom_hline(yintercept=seq(0.5, 30.5, by=1), colour="lightgrey", linetype=2) +
guides(fill = "none") +
theme(
aspect.ratio = 1,
panel.background = element_blank())
}
plotSilGrid(sil.approx)
From these two plots we can see that clusters 5 and 6 appear to have a good degree of separation, however, cluster 6 only contains 1 cell, whilst there are 49 cells in other clusters that appear closer to it than they are to their assigned cluster. Perhaps cluster 6 needs to be merged with cluster 2. In contrast, clusters 1 and 2 have a large number of cells that appear to be closer to cluster 3. It is possible clusters 1-3 contain a degree of heterogeneity that suggests further splitting of these clusters may be required.
Let’s do the same plots with the walktrap clusters generated with k=15.
sil.approx <- approxSilhouette(reducedDim(sce, "corrected"),
clusters=sce$walktrap15)
wp1 <- plotSilBeeswarm(sil.approx)
wp2 <- plotReducedDim(sce,
dimred = "TSNE_corrected",
colour_by="walktrap15",
text_by = "walktrap15")
wp3 <- plotSilGrid(sil.approx)
wp1 + wp2 + wp3
This clustering appears to have generated a set of clusters with better separatedness than the Leiden method with a k of 20.
And again with the louvain clusters
sil.approx <- approxSilhouette(reducedDim(sce, "corrected"),
clusters=sce$louvain15)
lp1 <- plotSilBeeswarm(sil.approx)
lp2 <- plotReducedDim(sce,
dimred = "TSNE_corrected",
colour_by="louvain15",
text_by = "louvain15")
lp3 <- plotSilGrid(sil.approx)
lp1 + lp2 + lp3
There seems to be a greater degree of overlap between these clusters, perhaps more resolution would improve this clustering - we might consider reducing k.
As mentioned early, the modularity metric is used in evaluating the separatedness between clusters. Some of the clustering algorithms, e.g. Louvain, seek to optimise this for the entire NN graph as part of their cluster detection. Modularity is a ratio between the observed weights of the edges within a cluster versus the expected weights if the edges were randomly distributed between all nodes. Rather than calculating a single modularity value for the whole graph, we can instead calculate a pair-wise modularity value between each pair of clusters using the pairwiseModularity
function from the bluster package. For this we need to have the graph from the clustering, so we will rerun the walktrap clustering with k=15 to obtain this. We can plot the resulting ratios on a heatmap. We would expect the highest modularity values to be on the diagonal.
walktrap15 <- clusterCells(sce,
use.dimred = "corrected",
BLUSPARAM = SNNGraphParam(k = 15, cluster.fun = "walktrap"),
full = TRUE)
g <- walktrap15$objects$graph
ratio <- pairwiseModularity(g, walktrap15$clusters, as.ratio=TRUE)
hm1 <- pheatmap(log2(ratio+1),
cluster_rows=FALSE,
cluster_cols=FALSE,
color=colorRampPalette(c("white", "blue"))(100))
We can compare this to the silhuoette width grid
wp4 <- ggplotify::as.ggplot(hm1)
wp2 + wp3 + wp4
Largely, this reflects what we saw from the silhouette widths, but also reveals some additional inter-connectedness between other clusters e.g. cluster 3 and cluster 8. We can also visualise this as network graph where nodes are clusters and the edge weights are the modularity.
cluster.gr <- igraph::graph_from_adjacency_matrix(log2(ratio+1),
mode="upper",
weighted=TRUE, diag=FALSE)
set.seed(11001010)
plot(cluster.gr,
edge.width=igraph::E(cluster.gr)$weight*5,
layout=igraph::layout_with_lgl)
The wide variety of clustering algorithms available and their differing outputs reflect the different ways that it is possible to view out data in high-dimensional space. It may often be useful to assess the concordance between different clustering methods in order to assess how clusters relate to each other - e.g. does one cluster from one method equate to just one cluster in the other or is it a combination of different clusters. This may be revealing about the underlying biology. We will use the Jaccard index as measure of concordance between clusters. A value of 1 represents perfect concordance between clusters (i.e. they contain exactly the same cells).
jacc.mat <- linkClustersMatrix(sce$louvain15, sce$walktrap15)
rownames(jacc.mat) <- paste("Louvain", rownames(jacc.mat))
colnames(jacc.mat) <- paste("Walktrap", colnames(jacc.mat))
pheatmap(jacc.mat, color=viridis::viridis(100), cluster_cols=FALSE, cluster_rows=FALSE)
We can see that Louvain clusters 4, 5, 6 and 7 are equivalent to walktrap clusters 12, 10, and 4. The remaining Louvain clusters are combinations of cells from various walktrap clusters. We may want to look at marker genes for these clusters to assess what these two different views are telling us about the biology.
As we have seen, there are a number of different parameters we can change to alter the final clustering result - primarily the k used to build the NN graph, the edge weighting method and the clustering algorithm. There is no one gold standard that will fit all data, so, in most cases, it is necessary to assess a number of different clusterings to obtain one that provides a view of the data that suits our biological interpretations. The clusterSweep
function allows us to apply a range of different parameters in one go and obtain the clustering for each.
For example, suppose we wish to assess the effect of different values of k on the walktrap clustering. We can parallelize this process to make it faster.
out <- clusterSweep(reducedDim(sce, "corrected"),
BLUSPARAM = NNGraphParam(),
k = as.integer(c(5, 10, 15, 20, 25)),
cluster.fun = "walktrap",
BPPARAM=BiocParallel::MulticoreParam(7))
The resulting object is list containing a DataFrame with the clusters for each combination of the clustering parameters and a corresponding DataFrame showing the parameters used to generate each of these:
out$clusters[,1:4]
## DataFrame with 3500 rows and 4 columns
## k.5_cluster.fun.walktrap k.10_cluster.fun.walktrap
## <factor> <factor>
## 1 4 14
## 2 4 14
## 3 6 4
## 4 6 9
## 5 11 9
## ... ... ...
## 3496 18 13
## 3497 6 4
## 3498 10 19
## 3499 18 13
## 3500 8 7
## k.15_cluster.fun.walktrap k.20_cluster.fun.walktrap
## <factor> <factor>
## 1 9 8
## 2 9 8
## 3 6 5
## 4 6 5
## 5 6 5
## ... ... ...
## 3496 12 6
## 3497 6 5
## 3498 4 9
## 3499 12 6
## 3500 12 6
out$parameters
## DataFrame with 5 rows and 2 columns
## k cluster.fun
## <integer> <character>
## k.5_cluster.fun.walktrap 5 walktrap
## k.10_cluster.fun.walktrap 10 walktrap
## k.15_cluster.fun.walktrap 15 walktrap
## k.20_cluster.fun.walktrap 20 walktrap
## k.25_cluster.fun.walktrap 25 walktrap
We can then combine this cluster sweep with the metrics for assessing cluster behaviour in order to get a overview of the effects of these parameter changes that may enable us to make some decisions as to which clustering or clusterings we may wish to investigate further.
Here we will just look at the mean silhouette width and the number of clusters.
df <- as.data.frame(out$parameters)
# get the number of clusters
df$num.clusters <- apply(out$clusters, 2, max)
# get the mean silhouette width
getMeanSil <- function(cluster) {
sil <- approxSilhouette(reducedDim(sce, "corrected"), cluster)
mean(sil$width)
}
df$silhouette <- map_dbl(as.list(out$clusters), getMeanSil)
nclPlot <- ggplot(df, aes(x = k, y = num.clusters)) +
geom_line(lwd=2)
silPlot <- ggplot(df, aes(x = k, y = silhouette)) +
geom_line(lwd=2)
nclPlot + silPlot
Based on our previous analysis and knowledge of the biology we may feel that 12 clusters represents a good number clusters, but we can see here that k = 15, ahd k = 20 provide this, but k = 20 gives us a slightly better silhouette score. On the other hand, perhaps k = 10 provides greater resolution of cell types, with more clusters with only a slight decrease in the silhouette score.
Earlier we looked at the Jaccard index as a means of comparing two different clusterings. We could apply the same method here:
jacc.mat <- linkClustersMatrix(out$clusters$k.10_cluster.fun.walktrap,
out$clusters$k.15_cluster.fun.walktrap)
rownames(jacc.mat) <- paste("Walktrap_10", rownames(jacc.mat))
colnames(jacc.mat) <- paste("Walktrap_15", colnames(jacc.mat))
pheatmap(jacc.mat, color=viridis::viridis(100), cluster_cols=FALSE, cluster_rows=FALSE)
The OSCA book provides some additional methods for comparing different clusterings that can be combined with the cluster sweep results to assess cluster behaviour under different parameters.
In this section, we have just done a sweep changing the k, but it is also possible to combine this with multiple clustering algorithms and multiple edge weightings.
For this exercise, you will rerun clusterSweep
with additional parameters.This time:
This will test 12 different clusterings.
This time, as well as setting the k
parameter to a vector that contains 10, 15, 20 and 25, you will also need to provide the cluster.fun
parameter with a vector containing the names of the three different clustering methods.
out <- clusterSweep(reducedDim(sce, "corrected"),
BLUSPARAM = NNGraphParam(),
k = as.integer(c(10, 15, 20, 25)),
cluster.fun = c("walktrap", "louvain", "leiden"),
BPPARAM=BiocParallel::MulticoreParam(7))
Now plot the number of clusters generated by each clustering and the mean silhouette width. You will need to adjust the plotting parameters to plot a different coloured line for each clustering algorithm (walktap, louvain and leiden).
df <- as.data.frame(out$parameters)
# get the number of clusters
df$num.clusters <- apply(out$clusters, 2, max)
# get the mean silhouette width
getMeanSil <- function(cluster) {
sil <- approxSilhouette(reducedDim(sce, "corrected"), cluster)
mean(sil$width)
}
df$silhouette <- map_dbl(as.list(out$clusters), getMeanSil)
nclPlot <- FIXME
silPlot <- FIXME
nclPlot + silPlot
aes(colour = cluster.fun)
into the geom_line
function. This will instruct ggplot2 to plot a different coloured line for clustering function.
df <- as.data.frame(out$parameters)
# get the number of clusters
df$num.clusters <- apply(out$clusters, 2, max)
# get the mean silhouette width
getMeanSil <- function(cluster) {
sil <- approxSilhouette(reducedDim(sce, "corrected"), cluster)
mean(sil$width)
}
df$silhouette <- map_dbl(as.list(out$clusters), getMeanSil)
nclPlot <- ggplot(df, aes(x = k, y = num.clusters)) +
geom_line(aes(colour=cluster.fun), lwd=2)
silPlot <- ggplot(df, aes(x = k, y = silhouette)) +
geom_line(aes(colour=cluster.fun), lwd=2)
nclPlot + silPlot
Bonus: if you have time, select 1 or 2 of the clusterings, calculate the silhouette widths and examine them further using the beeswarm plot and the silhouette grid.
approxSilhouette
you will need to use the clustering directly clusters
object in the out
list generated by clusterSweep
.
For example, we could take a look at the louvain with k = 10.
sil.approx <- approxSilhouette(reducedDim(sce, "corrected"),
clusters=out$clusters$k.10_cluster.fun.louvain)
plotSilBeeswarm(sil.approx)
plotSilGrid(sil.approx)
Finally, we can add all (or a subset) of the clusterings from clusterSweep
to our SCE object.
colData(sce) <- cbind(colData(sce), DataFrame(out$clusters))
When you have come to a decision about which clustering to use it is convenient to add it to colData
column called “label” using the colLabels
function. This means downstream code does not need to be changed should you later decide to switch to a different clustering (and makes the code easily re-usable for different analyses).
For now we will use the walktrap k=20 clustering.
colLabels(sce) <- sce$k.20_cluster.fun.walktrap
If we expect our clusters to represent known cell types for which there are well established marker genes, we can now start to investigate the clusters by plotting in parallel the expression of these genes. This can also help us in assessing if our clustering has satisfactorily partitioned our cells.
plotReducedDim(sce,
dimred = "TSNE_corrected",
colour_by = "label",
text_by = "label") +
ggtitle("Walktrap clusters")
Having identified clusters, we now display the level of expression of cell type marker genes to quickly match clusters with cell types. For each marker we will plot its expression on a t-SNE, and show distribution across each cluster on a violin plot.
We will be using gene symbols to identify the marker genes, so we will switch the rownames in the SCE object to be gene symbols. We use the scater function uniquifyFeatureNames
to do this as there are a few duplicated gene symbols.
rownames(sce) <- uniquifyFeatureNames(rowData(sce)$ID, rowData(sce)$Symbol)
p1 <- plotReducedDim(sce,
dimred = "TSNE_corrected",
by_exprs_values = "logcounts",
colour_by = "MS4A1",
text_by = "label")
p2 <- plotReducedDim(sce,
dimred = "TSNE_corrected",
by_exprs_values = "logcounts",
colour_by = "CD79A",
text_by = "label")
p1 + p2
plotExpression(sce,
exprs_values = "logcounts",
x = "label",
colour_by = "label",
features=c("MS4A1", "CD79A"))
p1 <- plotReducedDim(sce,
dimred = "TSNE_corrected",
by_exprs_values = "logcounts",
colour_by = "FCGR3A",
text_by = "label")
p2 <- plotReducedDim(sce,
dimred = "TSNE_corrected",
by_exprs_values = "logcounts",
colour_by = "MS4A7",
text_by = "label")
p1 + p2
plotExpression(sce,
exprs_values = "logcounts",
x = "label",
colour_by = "label",
features=c("FCGR3A", "MS4A7"))
p1 <- plotReducedDim(sce,
dimred = "TSNE_corrected",
by_exprs_values = "logcounts",
colour_by = "FCER1A",
text_by = "label")
p2 <- plotReducedDim(sce,
dimred = "TSNE_corrected",
by_exprs_values = "logcounts",
colour_by = "CST3",
text_by = "label")
p1 + p2
plotExpression(sce,
exprs_values = "logcounts",
x = "label",
colour_by = "label",
features=c("FCER1A", "CST3"))
Write SCE object to file.
saveRDS(sce, file="Robjects/Caron_clustering_material.rds")
sessionInfo()
## R version 4.2.0 (2022-04-22)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.6 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
##
## locale:
## [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
## [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
## [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] DT_0.23 forcats_0.5.1
## [3] stringr_1.4.0 dplyr_1.0.9
## [5] purrr_0.3.4 readr_2.1.2
## [7] tidyr_1.2.0 tibble_3.1.7
## [9] tidyverse_1.3.1 patchwork_1.1.1
## [11] pheatmap_1.0.12 igraph_1.3.2
## [13] cluster_2.1.3 bluster_1.6.0
## [15] scran_1.24.0 scater_1.24.0
## [17] ggplot2_3.3.6 scuttle_1.6.2
## [19] SingleCellExperiment_1.18.0 SummarizedExperiment_1.26.1
## [21] Biobase_2.56.0 GenomicRanges_1.48.0
## [23] GenomeInfoDb_1.32.2 IRanges_2.30.0
## [25] S4Vectors_0.34.0 BiocGenerics_0.42.0
## [27] MatrixGenerics_1.8.0 matrixStats_0.62.0
##
## loaded via a namespace (and not attached):
## [1] ggbeeswarm_0.6.0 colorspace_2.0-3
## [3] ellipsis_0.3.2 XVector_0.36.0
## [5] BiocNeighbors_1.14.0 fs_1.5.2
## [7] rstudioapi_0.13 farver_2.1.0
## [9] ggrepel_0.9.1 fansi_1.0.3
## [11] lubridate_1.8.0 xml2_1.3.3
## [13] codetools_0.2-18 sparseMatrixStats_1.8.0
## [15] knitr_1.39 jsonlite_1.8.0
## [17] broom_0.8.0 dbplyr_2.2.0
## [19] compiler_4.2.0 httr_1.4.3
## [21] dqrng_0.3.0 backports_1.4.1
## [23] assertthat_0.2.1 Matrix_1.4-1
## [25] fastmap_1.1.0 limma_3.52.2
## [27] cli_3.3.0 BiocSingular_1.12.0
## [29] htmltools_0.5.2 tools_4.2.0
## [31] rsvd_1.0.5 gtable_0.3.0
## [33] glue_1.6.2 GenomeInfoDbData_1.2.8
## [35] Rcpp_1.0.8.3 cellranger_1.1.0
## [37] jquerylib_0.1.4 vctrs_0.4.1
## [39] crosstalk_1.2.0 DelayedMatrixStats_1.18.0
## [41] xfun_0.31 beachmat_2.12.0
## [43] rvest_1.0.2 lifecycle_1.0.1
## [45] irlba_2.3.5 statmod_1.4.36
## [47] edgeR_3.38.1 zlibbioc_1.42.0
## [49] scales_1.2.0 hms_1.1.1
## [51] parallel_4.2.0 RColorBrewer_1.1-3
## [53] yaml_2.3.5 gridExtra_2.3
## [55] yulab.utils_0.0.5 sass_0.4.1
## [57] stringi_1.7.6 highr_0.9
## [59] ScaledMatrix_1.4.0 BiocParallel_1.30.3
## [61] rlang_1.0.2 pkgconfig_2.0.3
## [63] bitops_1.0-7 evaluate_0.15
## [65] lattice_0.20-45 labeling_0.4.2
## [67] htmlwidgets_1.5.4 cowplot_1.1.1
## [69] tidyselect_1.1.2 magrittr_2.0.3
## [71] R6_2.5.1 generics_0.1.2
## [73] metapod_1.4.0 DelayedArray_0.22.0
## [75] DBI_1.1.3 pillar_1.7.0
## [77] haven_2.5.0 withr_2.5.0
## [79] RCurl_1.98-1.7 modelr_0.1.8
## [81] crayon_1.5.1 utf8_1.2.2
## [83] tzdb_0.3.0 rmarkdown_2.14
## [85] viridis_0.6.2 locfit_1.5-9.5
## [87] grid_4.2.0 readxl_1.4.0
## [89] reprex_2.0.1 digest_0.6.29
## [91] gridGraphics_0.5-1 munsell_0.5.0
## [93] ggplotify_0.1.0 beeswarm_0.4.0
## [95] viridisLite_0.4.0 vipor_0.4.5
## [97] bslib_0.3.1