In order to aid the interpretation of the clustering results we covered in the previous section, it is helpful to identify genes that are contributing to the separation of cells into those clusters.
The main approach to achieve this, is to identify genes that are differently expressed between clusters. These may be, for example, exclusively expressed in a single cluster or perhaps differentiate between a few different clusters. There are different methods to identify differentially expressed genes, focusing on differences between the mean expression level, or the probability of expression between clusters or take the whole distribution of expression into account.
Our main objective in this section is to cover some of the methods that can be used to achieve this goal, and obtain a summary table of results.
Before we start, let’s load our packages and read our data in.
# Load packages ----
library(scater)
library(scran)
library(pheatmap)
library(tidyverse) # always load tidyverse after other packages
We will load two SingleCellExperiment objects generated in previous sections:
batchelor::fastMNN()
algorithm, including clusters covered in the Clustering section.# Read data ----
# normalised within batches without batch-integration correction
uncorrected <- readRDS("Robjects/DataIntegration_uncorrected.Rds")
rownames(uncorrected) <- uniquifyFeatureNames(rownames(uncorrected), rowData(uncorrected)$Symbol)
# data corrected using batch integration with mutual nearest neighbours
corrected <- readRDS("Robjects/caron_postDeconv_5hCellPerSpl_dsi_PBMMC_ETV6-RUNX1_clust.Rds")
rownames(corrected) <- uniquifyFeatureNames(rownames(corrected), rowData(corrected)$Symbol)
Note that we also replace the gene names (rownames) of our objects to use common gene names instead of Ensembl IDs. Using the uniquifyFeatureNames()
function makes this safely, avoiding duplicate gene names.
As a reminder, the corrected
object contains the cell cluster assignments obtained, for example, with the “louvain” algorithm:
# visualise cluster assignments on the corrected data
plotUMAP(corrected,
colour_by = "louvain",
text_by = "louvain")
It will also be helpful for data visualisation purposes to copy the cluster assignment in the corrected data to our uncorrected (normalised) SCE object:
# copy cluster assignments to the uncorrected object
# first make sure that the cell names are in the same order
all(colnames(uncorrected) == colnames(corrected))
## [1] TRUE
colData(uncorrected)$louvain <- factor(colData(corrected)$louvain)
Our objective is to identify genes that distinguish between these clusters. For example genes such as the “CST3” gene, which is a known monocyte marker:
# visualise a previously known marker gene (for monocytes)
plotTSNE(corrected,
colour_by = "CST3",
text_by = "louvain",
by_exprs_values = "reconstructed")