In this section we are going to cover the basics of dimensionality reduction methods to represent our multi-dimensional data (with thousands of cells and thousands of genes) in a reduced set of dimensions for visualisation and more efficient downstream analysis. Before we dive into these methods, we will see how to select genes with biologically meaningful signal, to avoid complicating our analysis with potentially “noisy” data.
Before we start, let’s load our packages and read our data in.
# load packages ----
library(scater)
library(scran)
library(PCAtools)
library(tidyverse) # always load tidyverse after other packages
We will load the SingleCellExperiment object generated in the Normalisation section, which contains normalised counts for 500 cells per sample.
# read data ----
sce <- readRDS("Robjects/caron_postDeconv_5hCellPerSpl.Rds")
sce
## class: SingleCellExperiment
## dim: 28377 5500
## metadata(1): Samples
## assays(2): counts logcounts
## rownames(28377): ENSG00000238009 ENSG00000239945 ... ENSG00000278633
## ENSG00000278817
## rowData names(7): ID Symbol ... detected gene_sparsity
## colnames(5500): 1_AAACGGGCAGTTCATG-1 1_AAACGGGGTTCACCTC-1 ...
## 12_TTTGTCAAGTGGTAGC-1 12_TTTGTCAGTACAGCAG-1
## colData names(17): SampleName Barcode ... sizeFactor cell_sparsity
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
To make some of our plots later on easier to interpret, we will replace the rownames of the object (containing Ensembl gene IDs) with the common gene names. Sometimes it happens that there is no common gene name, or different genes have the same common name. In both cases, we should instead keep the Ensembl ID, to guarantee unique gene names in our object. A safe way to do this, is to use the uniquifyFeatureNames()
function:
# use common gene names instead of Ensembl gene IDs
rownames(sce) <- uniquifyFeatureNames(rownames(sce), rowData(sce)$Symbol)
We often use scRNA-seq data in exploratory analyses to characterize heterogeneity across cells. Procedures like clustering and dimensionality reduction compare cells based on their gene expression profiles, which involves aggregating per-gene differences into a single (dis)similarity metric between every pair of cells. The choice of genes to use in this calculation has a major impact on the behavior of the metric and the performance of downstream methods. We want to select genes that contain useful information about the biology of the system while removing genes that contain random noise. This aims to preserve interesting biological structure without the variance that obscures that structure, and to reduce the size of the data to improve computational efficiency of later steps.
The simplest approach to feature selection is to select the most variable genes based on their expression across the population. This assumes that genuine biological differences will manifest as increased variation in the affected genes, compared to other genes that are only affected by technical noise or a baseline level of “uninteresting” biological variation (e.g., from transcriptional bursting). Several methods are available to quantify the variation per gene and to select an appropriate set of highly variable genes (HVGs).
Some assays allow the inclusion of known molecules in a known amount covering a wide range, from low to high abundance: spike-ins. The technical noise is assessed based on the amount of spike-ins used, the corresponding read counts obtained and their variation across cells. The variance in expression can then be decomposed into the biological and technical components.
UMI-based assays do not (yet?) allow spike-ins. But one can still identify highly variable genes (HVGs), which likely capture biological variation. Assuming that, for most genes, the observed variance across cells is due to technical noise, we can assess technical variation by fitting a trend line between the mean-variance relationship across all genes. Genes that substantially deviate from this trend may then be considered as highly-variable, i.e. capturing biologically interesting variation.
# Mean-variance model ----
# fit the model; the output is a DataFrame object
gene_var <- modelGeneVar(sce)
# plot the variance against mean of expression with fitted trend line
gene_var %>%
# convert to tibble for ggplot
as_tibble() %>%
# make the plot
ggplot(aes(mean, total)) +
geom_point() +
geom_line(aes(y = tech), colour = "dodgerblue", size = 1) +
labs(x = "Mean of log-expression", y = "Variance of log-expression")
Once we have quantified the per-gene variation, the next step is to select the subset of HVGs to use in downstream analyses. A larger subset will reduce the risk of discarding interesting biological signal by retaining more potentially relevant genes, at the cost of increasing noise from irrelevant genes that might obscure said signal. It is difficult to determine the optimal trade-off for any given application as noise in one context may be useful signal in another. Commonly applied strategies are:
Top 1000 genes: getTopHVGs(gene_var, n=1000)
Top 10% genes: getTopHVGs(gene_var, prop=0.1)
getTopHVGs(gene_var, fdr.threshold = 0.05)
getTopHVGs(gene_var, var.threshold = 0)
In our example, we will define ‘HVGs’ as the top 10% of genes with the highest biological component. This is a fairly arbitrary choice. A common practice is to pick an arbitrary threshold (either based on number of proportion) and proceed with the rest of the analysis, with the intention of testing other choices later, rather than spending much time worrying about obtaining the “optimal” value.
# identify the top 10% most variable genes
hvgs <- getTopHVGs(gene_var, prop=0.1)
length(hvgs) # check how many genes we have
## [1] 1237
hvgs[1:10] # check the first 10 of those genes
## [1] "HBB" "HBA2" "HBA1" "HBD" "CD74" "HBM"
## [7] "AHSP" "HLA-DRA" "CA1" "SLC25A37"
The result is a vector of gene IDs ordered by their biological variance (i.e. highest deviation from the trend line shown above). We can use this with functions that accept a list of genes as option to restrict their analysis to that subset of genes (e.g. when we do PCA later on).
For example, we may want to visualise the expression of the top most-variable genes determined in this way. We can do this with a violin plot for each gene, using the plotExpression()
function:
# visualise expression of top HGVs
# point_alpha adds transparency to the points
# jitter option determines the way in which the points are "jittered"
# try running this with and without those options to see the difference
plotExpression(sce, features = hvgs[1:20], point_alpha = 0.05, jitter = "jitter")