1 Introduction

In this section we are going to cover the basics of feature selection and dimensionality reduction. These methods allow us to represent our multi-dimensional data (with thousands of cells and thousands of genes) in a reduced set of dimensions for visualisation and more efficient downstream analysis.

In feature selection the principle is to remove genes that are uninteresting or uninformative, both to improve computation and because this ‘noise’ reduction will hopefully let us see the true biology in our data more clearly. We make the assumption that most of the low-level variance is not caused by real biology, but by stochastic sampling in the single-cell protocol and various other technical effects. The genes with the most variance are therefore the ones that reflect real biological differences, and these are what we want to focus on. This is obviously not a perfect assumption, but it is a good way to make informed interpretations of your data in which you can have a little more confidence.

In dimensionality reduction we attempt to represent all the information we have in expression space in a way that we can easily interpret. High-dimensional data has several issues. There is a high computational cost to performing analyses on 30,000 genes and 48,000 cells (in the Caron dataset). Humans live in a 3D world: we can’t easily visualise all the dimensions. And then there is sparsity: as the number of dimensions increases, the distance between two data points increases and becomes more invariant, i.e. pairwise distances become increasingly similar to one another. This invariance causes problems when we try to cluster the data into biologically similar groupings. By applying dimensionality reduction methods we can mitigate these issues to a level where we can make interpretations.
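
We can get a feel for this effect of dimensionality with a quick simulation (illustration only, using base R): as the number of dimensions grows, the distances between random points become larger and relatively more similar to one another, i.e. their coefficient of variation shrinks.

# illustration: pairwise distances between 100 random points
# in spaces of increasing dimensionality
set.seed(123)
for (d in c(2, 100, 10000)) {
  pts <- matrix(rnorm(100 * d), ncol = d)
  dists <- as.numeric(dist(pts))
  cat(sprintf("dims = %5d: mean distance = %7.2f, cv = %.3f\n",
              d, mean(dists), sd(dists) / mean(dists)))
}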

2 Setup

Before we start, let’s load our packages and read our data in.

library(scater) 
library(scran)
library(PCAtools)
library(tidyverse)

We will load the SingleCellExperiment object generated in the Normalisation section, which contains normalised counts for 500 cells per sample. For demonstration purposes we are not using the full dataset, but you would use it in your own analyses.

sce <- readRDS("R_objects/Caron_normalized.500.rds")
sce
## class: SingleCellExperiment 
## dim: 33102 5500 
## metadata(1): Samples
## assays(2): counts logcounts
## rownames(33102): ENSG00000243485 ENSG00000238009 ... ENSG00000276017
##   ENSG00000278817
## rowData names(4): ID Symbol Type Chromosome
## colnames(5500): 1_CGACTTCGTCCAGTTA-1 1_AGAATAGCATACGCTA-1 ...
##   8_AGCGTCGAGATGTAAC-1 8_CATGACAAGATAGGAG-1
## colData names(11): Sample Barcode ... total sizeFactor
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):

To make some of our plots later on easier to interpret, we will replace the rownames of the object (containing Ensembl gene IDs) with the gene symbol. Sometimes there is no gene symbol, and in some cases multiple genes share the same symbol (e.g. RGS5). A safe way to handle these cases is to use the uniquifyFeatureNames() function, which will add the Ensembl gene ID to the symbol where necessary to distinguish between genes.

rownames(sce) <- uniquifyFeatureNames(rownames(sce), rowData(sce)$Symbol)

3 Feature Selection

We often use scRNA-seq data in exploratory analyses to characterize heterogeneity across cells. Procedures like clustering and dimensionality reduction compare cells based on their gene expression profiles, which involves aggregating per-gene differences into a single (dis)similarity metric between every pair of cells. The choice of genes to use in this calculation has a major impact on the behavior of the metric and the performance of downstream methods. We want to select genes that contain useful information about the biology of the system while removing genes that contain random noise. This aims to preserve interesting biological structure without the variance that obscures that structure, and to reduce the size of the data to improve computational efficiency of later steps.

The simplest approach to feature selection is to select the most variable genes based on their expression across the population. This assumes that genuine biological differences will manifest as increased variation in the affected genes, compared to other genes that are only affected by technical noise or a baseline level of “uninteresting” biological variation (e.g., from transcriptional bursting). Several methods are available to quantify the variation per gene and to select an appropriate set of highly variable genes (HVGs).

3.1 Quantifying per-gene variation

Some assays allow the inclusion of known molecules in known amounts covering a wide range of abundances: spike-ins. The technical noise is assessed based on the amounts of spike-ins used, the corresponding read counts obtained and their variation across cells. The variance in expression can then be decomposed into biological and technical components. There is discussion over whether this step is necessary, but like all of the decisions you make in your analyses it would be wise to optimise your parameters and steps over several iterations, with a view to existing biological knowledge and controls.
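
For datasets that do include spike-ins, scran provides the related function modelGeneVarWithSpikes(), which fits the technical trend to the spike-in transcripts instead of to the endogenous genes. A minimal sketch (not run here, as our dataset contains no spike-ins; "ERCC" is an assumed name for the alternative experiment holding the spike-in counts):

# sketch only: requires spike-ins stored as an altExp, here assumed to be "ERCC"
gene_var_spikes <- modelGeneVarWithSpikes(sce, spikes = "ERCC")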

The commonly used UMI-based assays do not (yet?) allow spike-ins. But one can still identify highly variable genes (HVGs), which likely capture biological variation. Assuming that, for most genes, the observed variance across cells is due to technical noise, we can assess technical variation by fitting a trend line to the mean-variance relationship across all genes. Genes that substantially deviate from this trend may then be considered highly variable, i.e. capturing biologically interesting variation. The scran function modelGeneVar will carry out this estimation for us.

The resulting output is a DataFrame with our genes as rows and several columns. The function has modelled the mean-variance relationship and, from that fit, estimated the total variance of each gene along with its estimated biological and technical components.

gene_var <- modelGeneVar(sce)

gene_var
## DataFrame with 33102 rows and 6 columns
##                        mean       total        tech          bio    p.value
##                   <numeric>   <numeric>   <numeric>    <numeric>  <numeric>
## MIR1302-2HG     0.000296568 0.000483740 0.000359017  1.24723e-04 0.00218142
## ENSG00000238009 0.001885572 0.002130691 0.002282621 -1.51930e-04 0.70752449
## ENSG00000239945 0.000148617 0.000121478 0.000179911 -5.84331e-05 0.99615196
## ENSG00000241860 0.005441182 0.006111014 0.006586943 -4.75929e-04 0.72337243
## ENSG00000286448 0.000000000 0.000000000 0.000000000  0.00000e+00        NaN
## ...                     ...         ...         ...          ...        ...
## ENSG00000277856 0.000352207 0.000371456 0.000426372 -5.49152e-05 0.85471363
## ENSG00000275063 0.002842271 0.004530285 0.003440774  1.08951e-03 0.00468417
## ENSG00000275869 0.000000000 0.000000000 0.000000000  0.00000e+00        NaN
## ENSG00000276017 0.000000000 0.000000000 0.000000000  0.00000e+00        NaN
## ENSG00000278817 0.004954633 0.007312049 0.005997940  1.31411e-03 0.03610314
##                       FDR
##                 <numeric>
## MIR1302-2HG     0.0212537
## ENSG00000238009 1.0000000
## ENSG00000239945 1.0000000
## ENSG00000241860 1.0000000
## ENSG00000286448       NaN
## ...                   ...
## ENSG00000277856 1.0000000
## ENSG00000275063 0.0406691
## ENSG00000275869       NaN
## ENSG00000276017       NaN
## ENSG00000278817 0.2082726

We can then plot variance vs mean for each gene and visualise the limit of technical variance according to the fitted trend line. We would regard those features below the trend line as technical noise.

gene_var %>% 
  as.data.frame() %>% 
  ggplot(aes(mean, total)) +
  geom_point() +
  geom_line(aes(y = tech), colour = "dodgerblue", size = 1) +
  labs(x = "Mean of log-expression", y = "Variance of log-expression")

3.2 Selecting highly variable genes

Once we have quantified the per-gene variation, the next step is to select the subset of HVGs to use in downstream analyses. A larger subset will reduce the risk of discarding interesting biological signal by retaining more potentially relevant genes, at the cost of increasing noise from irrelevant genes that might obscure said signal. It is difficult to determine the optimal trade-off for any given application as noise in one context may be useful signal in another. Commonly applied strategies are:

  • take top X genes with largest (biological) variation

Top 1000 genes: getTopHVGs(gene_var, n=1000)

Top 10% genes: getTopHVGs(gene_var, prop=0.1)

  • based on significance

getTopHVGs(gene_var, fdr.threshold = 0.05)

  • keeping all genes above the trend

getTopHVGs(gene_var, var.threshold = 0)

  • selecting a priori genes of interest

In our example, we will define ‘HVGs’ as the top 10% of genes with the highest biological component. This is a fairly arbitrary choice. A common practice is to pick an arbitrary threshold (based either on number or proportion) and proceed with the rest of the analysis, with the intention of testing other choices later, rather than spending much time worrying about obtaining the “optimal” value.

hvgs <- getTopHVGs(gene_var, prop=0.1)
length(hvgs)
## [1] 1449
hvgs[1:10]
##  [1] "HBB"      "HBA2"     "HBA1"     "HBD"      "HBM"      "CD74"    
##  [7] "AHSP"     "HLA-DRA"  "CA1"      "SLC25A37"

The result is a vector of gene IDs ordered by their biological variance (i.e. their deviation from the trend line shown above). We can use this with functions that accept a list of genes as an option to restrict their analysis to that subset of genes (e.g. when we do PCA later on).

We can visualise the expression of the top most-variable genes with a violin plot for each gene using the plotExpression() function:

plotExpression(sce, features = hvgs[1:20], point_alpha = 0.05)

4 Dimensionality Reduction

Many scRNA-seq analysis procedures involve comparing cells based on their expression values across thousands of genes. Thus, each individual gene represents a dimension of the data (and the total number of genes represents the “dimensionality” of the data). More intuitively, if we had a scRNA-seq data set with only two genes, we could visualise our data in a two-dimensional scatterplot, with each axis representing the expression of a gene and each point in the plot representing a cell. We can imagine the same for 3 genes, represented as a 3D plot. Although it becomes harder to imagine, this concept can be extended to data sets with thousands of genes (dimensions), where each cell’s expression profile defines its location in the high-dimensional expression space.
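
For example (a quick illustration using our own data), we can place each cell in the two-dimensional expression space defined by a pair of genes; CD79A and CD3D are marker genes we will meet again later in this section:

# each point is a cell, positioned by its expression of two genes
plot(logcounts(sce)["CD79A", ], logcounts(sce)["CD3D", ],
     pch = 16, cex = 0.3,
     xlab = "CD79A expression (logcounts)",
     ylab = "CD3D expression (logcounts)")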

As the name suggests, dimensionality reduction aims to reduce the number of separate dimensions in the data. This is possible because different genes are correlated if they are affected by the same biological process. Thus, we do not need to store separate information for individual genes, but can instead compress multiple features into a single dimension, e.g., an “eigengene” (Langfelder and Horvath 2007). This reduces computational work in downstream analyses like clustering, as calculations only need to be performed for a few dimensions, rather than thousands. It also reduces noise by averaging across multiple genes to obtain a more precise representation of the patterns in the data. And finally it enables effective visualisation of the data, for those of us who are not capable of visualizing more than 2 or 3 dimensions.

Here, we will cover the three methods that are most commonly used in scRNA-seq analysis:

  • Principal Components Analysis (PCA)

  • t-Distributed Stochastic Neighbor Embedding (t-SNE)

  • Uniform Manifold Approximation and Projection (UMAP)

Before we go into the details of each method, it is important to mention that while the first method (PCA) can be used for downstream analysis of the data (such as cell clustering), the latter two methods (t-SNE and UMAP) should only be used for visualisation and not for any other kind of analysis.

4.1 Principal Components Analysis

One of the most widely used and well-known methods of dimensionality reduction is principal components analysis (PCA). This method performs a linear transformation of the data, such that a set of variables (genes) is turned into new variables called principal components (PCs). These principal components combine information across several genes in a way that best captures the variability observed across samples (cells).

Watch the video linked below for more details of how PCA works:

StatQuest: PCA

After performing a PCA, there is no data loss, i.e. the total number of variables does not change. Only the fraction of variance captured by each variable differs.

Each PC represents a dimension in the new expression space. The first PC explains the highest proportion of variance possible. The second PC explains the highest proportion of variance not explained by the first PC. And so on: successive PCs each explain a decreasing amount of variance not captured by the previous ones.
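
As a toy illustration of this (separate from our single-cell data), consider just two strongly correlated variables: after the PCA rotation we still have two dimensions, but the first one captures almost all of the variance.

# toy example: two correlated "genes" measured across 100 "cells"
set.seed(123)
x <- rnorm(100)
toy <- data.frame(gene1 = x, gene2 = x + rnorm(100, sd = 0.3))
pca_toy <- prcomp(toy)
# the number of variables is unchanged, but PC1 explains most of the variance
summary(pca_toy)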

The advantage of using PCA is that the total amount of variance explained by the first few PCs is usually enough to capture most of the signal in the data. Therefore, we can exclude the remaining PCs without much loss of information. The stronger the correlation between the initial variables, the stronger the reduction in dimensionality. We will see below how we can choose how many PCs to retain for our downstream analysis.

4.1.1 Running PCA

As discussed in the Preprocessing and QC section, SingleCellExperiment objects contain a slot that can store representations of our data in reduced dimensions. This is useful as we can keep all the information about our single-cell data within a single object.

The runPCA() function can be used to run PCA on a SCE object, and returns an updated version of the single cell object with the PCA result added to the reducedDim slot.

Importantly, we can also restrict the PCA to use only some of the features (rows) of the object, which in this case we do by using the highly variable genes we identified earlier.

sce <- runPCA(sce, subset_row = hvgs)
sce
## class: SingleCellExperiment 
## dim: 33102 5500 
## metadata(1): Samples
## assays(2): counts logcounts
## rownames(33102): MIR1302-2HG ENSG00000238009 ... ENSG00000276017
##   ENSG00000278817
## rowData names(4): ID Symbol Type Chromosome
## colnames(5500): 1_CGACTTCGTCCAGTTA-1 1_AGAATAGCATACGCTA-1 ...
##   8_AGCGTCGAGATGTAAC-1 8_CATGACAAGATAGGAG-1
## colData names(11): Sample Barcode ... total sizeFactor
## reducedDimNames(1): PCA
## mainExpName: NULL
## altExpNames(0):

We can see that the output shows a new reducedDimNames value called “PCA”. We can access it by using the reducedDim() function:

reducedDim(sce, "PCA")[1:10, 1:5]
##                           PC1       PC2        PC3       PC4        PC5
## 1_CGACTTCGTCCAGTTA-1 4.857944 -8.678251 -1.9769310  7.763731 -1.8736044
## 1_AGAATAGCATACGCTA-1 6.975464 -6.821056  0.8777444  7.226113 -2.0904117
## 1_TGACTAGAGAACTCGG-1 5.418403 -6.046118 -1.4431368  8.104965 -1.8340643
## 1_CTTAACTGTTATGCGT-1 6.710139 -8.502934 11.4443355 -8.604738  0.5148573
## 1_CCCAGTTTCAAGCCTA-1 6.394137 -9.307770  0.5732515  6.068439 -2.0552334
## 1_TACAGTGCACGACGAA-1 4.711232 -7.563851  3.0215653  7.903445 -0.9057246
## 1_CTTTGCGTCCCATTTA-1 7.314671 -9.222312  1.8459769  3.971201 -2.3524221
## 1_GCGCAGTCACGGTGTC-1 5.489230 -9.750849 -1.4120589  7.026349 -2.8158097
## 1_AGCTTGAAGGAGTACC-1 5.283623 -8.215514 -1.1146285  7.258839 -2.8973718
## 1_AGGTCATGTAAGCACG-1 5.451176 -7.573401  0.6106959  6.623330 -1.6791321

By default, runPCA() returns the first 50 PCs, but you can change this number by specifying the ncomponents option.
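
For example (shown for illustration only; we keep the default in our analysis), we could compute just the first 20 PCs:

# compute only the first 20 PCs (illustration; we use the default 50 here)
sce <- runPCA(sce, subset_row = hvgs, ncomponents = 20)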

One of the first things to investigate after doing a PCA is how much variance is explained by each PC. This information is stored as an “attribute” (think of it as additional information) attached to the PCA matrix above. The typical way to view this information is using what is known as a “scree plot”.

percent.var <- attr(reducedDim(sce), "percentVar")
plot(percent.var, log="y", xlab="PC", ylab="Variance explained (%)")

We can see that the first two PCs explain a substantial amount of the variance, and very little variation is explained beyond the first 10-15 PCs.

To visualise our cells in the reduced dimension space defined by PC1 and PC2, we can use the plotReducedDim() function.

plotReducedDim(sce, dimred = "PCA", colour_by = "SampleName")

The proximity of cells in this plot reflects the similarity of their expression profiles.

We can also plot several PCs at once, using the ncomponents option:

plotReducedDim(sce, dimred = "PCA", ncomponents = 3, colour_by = "SampleName")

Although these plotting functions are useful for quickly visualising our data, more customised visualisations can be created with the ggcells() function, which extends the regular ggplot() function to work directly from the SCE object. We can use it in much the same way as the regular ggplot() function, except that we can define aesthetics from our reducedDim slot as well as from colData and even assays (to plot a particular gene’s expression). Here is an example, where we facet our plot:

ggcells(sce, aes(x = PCA.1, y = PCA.2, colour = SampleName)) +
  geom_point(size = 0.5) +
  facet_wrap(~ SampleName) +
  labs(x = "PC1", y = "PC2", colour = "Sample")

4.1.2 PCA Diagnostics

There are a large number of potential confounders, artifacts and biases in scRNA-seq data. One of the main challenges stems from the fact that it is difficult to carry out true technical replication to distinguish biological and technical variability. Here we will continue to explore how experimental artifacts can be identified and removed.

One of the ways to achieve this is to calculate the association between our PC scores and different variables associated with our cells such as sample groups, number of detected genes, total reads per cell, percentage of mitochondrial genes, etc. We can achieve this using the getExplanatoryPCs() function (and associated plotting function), which calculates the variance in each PC explained by those variables we choose:

explain_pcs <- getExplanatoryPCs(sce,
                                variables = c("sum",
                                              "detected",
                                              "SampleGroup",
                                              "SampleName",
                                              "subsets_Mito_percent")
                                )

plotExplanatoryPCs(explain_pcs/100)

We can see that PC1 can be explained mostly by individual samples (SampleName), mitochondrial expression and mutation group (SampleGroup).

We can also compute the marginal R2 for each variable when fitting a linear model regressing expression values for each gene against just that variable, and display a density plot of the gene-wise marginal R2 values for the variables.

plotExplanatoryVariables(sce,
                         variables = c(
                           "sum",
                           "detected",
                           "SampleGroup",
                           "SampleName",
                           "subsets_Mito_percent"
                         ))

This analysis indicates that the individual sample (SampleName) and subtype (SampleGroup) have the highest explanatory power for many genes, and we don’t see the technical covariates showing correlations as high. If they did, we might need to repeat the normalisation step while conditioning on these covariates, or include them in downstream analyses.

4.1.3 Choosing the number of PCs

The choice of the number of PCs we retain for downstream analyses is a decision that is analogous to the choice of the number of HVGs to use. Using more PCs will retain more biological signal at the cost of including more noise that might mask said signal. Much like the choice of the number of HVGs, it is hard to determine whether an “optimal” choice exists for the number of PCs. Even if we put aside the technical variation that is almost always uninteresting, there is no straightforward way to automatically determine which aspects of biological variation are relevant; one analyst’s biological signal may be irrelevant noise to another analyst with a different scientific question.

Most practitioners will simply set the number of PCs to a “reasonable” but arbitrary value, typically ranging from 10 to 50. This is often satisfactory as the later PCs explain so little variance that their inclusion or omission has no major effect. For example, in our dataset, few PCs explain more than 1% of the variance in the entire dataset.

table(percent.var > 1)
## 
## FALSE  TRUE 
##    40    10

The most commonly used strategies to choose PCs for downstream analysis include:

  • selecting the top X PCs (with X typically ranging from 10 to 50)
  • using the elbow point in the scree plot
  • using technical noise
  • using permutation

4.1.3.1 Elbow point

To choose the elbow point in our scree plot, we can use the following:

chosen_elbow <- findElbowPoint(percent.var)
chosen_elbow
## [1] 8

Here is our scree plot again, but this time with a vertical line indicating the elbow point:

plot(percent.var)
abline(v=chosen_elbow, col="dodgerblue")

4.1.3.2 Denoising PCA

The assumption of this method is that the biology drives most of the variance and hence should be captured by the first few PCs, while technical noise affects each gene independently, hence it should be captured by later PCs. Therefore, our aim in this approach is to find the minimum number of PCs that explains more variance than the total technical variance across genes (estimated from our mean-variance trend).

This method is implemented in the denoisePCA() function:

sce.denoised <- denoisePCA(sce, technical = gene_var)

If we look at our PCA result, we can see that it has 6 columns.

ncol(reducedDim(sce.denoised, "PCA"))
## [1] 6

4.1.3.3 Permutation

This method is a little more code-intensive. The idea is to permute (or “shuffle”) a subset of the data, rerun the PCA and calculate the percentage of variance explained by each PC on this “random” dataset. We can then compare the observed variance explained in the original data with this null (random) expectation, and determine a cutoff where the observed variance explained drops to values similar to those in the shuffled data.

In such an analysis of these data (for illustration only), we might define a threshold at PC8, since after that the variance explained in the observed data drops below the levels seen in the permuted (randomised) data.
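
Here is a minimal sketch of the idea, using base R (for illustration only; exact results will vary): we shuffle each gene’s values across cells to break gene-gene correlations, rerun the PCA and compare the variance explained with what we observed.

# permute each gene's values across cells to destroy gene-gene correlations
set.seed(123)
expr <- as.matrix(logcounts(sce)[hvgs, ])
shuffled <- t(apply(expr, 1, sample))
# PCA on the permuted data (cells as rows)
pca_null <- prcomp(t(shuffled))
null_var <- 100 * pca_null$sdev^2 / sum(pca_null$sdev^2)
# number of PCs whose observed variance explained exceeds the null expectation
sum(percent.var > null_var[seq_along(percent.var)])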

4.2 t-SNE: t-Distributed Stochastic Neighbor Embedding

The t-Distributed Stochastic Neighbor Embedding (t-SNE) approach addresses the main shortcoming of PCA, which is that it can only capture linear transformations of the original variables (genes). Instead, t-SNE allows for non-linear transformations, while preserving the local structure of the data. This means that neighbourhoods of similar samples (cells, in our case) will appear together in a t-SNE projection, with distinct cell clusters separating from each other.

As you can see below, compared to PCA we get much tighter clusters of samples, with more structure in the data captured by the two t-SNE axes than by the first two principal components.

We will not go into the details of the algorithm here, but briefly it involves two main steps:

  • Calculating a similarity matrix between every pair of samples. This similarity is scaled by a Normal distribution, such that points that are far away from each other are “penalised” with a very low similarity score. The variance of this normal distribution can be thought of as a “neighbourhood” size when computing similarities between cells, and is parameterised by a term called perplexity.

  • Then, samples are projected on a low-dimensional space (usually two dimensions) such that the similarities between the points in this new space are as close as possible to the similarities in the original high-dimensional space. This step involves a stochastic algorithm that “moves” the points around until it converges on a stable solution. In this case, the similarity between samples is scaled by a t-distribution (that’s where the “t” in “t-SNE” comes from), which is used instead of the Normal to guarantee that points within a cluster are still distinguishable from each other in the 2D-plane (the t-distribution has “fatter” tails than the Normal distribution).

Watch this video if you want to learn more about how t-SNE works:

StatQuest: t-SNE

There are two important points to remember:

  • the perplexity parameter, which indicates the relative importance of the local and global patterns in the structure of the data set. The default value in the functions we will use is 50, but different values should be tested to ensure consistency in the results.
  • stochasticity: because the t-SNE algorithm is stochastic, running the analysis multiple times will produce slightly different results each time (unless we set a “seed” for the random number generator).

See this interactive article on “How to Use t-SNE Effectively”, which illustrates how changing these parameters can lead to widely different results.

Importantly, because of the non-linear nature of this algorithm, strong interpretations based on how distant different groups of cells are from each other on a t-SNE plot are discouraged, as they are not necessarily meaningful. This is why it is often the case that the x- and y-axis scales are omitted from these plots (as in the example above), as they are largely uninterpretable. Therefore, the results of a t-SNE projection should be used for visualisation only and not for downstream analysis (such as cell clustering).

4.2.1 Running t-SNE

Similarly to how we did with PCA, there are functions that can run a t-SNE directly on our SingleCellExperiment object. We will leave this exploration for you to do in the following exercises, but the basic code is very similar to that used with PCA. For example, the following would run t-SNE with default options:

sce <- runTSNE(sce)

plotTSNE(sce, colour_by = "SampleName")

One thing to note here is that, because of the computational demands of this algorithm, the general practice is to run it on the PCA results rather than on the full matrix of normalised counts. The reason is that, as we’ve seen earlier, the first few PCs should capture most of the biologically meaningful variance in the data, thus reducing the influence of technical noise in our analysis, as well as substantially reducing the computational time to run the analysis.

Let’s explore more about t-SNE in the following exercise.

See the accompanying worksheet for this session, which includes the code below for the exercise. We want to achieve the following:

  • Add a t-SNE projection of the data to our SCE object
  • Explore how the stochasticity of the algorithm affects the results
  • Customise our visualisation to display gene expression on top of the t-SNE
  • Explore how the main tuneable parameter of the algorithm - perplexity - affects the results
# Run t-SNE ----

# add the t-SNE result to the reducedDim slot of the SCE object
# we name this reducedDim "TSNE_perplex50"
# we set perplexity = 50 (which is the default if we don't specify it)
# we run t-SNE based on the PCA we ran previously
# we will use the first 10 principal components
set.seed(123) # set a random seed to ensure reproducibility
sce <- runTSNE(sce, 
               name = "TSNE_perplex50",
               perplexity = 50, 
               dimred = "PCA",
               n_dimred = 10)

# Make a custom visualisation using ggcells
ggcells(sce, aes(x = TSNE_perplex50.1, y = TSNE_perplex50.2, 
                 colour = SampleName)) +
  geom_point()

# Part A
# Re-run the algorithm but change the random seed number. 
# Do the results change dramatically between runs?
FIXME

# Part B
# Instead of colouring by SampleName, colour by expression of known cell markers
# CD79A (B cells)
# CST3 (monocytes)
# CD3D (T cells) 
# HBA1 (erythrocytes)
FIXME

# Part C
# Facet these plots by SampleName to better understand where each marker is mostly expressed
FIXME

# Part D
# Explore different perplexity values (for example 5 and 500)
# Do you get tighter or looser clusters?
FIXME
Hint A

Look at the set.seed() function

Hint B

You can replace what we colour by with any of the gene names in our dataset as they are stored as the rownames in our object.

Hint C

The function facet_wrap() can be used to modify ggplots as we did earlier.

Hint D

You can replace values in the perplexity argument of the runTSNE() function.

Answer

Here is the complete script:

# Run t-SNE ----

# add the t-SNE result to the reducedDim slot of the SCE object
# we name this reducedDim "TSNE_perplex50"
# we set perplexity = 50 (which is the default if we don't specify it)
# we run t-SNE based on the PCA we ran previously
set.seed(123) # set a random seed to ensure reproducibility
sce <- runTSNE(sce, 
               name = "TSNE_perplex50",
               perplexity = 50, 
               dimred = "PCA",
               n_dimred = 10)

# Make a custom visualisation using ggcells
ggcells(sce, aes(x = TSNE_perplex50.1, y = TSNE_perplex50.2, 
                 colour = SampleName)) +
  geom_point()

# Re-run the algorithm but change the random seed number. 
# Do the results change dramatically between runs?
set.seed(321)
sce <- runTSNE(sce, 
               name = "TSNE_perplex50_seed321",
               perplexity = 50, 
               dimred = "PCA",
               n_dimred = 10)

ggcells(sce, aes(x = TSNE_perplex50_seed321.1, y = TSNE_perplex50_seed321.2, 
                 colour = SampleName)) +
  geom_point()


# Modify the visualisation to colour the points based on logcounts of known cell markers
# CD79A (B cells)
# CST3 (monocytes)
# CD3D (T cells) 
# HBA1 (erythrocytes)
ggcells(sce, aes(x = TSNE_perplex50_seed321.1, y = TSNE_perplex50_seed321.2, 
                 colour = CD79A)) +
  geom_point()

# Facet these plots by SampleName to better understand where each marker is mostly expressed
ggcells(sce, aes(x = TSNE_perplex50_seed321.1, y = TSNE_perplex50_seed321.2, 
                 colour = CD79A)) +
  geom_point() +
  facet_wrap(~ SampleName)

# Explore different perplexity values (for example 5 and 500)
# Do you get tighter or looser clusters?
set.seed(321)
sce <- runTSNE(sce, 
               name = "TSNE_perplex5",
               perplexity = 5, 
               dimred = "PCA",
               n_dimred = 10)
sce <- runTSNE(sce, 
               name = "TSNE_perplex500",
               perplexity = 500, 
               dimred = "PCA",
               n_dimred = 10)

# visualise
ggcells(sce, aes(x = TSNE_perplex5.1, y = TSNE_perplex5.2, 
                 colour = SampleName)) +
  geom_point() +
  labs(title = "Perplexity 5")

ggcells(sce, aes(x = TSNE_perplex50.1, y = TSNE_perplex50.2, 
                 colour = SampleName)) +
  geom_point() +
  labs(title = "Perplexity 50")

ggcells(sce, aes(x = TSNE_perplex500.1, y = TSNE_perplex500.2, 
                 colour = SampleName)) +
  geom_point() +
  labs(title = "Perplexity 500")

Some things to note from our data exploration:

  • Changing the random seed doesn’t seem to qualitatively change the results dramatically. The same groups of cells are still seen. However, their relative positions on the axes might change from run to run.
  • Low perplexities will favour resolution of finer structure, possibly to the point that the visualization is compromised by random noise. Thus, it is advisable to test different perplexity values to ensure that the choice of perplexity does not drive the interpretation of the plot.
  • Exploring the expression of known marker genes (from literature or previous experiments we might have done) may help us interpret these plots and judge whether a particular choice of perplexity is adequate to represent the expected cell types in our data.

4.3 UMAP: Uniform Manifold Approximation and Projection

Similarly to t-SNE, UMAP performs a non-linear transformation of the data to project it down to lower dimensions. One difference from t-SNE is that this method claims to preserve both local and global structure (i.e. the relative positions of clusters are, most of the time, meaningful). However, it is worth mentioning that there is some debate as to whether this is always the case, as explored in this recent paper by Chari, Banerjee and Pachter (2021).

Compared to t-SNE, UMAP visualisations tend to have more compact clusters with more empty space between them, and UMAP attempts to preserve more of the global structure than t-SNE. From a practical perspective, UMAP is much faster than t-SNE, which may be an important consideration for large datasets.

Similarly to t-SNE, since this is a non-linear method of dimensionality reduction, the results of a UMAP projection should be used for visualisation only and not for downstream analysis (such as cell clustering).

4.3.1 Running UMAP

Running UMAP is very similar to what we’ve seen already for PCA and t-SNE, only the function name changes:

set.seed(123)
sce <- runUMAP(sce)

plotUMAP(sce, colour_by = "SampleName")

Because UMAP also involves a series of randomisation steps, setting the random-generator seed (as we did above) is critical if we want to obtain reproducible results across runs.

Like t-SNE, UMAP has its own suite of hyperparameters that affect the visualization. Of these, the number of neighbors (n_neighbors) and the minimum distance between embedded points (min_dist) have the greatest effect on the granularity of the output. If these values are too low, random noise will be incorrectly treated as high-resolution structure, while values that are too high will discard fine structure altogether in favor of obtaining an accurate overview of the entire dataset. Again, it is a good idea to test a range of values for these parameters to ensure that they do not compromise any conclusions drawn from a UMAP plot.
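
As a quick sketch (using the PCA we computed earlier), we could for example increase min_dist, which spreads points out within clusters:

# a larger min_dist spreads points out; smaller values pack them more tightly
set.seed(123)
sce <- runUMAP(sce, name = "UMAP_mindist0.5", dimred = "PCA", min_dist = 0.5)
plotReducedDim(sce, dimred = "UMAP_mindist0.5", colour_by = "SampleName")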

See this interactive article that goes into more depth about the underlying methods, and explores the impact of changing the n_neighbors and min_dist parameters: Understanding UMAP.

Similarly to what we did with t-SNE, we will explore this further in the following exercise.

The following code can be found in the accompanying worksheet for this section. Our main objectives are:

  • Add a UMAP projection of the data to our SCE object
  • Explore how the main tuneable parameter of the algorithm - neighbourhood size - affects the results
  • Compare UMAP with t-SNE in terms of revealing structure in the data
# Run UMAP ----

# Part A
# run the UMAP with 50 neighbours
set.seed(123) # set seed for reproducibility
sce <- runUMAP(sce, 
               name = "UMAP_neighbors50",
               dimred = "PCA",
               FIXME)

# Part B
# visualise the resulting UMAP projection (colour cells by sample)
FIXME

# Part C
# run the UMAP with 5 and 500 neighbours and compare the results
FIXME

# Part D
# compare the UMAP projection with the t-SNE projections 
# would you prefer one over the other?
FIXME
Hint A

Check out the ?runUMAP help page, there should be an argument for fixing the number of nearest neighbours.

Hint B

You can use the same code as for plotting the TSNE but replacing the arguments to display the UMAP output data.

Hint C

Change the number of neighbours argument in the runUMAP() function to see what changes.

Hint D

Redraw the TSNE and compare it with your UMAP.

Answer

Here is the complete script:

# Run UMAP ----

# run the UMAP with 50 neighbours
set.seed(123) # set seed for reproducibility
sce <- runUMAP(sce, 
               name = "UMAP_neighbors50",
               dimred = "PCA",
               n_neighbors = 50)

# visualise the resulting UMAP projection
# colour cells by sample
ggcells(sce, aes(x = UMAP_neighbors50.1, y = UMAP_neighbors50.2, 
                 colour = SampleName)) +
  geom_point()

# run the UMAP with 5 and 500 neighbours and compare the results
set.seed(123) # set seed for reproducibility
sce <- runUMAP(sce, 
               name = "UMAP_neighbors5",
               dimred = "PCA",
               n_neighbors = 5)
sce <- runUMAP(sce, 
               name = "UMAP_neighbors500",
               dimred = "PCA",
               n_neighbors = 500)

ggcells(sce, aes(x = UMAP_neighbors5.1, y = UMAP_neighbors5.2, 
                 colour = SampleName)) +
  geom_point() +
  labs(title = "Neighbours = 5")
ggcells(sce, aes(x = UMAP_neighbors500.1, y = UMAP_neighbors500.2, 
                 colour = SampleName)) +
  geom_point() +
  labs(title = "Neighbours = 500")

# compare the UMAP projection with the t-SNE projections 
# would you prefer one over the other?
ggcells(sce, aes(x = TSNE_perplex50.1, y = TSNE_perplex50.2, 
                 colour = SampleName)) +
  geom_point()
ggcells(sce, aes(x = UMAP_neighbors50.1, y = UMAP_neighbors50.2, 
                 colour = SampleName)) +
  geom_point()

To answer some of those questions:

  • Increasing the number of neighbours used in the analysis results in some apparent groups of cells being “merged” with each other. Conversely, picking too small a size creates many groups of cells, possibly picking up on some noise. However, qualitatively the results are relatively robust to the number of neighbours in this dataset, with the major groups of cells consistently appearing together.
  • Both t-SNE and UMAP show similar patterns in the data, so which one to pick in this case seems to be a matter of personal preference. However, UMAP is a more modern algorithm, and is often considered better at preserving the relative positions of clusters (although, as mentioned above, this is debated).

5 Save SCE object

Optionally, we can save our object, which now contains the results of our dimensionality reduction analysis. This is useful as it will save time the next time we pick up our analysis.

saveRDS(sce, "R_objects/caron_dimRed.Rds")

6 Extra: feature selection and dimensionality reduction with Seurat

Using the Seurat package

Seurat is another very popular scRNA-seq analysis package and provides useful tools for analysing PCA loadings. We will convert our SCE object to a Seurat object to use these tools.

# read our original SCE object
sce <- readRDS("R_objects/Caron_normalized.500.rds")

# convert SCE object to a Seurat object 
library(Seurat)
seurat_obj <- as.Seurat(sce, counts = "counts", data = "logcounts")

# identify the 2000 most highly-variable genes
seurat_hvgs <- FindVariableFeatures(seurat_obj, 
                                    selection.method = "vst", 
                                    nfeatures = 2000)

# scale the data for PCA for the highly variable genes 
seurat_obj <- ScaleData(seurat_obj, features = VariableFeatures(seurat_hvgs))

# run PCA 
seurat_obj <- RunPCA(seurat_obj, features = VariableFeatures(seurat_hvgs), 
                     verbose = FALSE)

# PCA Plot 
DimPlot(seurat_obj, reduction = "pca")

We can see that the PCA is similar to what we obtained before (there are some differences, which are related to how the variable genes were chosen in this case).

To visualise the top most correlated genes with each PC, we can use the following:

VizDimLoadings(seurat_obj, dims=1:2)

This shows the genes that have highest loading (correlation) with each PC axis. If we know something about the biology of the system we are studying, these may give some clues as to what is driving the separation of cells along each PC axis.
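
If we want the underlying numbers, we can also extract the loadings matrix itself (a short sketch):

# extract the gene loadings matrix for the PCA reduction
loadings <- Loadings(seurat_obj, reduction = "pca")
# the 10 genes with the largest absolute loading on PC1
head(sort(abs(loadings[, "PC_1"]), decreasing = TRUE), 10)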

The loadings can also be visualised as a heatmap, showing the cells ordered by the PC scores and the genes ordered by their loading along that PC:

DimHeatmap(seurat_obj, dims = 1:2, cells = 500, balanced = TRUE)

Look at Seurat’s excellent documentation to learn more about this and other uses of this package.

7 Key Points and Resources

Key Points:

  • Dimensionality reduction methods allow us to represent high-dimensional data in lower dimensions, while retaining biological signal.
  • The most common methods used in scRNA-seq analysis are PCA, t-SNE and UMAP.
  • PCA uses a linear transformation of the data, which aims at defining new dimensions (axes) that capture most of the variance observed in the original data. This allows us to reduce the dimensionality of our data from thousands of genes to 10-20 principal components.
  • The results of PCA can be used in downstream analysis such as cell clustering, trajectory analysis and even as input to non-linear dimensionality reduction methods such as t-SNE and UMAP.
  • t-SNE and UMAP are both non-linear methods of dimensionality reduction. They aim at keeping similar cells together and dissimilar clusters of cells apart from each other.
  • Because these methods are non-linear, they should only be used for data visualisation, and not for downstream analysis.


8 Session information

sessionInfo()
## R version 4.2.0 (2022-04-22)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
##  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
##  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] sp_1.5-0                    SeuratObject_4.1.1         
##  [3] Seurat_4.1.1                forcats_0.5.2              
##  [5] stringr_1.4.1               dplyr_1.0.10               
##  [7] purrr_0.3.4                 readr_2.1.2                
##  [9] tidyr_1.2.0                 tibble_3.1.8               
## [11] tidyverse_1.3.2             PCAtools_2.8.0             
## [13] ggrepel_0.9.1               scran_1.24.0               
## [15] scater_1.24.0               ggplot2_3.3.6              
## [17] scuttle_1.6.3               SingleCellExperiment_1.18.0
## [19] SummarizedExperiment_1.26.1 Biobase_2.56.0             
## [21] GenomicRanges_1.48.0        GenomeInfoDb_1.32.3        
## [23] IRanges_2.30.1              S4Vectors_0.34.0           
## [25] BiocGenerics_0.42.0         MatrixGenerics_1.8.1       
## [27] matrixStats_0.62.0         
## 
## loaded via a namespace (and not attached):
##   [1] utf8_1.2.2                reticulate_1.26          
##   [3] tidyselect_1.1.2          htmlwidgets_1.5.4        
##   [5] grid_4.2.0                BiocParallel_1.30.3      
##   [7] Rtsne_0.16                munsell_0.5.0            
##   [9] ScaledMatrix_1.4.0        codetools_0.2-18         
##  [11] ica_1.0-3                 statmod_1.4.37           
##  [13] future_1.27.0             miniUI_0.1.1.1           
##  [15] withr_2.5.0               spatstat.random_2.2-0    
##  [17] colorspace_2.0-3          progressr_0.11.0         
##  [19] highr_0.9                 knitr_1.40               
##  [21] rstudioapi_0.14           ROCR_1.0-11              
##  [23] tensor_1.5                listenv_0.8.0            
##  [25] labeling_0.4.2            GenomeInfoDbData_1.2.8   
##  [27] polyclip_1.10-0           farver_2.1.1             
##  [29] parallelly_1.32.1         vctrs_0.4.1              
##  [31] generics_0.1.3            xfun_0.32                
##  [33] R6_2.5.1                  ggbeeswarm_0.6.0         
##  [35] rsvd_1.0.5                locfit_1.5-9.6           
##  [37] bitops_1.0-7              spatstat.utils_2.3-1     
##  [39] cachem_1.0.6              DelayedArray_0.22.0      
##  [41] assertthat_0.2.1          promises_1.2.0.1         
##  [43] scales_1.2.1              googlesheets4_1.0.1      
##  [45] rgeos_0.5-9               beeswarm_0.4.0           
##  [47] gtable_0.3.1              beachmat_2.12.0          
##  [49] globals_0.16.1            goftest_1.2-3            
##  [51] rlang_1.0.5               splines_4.2.0            
##  [53] lazyeval_0.2.2            gargle_1.2.0             
##  [55] spatstat.geom_2.4-0       broom_1.0.1              
##  [57] yaml_2.3.5                reshape2_1.4.4           
##  [59] abind_1.4-5               modelr_0.1.9             
##  [61] backports_1.4.1           httpuv_1.6.5             
##  [63] tools_4.2.0               ellipsis_0.3.2           
##  [65] spatstat.core_2.4-4       jquerylib_0.1.4          
##  [67] RColorBrewer_1.1-3        ggridges_0.5.3           
##  [69] Rcpp_1.0.9                plyr_1.8.7               
##  [71] sparseMatrixStats_1.8.0   zlibbioc_1.42.0          
##  [73] RCurl_1.98-1.8            rpart_4.1.16             
##  [75] deldir_1.0-6              pbapply_1.5-0            
##  [77] viridis_0.6.2             cowplot_1.1.1            
##  [79] zoo_1.8-10                haven_2.5.1              
##  [81] cluster_2.1.3             fs_1.5.2                 
##  [83] magrittr_2.0.3            data.table_1.14.2        
##  [85] scattermore_0.8           lmtest_0.9-40            
##  [87] reprex_2.0.2              RANN_2.6.1               
##  [89] googledrive_2.0.0         fitdistrplus_1.1-8       
##  [91] hms_1.1.2                 patchwork_1.1.2          
##  [93] mime_0.12                 evaluate_0.16            
##  [95] xtable_1.8-4              readxl_1.4.1             
##  [97] gridExtra_2.3             compiler_4.2.0           
##  [99] KernSmooth_2.23-20        crayon_1.5.1             
## [101] htmltools_0.5.3           mgcv_1.8-40              
## [103] later_1.3.0               tzdb_0.3.0               
## [105] lubridate_1.8.0           DBI_1.1.3                
## [107] dbplyr_2.2.1              MASS_7.3-57              
## [109] Matrix_1.4-1              cli_3.3.0                
## [111] parallel_4.2.0            metapod_1.4.0            
## [113] igraph_1.3.4              pkgconfig_2.0.3          
## [115] plotly_4.10.0             spatstat.sparse_2.1-1    
## [117] xml2_1.3.3                vipor_0.4.5              
## [119] bslib_0.4.0               dqrng_0.3.0              
## [121] XVector_0.36.0            rvest_1.0.3              
## [123] digest_0.6.29             sctransform_0.3.4        
## [125] RcppAnnoy_0.0.19          spatstat.data_2.2-0      
## [127] rmarkdown_2.16            cellranger_1.1.0         
## [129] leiden_0.4.2              uwot_0.1.14              
## [131] edgeR_3.38.4              DelayedMatrixStats_1.18.0
## [133] shiny_1.7.2               nlme_3.1-157             
## [135] lifecycle_1.0.1           jsonlite_1.8.0           
## [137] BiocNeighbors_1.14.0      viridisLite_0.4.1        
## [139] limma_3.52.2              fansi_1.0.3              
## [141] pillar_1.8.1              lattice_0.20-45          
## [143] fastmap_1.1.0             httr_1.4.4               
## [145] survival_3.2-13           glue_1.6.2               
## [147] png_0.1-7                 bluster_1.6.0            
## [149] stringi_1.7.8             sass_0.4.2               
## [151] BiocSingular_1.12.0       irlba_2.3.5              
## [153] future.apply_1.9.0