September 2022
In single-cell data we typically have thousands of genes across thousands (or millions!) of cells.
Solution: collapse the number of dimensions to a more manageable number, while preserving information.
Select genes that capture biologically meaningful variation, while reducing the number of genes that contribute only technical noise.
PCA (principal component analysis) is a linear-algebraic method of dimensionality reduction
It finds the principal components (PCs) of the data
When data is very high-dimensional, we can select only the most important PCs and use them for downstream analysis (e.g. clustering cells)
This reduces the dimensionality of the data from ~20,000 genes to maybe 20-50 PCs
Each PC represents a robust ‘metagene’ that combines information across a correlated gene set
Prior to PCA we scale the data so that genes have equal weight in downstream analysis and highly expressed genes don’t dominate
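The scaling and PCA steps above can be sketched with NumPy alone. This is a minimal illustration on simulated counts, not the pipeline used in the slides: the matrix sizes, the log1p transform, and the helper names `scale_genes` and `pca` are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated count matrix (assumption): 200 cells x 50 genes,
# each gene with its own mean expression level.
n_cells, n_genes = 200, 50
gene_means = rng.uniform(0.5, 20.0, size=n_genes)
counts = rng.poisson(lam=gene_means, size=(n_cells, n_genes)).astype(float)


def scale_genes(X):
    """Centre each gene to mean 0 and scale it to unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # leave constant genes at zero instead of dividing by 0
    return (X - mean) / std


def pca(X_scaled, n_pcs):
    """PCA of an already centred matrix via singular value decomposition."""
    U, S, Vt = np.linalg.svd(X_scaled, full_matrices=False)
    scores = U[:, :n_pcs] * S[:n_pcs]      # cell coordinates on each PC
    loadings = Vt[:n_pcs].T                # gene weights: the 'metagenes'
    variances = S**2 / (X_scaled.shape[0] - 1)
    return scores, loadings, variances


# log1p is a common (assumed) transform before scaling count data.
X_scaled = scale_genes(np.log1p(counts))
scores, loadings, variances = pca(X_scaled, n_pcs=10)

print(scores.shape)    # cells x retained PCs
print(loadings.shape)  # genes x retained PCs
```

Each column of `loadings` holds the gene weights of one PC, and `scores` is the reduced representation that downstream steps such as clustering would use.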
After PCA we are still left with as many dimensions in our data as we started with
But successive principal components capture progressively less variation in the data
How do we select the number of PCs to retain for downstream analysis?
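One simple way to approach this question is to look at the fraction of variance each PC explains and keep PCs until a cumulative threshold is reached. The sketch below is only one possible heuristic: the 0.8 threshold, the simulated data, and the helper names `explained_variance_ratio` and `n_pcs_for_threshold` are assumptions for illustration, and in practice an elbow plot of the same ratios is often inspected by eye instead.

```python
import numpy as np


def explained_variance_ratio(X_centered):
    """Fraction of total variance captured by each PC of a centred matrix."""
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    variances = singular_values**2
    return variances / variances.sum()


def n_pcs_for_threshold(ratio, threshold):
    """Smallest number of leading PCs whose cumulative variance reaches threshold."""
    cumulative = np.cumsum(ratio)
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(ratio))  # guard against floating-point shortfall at 1.0


# Simulated data (assumption): 3 strong latent factors plus noise, 40 features.
rng = np.random.default_rng(1)
latent = rng.normal(size=(300, 3))
weights = rng.normal(size=(3, 40))
X = 3.0 * latent @ weights + rng.normal(size=(300, 40))
X = X - X.mean(axis=0)

ratio = explained_variance_ratio(X)
k = n_pcs_for_threshold(ratio, threshold=0.8)
print("variance explained by first 5 PCs:", np.round(ratio[:5], 3))
print("PCs needed to reach 80% of the variance:", k)
```

Plotting `ratio` against the PC index gives the usual elbow (scree) plot, and the point where the curve flattens is another common place to cut.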
Because PC1 and PC2 capture more of the variance than any other PCs, it is common to visualise the data projected onto those two new dimensions.
Gene expression patterns will be captured by PCs → PCA can separate cell types
Note that PCA can also capture other things, like sequencing depth or cell heterogeneity/complexity!
However, PC1 + PC2 are usually not enough to visualise all the diversity of cell types in single-cell data (usually we need to use PC3, PC4, etc…) → not so good for visualisation, so…
Graph-based, non-linear methods: UMAP and t-SNE
These methods can run on the output of the PCA, which speeds their computation and can make the results more robust to noise
t-SNE and UMAP should only be used for visualisation, not as input for downstream analysis
t-SNE has a stochastic step (results vary every time you run it)
Only local distances are preserved, while distances between groups are not always meaningful
Some parameters dramatically affect the resulting projection (in particular “perplexity”)
Learn more about how t-SNE works from this video: StatQuest: t-SNE, Clearly Explained
Main parameter in t-SNE is the perplexity (~ number of neighbours each point is “attracted” to)
Exploring different perplexity values to find those that best represent the biological diversity of cells is recommended.
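As a rough illustration of exploring perplexity, the sketch below runs scikit-learn's `TSNE` on simulated PC scores at several perplexity values. The data (three artificial clusters in 20 dimensions), the perplexity values, and the seed are assumptions chosen for the example. Fixing `random_state` is what makes a given run repeatable despite the stochastic step.

```python
import numpy as np
from sklearn.manifold import TSNE

# Simulated PC scores (assumption): 3 clusters of 50 cells in 20 PCs.
rng = np.random.default_rng(2)
centers = rng.normal(scale=6.0, size=(3, 20))
pcs = np.vstack([c + rng.normal(size=(50, 20)) for c in centers])

embeddings = {}
for perplexity in (5, 30, 50):
    # perplexity must be smaller than the number of cells (150 here)
    tsne = TSNE(
        n_components=2,
        perplexity=perplexity,
        init="pca",
        random_state=0,  # fixed seed so the stochastic step is repeatable
    )
    embeddings[perplexity] = tsne.fit_transform(pcs)

for perplexity, emb in embeddings.items():
    print(perplexity, emb.shape)
```

Plotting the three embeddings side by side, coloured by known cell labels where available, is one way to judge which perplexity gives the most interpretable picture.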
UMAP is a non-linear, graph-based dimensionality reduction method, like t-SNE
Newer and more efficient, so it is fast
Runs on top of PCs
Based on topological structures in multidimensional space
Faster and less computationally intensive than t-SNE
Preserves the global structure better than t-SNE
The main parameter in UMAP is n_neighbors (the number of neighbours used to construct the initial graph).
Another common parameter is min_dist (the minimum distance between points in the embedding).
n_neighbors is the main one to tune, although playing with both parameters can be beneficial.
Exploring different numbers of neighbours to find those that best represent the biological diversity of cells is recommended.
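To make concrete what the number of neighbours controls, the sketch below builds a plain k-nearest-neighbour index on simulated PC scores for several values of k. This is not UMAP itself, only an illustration of the neighbour search that a neighbour graph starts from. The helper name `knn_indices`, the brute-force distance computation, and the data are assumptions for the example.

```python
import numpy as np


def knn_indices(points, n_neighbors):
    """Indices of each point's nearest neighbours (Euclidean, excluding itself)."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff**2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a point is never its own neighbour
    return np.argsort(dist, axis=1)[:, :n_neighbors]


# Simulated PC scores (assumption): 100 cells in 10 PCs.
rng = np.random.default_rng(3)
pcs = rng.normal(size=(100, 10))

graphs = {k: knn_indices(pcs, k) for k in (5, 15, 50)}
for k, idx in graphs.items():
    # idx[i] lists the k cells that cell i is connected to
    print(f"n_neighbors={k}: {idx.shape[0]} cells, {idx.size} directed edges")
```

A small k links each cell only to its closest neighbours and emphasises local structure, while a larger k connects each cell to more of the dataset and gives a broader view.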
Slides are adapted from Paulo Czarnewski and Zeynep Kalender-Atak
References (image sources):