26.07.2021
Cells are characterized by the expression values of all genes -> thousands of dimensions
Simplify complexity, so it becomes easier to work with (reduce the number of features/genes).
Making clustering step easier
Making visualization easier
Remove redundancies in the data
The expression of many genes is correlated, so we don’t need so many dimensions to distinguish cell types
Identify the most relevant information and overcome the extensive technical noise in scRNA-seq data
Reduce computational time for downstream procedures
PCA (principal component analysis) is a linear algebraic method of dimensionality reduction
Finds principal components (PCs) of the data
Directions where the data is most spread out = where there is most variance
PC1 explains the most variance in the data, followed by PC2, PC3, ...
We will select the most important PCs and use them for clustering cells
Instead of 20,000 genes, we now have maybe 10 PCs
Essentially, each PC represents a robust ‘metagene’ that combines information across a correlated gene set
Prior to PCA we scale the data so that genes have equal weight in downstream analysis and highly expressed genes don’t dominate
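To make the scaling and PCA steps concrete, here is a minimal NumPy sketch. It is not the code behind runPCA; the toy data, the function names `scale_genes` and `pca`, and the choice of 10 PCs are assumptions made for illustration only.

```python
import numpy as np

def scale_genes(expr):
    """Z-score each gene (column): mean 0, standard deviation 1."""
    mean = expr.mean(axis=0)
    std = expr.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # constant genes stay at 0 instead of dividing by zero
    return (expr - mean) / std

def pca(scaled, n_pcs):
    """PCA via SVD of an already centered cells x genes matrix."""
    u, s, vt = np.linalg.svd(scaled, full_matrices=False)
    variances = s ** 2 / (scaled.shape[0] - 1)   # variance along each PC
    scores = u[:, :n_pcs] * s[:n_pcs]            # cells x PCs
    loadings = vt[:n_pcs].T                      # genes x PCs
    var_ratio = variances[:n_pcs] / variances.sum()
    return scores, loadings, var_ratio

# Toy data (made up): 200 cells x 50 genes, two groups of 100 cells
rng = np.random.default_rng(0)
expr = rng.poisson(5, size=(200, 50)).astype(float)
expr[:100, :10] += 10  # genes 0-9 are higher in the first group

scaled = scale_genes(expr)
scores, loadings, var_ratio = pca(scaled, n_pcs=10)

print(var_ratio.round(3))                               # fraction of variance per PC
print(scores[:100, 0].mean(), scores[100:, 0].mean())   # PC1 splits the two groups
```

Each column of `loadings` is one 'metagene': a weighted combination of genes. Clustering would then run on `scores` (200 cells x 10 PCs) instead of the full gene matrix.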
Visualize top genes associated with principal components
Which genes are important for PC1?
Which genes contribute to separating cells?
Both cells and genes are ordered according to their PCA scores. The plot shows the extreme cells at both ends of the spectrum.
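One way to answer "which genes are important for PC1?" is to rank genes by the absolute value of their loading on that PC. The sketch below assumes made-up data and a hypothetical helper `top_genes_for_pc`; it illustrates the idea and is not the plotting function of any particular package.

```python
import numpy as np

def top_genes_for_pc(loadings, gene_names, pc=0, n=5):
    """Genes with the largest absolute loading on one PC, strongest first."""
    order = np.argsort(-np.abs(loadings[:, pc]))[:n]
    return [(gene_names[i], float(loadings[i, pc])) for i in order]

# Toy data (made up): 200 cells x 30 genes
rng = np.random.default_rng(1)
expr = rng.normal(size=(200, 30))
expr[:100, :5] += 4  # genes 0-4 separate the two groups of cells
gene_names = [f"gene{i}" for i in range(30)]

# Scale genes, then take the gene loadings from the SVD
scaled = (expr - expr.mean(axis=0)) / expr.std(axis=0, ddof=1)
_, _, vt = np.linalg.svd(scaled, full_matrices=False)
loadings = vt.T  # genes x PCs

top = top_genes_for_pc(loadings, gene_names, pc=0, n=5)
for name, weight in top:
    print(name, round(weight, 3))
```

The sign of a loading is arbitrary (a PC can be flipped), so genes are ranked by magnitude; genes with opposite signs pull cells toward opposite ends of the PC.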
Gene expression patterns will be captured by PCs -> PCA can separate cell types
Note that PCA can also capture other things, like sequencing depth or cell heterogeneity/complexity!
It is important to select the significant PCs for clustering analysis
Common practices include:
Using an elbow plot
Using an estimate of the technical noise
Trying downstream analysis with different numbers of PCs (10, 20, or even 50)
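The elbow plot is usually read by eye, but the idea can be sketched in code. The heuristic below (largest gap between the variance curve and the straight line joining its first and last points) is only one of several possible choices and is an assumption here, as are the example variances and the helper name `elbow_index`.

```python
import numpy as np

def elbow_index(variances):
    """0-based index of the PC where the variance curve bends the most.

    Heuristic: largest vertical gap between the curve and the straight
    line (chord) joining its first and last points.
    """
    v = np.asarray(variances, dtype=float)
    x = np.arange(v.size)
    chord = v[0] + (v[-1] - v[0]) * x / (v.size - 1)
    return int(np.argmax(chord - v))

# Made-up variance per PC: three strong PCs, then a flat tail
variances = np.array([10, 8, 6, 1, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4])

n_keep = elbow_index(variances)  # PCs before the elbow are kept
print("PCs to keep:", n_keep)
print("variance explained by them:", variances[:n_keep].sum() / variances.sum())
```

On real data the curve is rarely this clean, which is why it is still worth repeating the downstream analysis with a few different numbers of PCs.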
Graph-based, non-linear methods like tSNE and UMAP
PCA, tSNE and UMAP are available as options in most tools
We use PCA for dimension reduction before clustering, and tSNE and UMAP for visualization
tSNE is graph-based
Non-linear
Stochastic
Only local distances are preserved: distances between groups are not meaningful
Can be run on top of PCs
Many parameters to optimize
UMAP is a non-linear, graph-based dimension reduction method, like tSNE
Newer & efficient = fast
Runs on top of PCs
Based on topological structures in multidimensional space
Unlike tSNE, you can compute the structure once and reuse it (the optimization is still stochastic, so set a seed for reproducible layouts)
faster
you could add data points without starting over
Preserves the global structure better than tSNE
Find variable genes: getTopHVGs
Calculate PCA: runPCA
Find optimum number of PCs
Calculate tSNE and UMAP: runTSNE, runUMAP
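As a rough illustration of the first workflow step, the sketch below picks the genes with the highest variance across cells. This is a deliberately simplified stand-in and not what getTopHVGs does internally; the toy data and the helper name `top_variable_genes` are assumptions.

```python
import numpy as np

def top_variable_genes(log_expr, n_top):
    """Column indices of the n_top genes with the highest variance across cells."""
    gene_var = log_expr.var(axis=0, ddof=1)
    top = np.argsort(-gene_var)[:n_top]
    return np.sort(top)

# Toy log-expression matrix (made up): 100 cells x 30 genes
rng = np.random.default_rng(2)
log_expr = rng.normal(scale=0.1, size=(100, 30))
log_expr[:, :5] = rng.normal(scale=2.0, size=(100, 5))  # genes 0-4 vary strongly

hvg = top_variable_genes(log_expr, n_top=5)
subset = log_expr[:, hvg]  # matrix that would be passed on to PCA
print(hvg)
print(subset.shape)
```

Restricting PCA to the variable genes keeps the informative features and drops genes that mostly contribute noise.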
Slides are adapted from Paulo Czarnewski