Feature Selection and Dimensionality Reduction

26.07.2021

Single Cell RNAseq Analysis Workflow

Cells are characterized by the expression values of all genes -> thousands of dimensions
Simplify complexity, so it becomes easier to work with (reduce the number of features/genes).
- Making clustering step easier
- Making visualization easier
Remove redundancies in the data
- Expression of many genes is correlated, so we don’t need so many dimensions to distinguish cell types
- Identify the most relevant information and overcome the extensive technical noise in scRNA-seq data
Reduce computational time for downstream procedures
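
One simple way to reduce the number of genes is to keep only those that vary most across cells. The sketch below assumes a plain variance ranking on a made-up toy matrix; the function name `top_variable_genes` and all numbers are illustrative, and real workflows usually use more careful, mean-adjusted measures of variability.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy matrix: 200 cells x 1000 genes (log-normalized expression).
n_cells, n_genes = 200, 1000
expr = rng.normal(loc=0.0, scale=0.1, size=(n_cells, n_genes))
# Assumption for illustration: only the first 50 genes truly vary between cells.
expr[:, :50] += rng.normal(scale=2.0, size=(n_cells, 50))


def top_variable_genes(matrix, n_top):
    """Return column indices of the n_top genes with the highest variance."""
    variances = matrix.var(axis=0)
    return np.argsort(variances)[::-1][:n_top]


selected = top_variable_genes(expr, n_top=50)
reduced = expr[:, selected]  # cells x selected genes
print(reduced.shape)
```

Downstream steps would then work on `reduced` instead of the full matrix.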

PCA (principal component analysis) is a linear algebraic method of dimensionality reduction
Finds principal components (PCs) of the data
- Directions where the data is most spread out = where there is most variance
- PC1 explains most of the variance in the data, then PC2, PC3, ...
We will select the most important PCs and use them for clustering cells
- Instead of 20,000 genes we now have maybe 10 PCs
- Essentially, each PC represents a robust ‘metagene’ that combines information across a correlated gene set
Prior to PCA we scale the data so that genes have equal weight in downstream analysis and highly expressed genes don’t dominate
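
The scaling and PCA steps above can be sketched with NumPy alone. This is a minimal illustration on simulated data, not the implementation used by any particular tool; the helper names `scale_genes` and `pca` and the toy matrix are assumptions.

```python
import numpy as np


def scale_genes(matrix):
    """Center each gene (column) to mean 0 and scale it to unit variance."""
    mean = matrix.mean(axis=0)
    std = matrix.std(axis=0)
    std[std == 0] = 1.0  # leave constant genes at 0 instead of dividing by 0
    return (matrix - mean) / std


def pca(matrix, n_pcs):
    """PCA via SVD of the scaled cells x genes matrix.

    Returns cell scores, gene loadings and the fraction of variance
    explained by each of the first n_pcs components.
    """
    scaled = scale_genes(matrix)
    u, s, vt = np.linalg.svd(scaled, full_matrices=False)
    variance = s ** 2
    ratio = variance / variance.sum()
    scores = u[:, :n_pcs] * s[:n_pcs]  # coordinates of cells on the PCs
    loadings = vt[:n_pcs].T            # weight of each gene in each PC
    return scores, loadings, ratio[:n_pcs]


# Hypothetical toy data: 100 cells x 30 genes, two groups of cells that
# differ in the first 10 genes.
rng = np.random.default_rng(1)
group = np.repeat([0, 1], 50)
expr = rng.normal(size=(100, 30))
expr[:, :10] += 3.0 * group[:, None]

scores, loadings, ratio = pca(expr, n_pcs=5)
print(scores.shape, loadings.shape)
print(np.round(ratio, 3))  # PC1 first, then PC2, PC3, ...
```

Each column of `loadings` plays the role of a 'metagene': a weighted combination of correlated genes.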

Visualize top genes associated with principal components

Which genes are important for PC1 ?

Which genes contribute to separating cells?

Both cells and genes are ordered according to their PCA scores. The plot shows the extreme cells at both ends of the spectrum.
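
To make the questions above concrete, here is a small sketch that ranks genes by their PC1 loading and picks the cells at both extremes of PC1. The gene names, group structure and the number of genes and cells shown are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical toy data: 80 cells x 20 genes; gene0..gene3 are higher in group 1.
gene_names = np.array([f"gene{i}" for i in range(20)])
group = np.repeat([0, 1], 40)
expr = rng.normal(size=(80, 20))
expr[:, :4] += 4.0 * group[:, None]

# Scale genes, then PCA via SVD.
scaled = (expr - expr.mean(axis=0)) / expr.std(axis=0)
u, s, vt = np.linalg.svd(scaled, full_matrices=False)
pc1_loadings = vt[0]         # one weight per gene
pc1_scores = u[:, 0] * s[0]  # one coordinate per cell

# Genes most strongly associated with PC1 (largest absolute loading).
top_genes = gene_names[np.argsort(np.abs(pc1_loadings))[::-1][:4]]

# Cells ordered along PC1; keep the five cells at each end of the spectrum.
cell_order = np.argsort(pc1_scores)
low_end, high_end = cell_order[:5], cell_order[-5:]

print(sorted(top_genes.tolist()))
print(group[low_end], group[high_end])
```

A heatmap of `scaled[np.concatenate([low_end, high_end])][:, :4]` would correspond to the kind of plot described above.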

Gene expression patterns will be captured by PCs -> PCA can separate cell types

Note that PCA can also capture other things, like sequencing depth or cell heterogeneity/complexity!

It is important to select the significant PCs for clustering analysis
However, estimating the true dimensionality of a dataset is challenging
Common practices include:
- Using an elbow plot
- Using technical noise
- Trying downstream analysis with different numbers of PCs (10, 20, or even 50)
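
The elbow idea can be sketched numerically: compute the variance explained by each PC and look for the point where it drops off. The rule used below (largest drop between consecutive PCs) is only one crude heuristic, chosen here for illustration, and the toy data with three gene programs is an assumption; in practice the elbow is usually judged by eye from the plot.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical toy data: 300 cells x 60 genes with three independent gene
# programs (10 genes each) and 30 pure-noise genes.
n_cells = 300
expr = rng.normal(size=(n_cells, 60))
for block in range(3):
    factor = rng.normal(scale=2.0, size=(n_cells, 1))
    expr[:, block * 10:(block + 1) * 10] += factor

# Scale genes, then get the variance explained by each PC from the SVD.
scaled = (expr - expr.mean(axis=0)) / expr.std(axis=0)
singular_values = np.linalg.svd(scaled, compute_uv=False)
ratio = singular_values ** 2 / np.sum(singular_values ** 2)

# Crude elbow heuristic: keep the PCs before the largest drop in variance.
drops = ratio[:-1] - ratio[1:]
n_pcs = int(np.argmax(drops)) + 1

print(np.round(ratio[:6], 3))
print("PCs kept:", n_pcs)
```

Plotting `ratio` against the PC index gives the elbow plot itself.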

Graph-based, non-linear methods like tSNE and UMAP

PCA, tSNE and UMAP are available as options in most tools

We use PCA for dimension reduction before clustering, and tSNE and UMAP for visualization

tSNE is graph-based

Non-linear

Stochastic

(Only) local distances are preserved: distances between groups are not meaningful

Can be run on top of PCs

Many parameters to optimize
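
A minimal sketch of running tSNE on top of PCs, assuming scikit-learn is installed and using its `PCA` and `TSNE` classes. The toy data, the choice of 10 PCs, the perplexity of 30 and the helper `run_tsne` are illustrative assumptions, not recommendations. Random initialization is requested explicitly so that the stochastic behaviour is visible.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(4)

# Hypothetical toy data: 150 cells x 100 genes, three groups of cells.
group = np.repeat([0, 1, 2], 50)
expr = rng.normal(size=(150, 100))
expr[group == 1, :20] += 3.0
expr[group == 2, 20:40] += 3.0

# Step 1: reduce to a handful of PCs.
pcs = PCA(n_components=10).fit_transform(expr)


def run_tsne(data, seed):
    """Embed the PCs in 2D; the result depends on the random seed."""
    tsne = TSNE(n_components=2, perplexity=30, init="random", random_state=seed)
    return tsne.fit_transform(data)


# Step 2: tSNE on the PCs, twice with different seeds.
emb_a = run_tsne(pcs, seed=0)
emb_b = run_tsne(pcs, seed=1)

print(emb_a.shape)
print("identical embeddings:", np.allclose(emb_a, emb_b))
```

Perplexity is one of the parameters mentioned above; changing it, or the seed, can change how the plot looks, which is why positions of groups relative to each other should not be over-interpreted.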

UMAP is a non-linear graph-based dimension reduction method, like tSNE
Newer & efficient = fast
Runs on top of PCs
Based on topological structures in multidimensional space
Unlike tSNE, you can compute the structure once (no randomization)
- faster
- you could add data points without starting over
Preserves the global structure better than tSNE

Slides are adapted from Paulo Czarnewski