Feature Selection

Source: Dimensionality Reduction (https://bioconductor.org/books/release/OSCA/dimensionality-reduction.html) chapter in OSCA Book.

We often use scRNA-seq data in exploratory analyses to characterize heterogeneity across cells. Procedures like clustering and dimensionality reduction compare cells based on their gene expression profiles, which involves aggregating per-gene differences into a single (dis)similarity metric between a pair of cells. The choice of genes to use in this calculation has a major impact on the behavior of the metric and the performance of downstream methods. We want to select genes that contain useful information about the biology of the system while removing genes that contain random noise. This aims to preserve interesting biological structure without the variance that obscures that structure, and to reduce the size of the data to improve computational efficiency of later steps.

The simplest approach to feature selection is to select the most variable genes based on their expression across the population. This assumes that genuine biological differences will manifest as increased variation in the affected genes, compared to other genes that are only affected by technical noise or a baseline level of “uninteresting” biological variation (e.g., from transcriptional bursting). Several methods are available to quantify the variation per gene and to select an appropriate set of highly variable genes (HVGs).

Load packages

library(scater) 
library(scran)
library(PCAtools)

Load data

We will load the R file keeping the SCE object with the normalised counts for 500 cells per sample.

sce <- readRDS("~/Course_Materials/scRNAseq/Robjects/caron_postDeconv_5hCellPerSpl.Rds")
sce
## class: SingleCellExperiment 
## dim: 28377 5500 
## metadata(1): Samples
## assays(2): counts logcounts
## rownames(28377): ENSG00000238009 ENSG00000239945 ... ENSG00000278633
##   ENSG00000278817
## rowData names(7): ID Symbol ... detected gene_sparsity
## colnames(5500): 1_AAACGGGCAGTTCATG-1 1_AAACGGGGTTCACCTC-1 ...
##   12_TTTGTCAAGTGGTAGC-1 12_TTTGTCAGTACAGCAG-1
## colData names(17): SampleName Barcode ... sizeFactor cell_sparsity
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):

Quantifying per-gene variation

Some assays allow the inclusion of known molecules in a known amount covering a wide range, from low to high abundance: spike-ins. The technical noise is assessed based on the amount of spike-ins used, the corresponding read counts obtained and their variation across cells. The variance in expression can then be decomposed into the biological and technical components.

UMI-based assays do not (yet?) allow spike-ins. But one can still identify HVGs, that is genes with the highest biological component. Assuming that expression does not vary across cells for most genes, the total variance for these genes mainly reflects technical noise. The latter can thus be assessed by fitting a trend to the variance in expression. The fitted value will be the estimate of the technical component.

Let’s fit a trend to the variance, using modelGeneVar().

dec.sce <- modelGeneVar(sce)

Let’s plot variance against mean of expression (log scale) and the mean-dependent trend fitted to the variance:

var.fit <- metadata(dec.sce)
plot(var.fit$mean, var.fit$var, xlab="Mean of log-expression", ylab="Variance of log-expression")
curve(var.fit$trend(x), col="dodgerblue", add=TRUE, lwd=2)

Selecting highly variable genes

Once we have quantified the per-gene variation, the next step is to select the subset of HVGs to use in downstream analyses. A larger subset will reduce the risk of discarding interesting biological signal by retaining more potentially relevant genes, at the cost of increasing noise from irrelevant genes that might obscure said signal. It is difficult to determine the optimal trade-off for any given application as noise in one context may be useful signal in another. Commonly applied strategies are:

  • take top X genes with largest (biological) variation

Top 1000 genes: getTopHVGs(dec.sce, n=1000)

Top 10% genes: getTopHVGs(dec.sce, prop=0.1)

  • based on significance

getTopHVGs(dec.sce, fdr.threshold = 0.05)

  • keeping all genes above the trend

getTopHVGs(dec.sce, var.threshold = 0)

  • selecting a priori genes of interest

In our example, we will define ‘HVGs’ as top 10% of genes with the highest biological components. This is a fairly arbitrary choise. The common practise is to pick an arbitrary threshold (either based on number of proportion) and proceed with the rest of the analysis, with the intention of testing other choices later, rather than spending much time worrying about obtaining the “optimal” value.

hvgs <- getTopHVGs(dec.sce, prop=0.1)
length(hvgs)
## [1] 1237

HVGs may be driven by outlier cells. So let’s plot the distribution of expression values for the genes with the largest biological components.

First, get gene names to replace ensembl IDs on plot.

# the count matrix rows are named with ensembl gene IDs. Let's label gene with their name instead:
# row indices of genes in rowData(sce)
o <- order(dec.sce$bio, decreasing=TRUE)
chosen.genes.index <- o[1:20]
tmpInd <- which(rowData(sce)$ID %in% rownames(dec.sce)[chosen.genes.index])
# check:
rowData(sce)[tmpInd,c("ID","Symbol")]
## DataFrame with 20 rows and 2 columns
##                              ID      Symbol
##                     <character> <character>
## ENSG00000211592 ENSG00000211592        IGKC
## ENSG00000145335 ENSG00000145335        SNCA
## ENSG00000019582 ENSG00000019582        CD74
## ENSG00000204287 ENSG00000204287     HLA-DRA
## ENSG00000147454 ENSG00000147454    SLC25A37
## ...                         ...         ...
## ENSG00000206172 ENSG00000206172        HBA1
## ENSG00000169877 ENSG00000169877        AHSP
## ENSG00000171223 ENSG00000171223        JUNB
## ENSG00000090013 ENSG00000090013       BLVRB
## ENSG00000158578 ENSG00000158578       ALAS2
# store names:
tmpName <- rowData(sce)[tmpInd,"Symbol"]
# the gene name may not be known, so keep the ensembl gene ID in that case:
tmpName[tmpName==""] <- rowData(sce)[tmpInd,"ID"][tmpName==""]
tmpName[is.na(tmpName)] <- rowData(sce)[tmpInd,"ID"][is.na(tmpName)]
rm(tmpInd)

Now show a violin plot for each gene, using plotExpression() and label genes with their name:

g <- plotExpression(sce, rownames(dec.sce)[chosen.genes.index], 
    point_alpha=0.05, jitter="jitter") 
g <- g + scale_x_discrete(breaks=rownames(dec.sce)[chosen.genes.index],
        labels=tmpName)
g

rm(tmpName)

Dimensionality Reduction

Many scRNA-seq analysis procedures involve comparing cells based on their expression values across multiple genes. For example, clustering aims to identify cells with similar transcriptomic profiles by computing Euclidean distances across genes. In these applications, each individual gene represents a dimension of the data. More intuitively, if we had a scRNA-seq data set with two genes, we could make a two-dimensional plot where each axis represents the expression of one gene and each point in the plot represents a cell. This concept can be extended to data sets with thousands of genes where each cell’s expression profile defines its location in the high-dimensional expression space.

As the name suggests, dimensionality reduction aims to reduce the number of separate dimensions in the data. This is possible because different genes are correlated if they are affected by the same biological process. Thus, we do not need to store separate information for individual genes, but can instead compress multiple features into a single dimension, e.g., an “eigengene” (Langfelder and Horvath 2007). This reduces computational work in downstream analyses like clustering, as calculations only need to be performed for a few dimensions rather than thousands of genes; reduces noise by averaging across multiple genes to obtain a more precise representation of the patterns in the data; and enables effective plotting of the data, for those of us who are not capable of visualizing more than 3 dimensions.

Principal Component Analysis

In a single cell RNA-seq (scRNASeq) data set, each cell is described by the expression level of thoushands of genes.

The total number of genes measured is referred to as dimensionality. Each gene measured is one dimension in the space characterising the data set. Many genes will little vary across cells and thus be uninformative when comparing cells. Also, because some genes will have correlated expression patterns, some information is redundant. Moreover, we can represent data in three dimensions, not more. So reducing the number of useful dimensions is necessary.

The data set: a matrix with one row per sample and one variable per column. Here samples are cells and each variable is the normalised read count for a given gene.

The space: each cell is associated to a point in a multi-dimensional space where each gene is a dimension.

The aim: to find a new set of variables defining a space with fewer dimensions while losing as little information as possible.

Out of a set of variables (read counts), PCA defines new variables called Principal Components (PCs) that best capture the variability observed amongst samples (cells).

The number of variables does not change. Only the fraction of variance captured by each variable differs. The first PC explains the highest proportion of variance possible (bound by prperties of PCA). The second PC explains the highest proportion of variance not explained by the first PC. PCs each explain a decreasing amount of variance not explained by the previous ones. Each PC is a dimension in the new space.

The total amount of variance explained by the first few PCs is usually such that excluding remaining PCs, ie dimensions, loses little information. The stronger the correlation between the initial variables, the stronger the reduction in dimensionality. PCs to keep can be chosen as those capturing at least as much as the average variance per initial variable or using a scree plot, see below.

PCs are linear combinations of the initial variables. PCs represent the same amount of information as the initial set and enable its restoration. The data is not altered. We only look at it in a different way.

About the mapping function from the old to the new space:

  • it is linear
  • it is inverse, to restore the original space
  • it relies on orthogonal PCs so that the total variance remains the same.

Two transformations of the data are necessary:

  • center the data so that the sample mean for each column is 0 so the covariance matrix of the intial matrix takes a simple form
  • scale variance to 1, ie standardize, to avoid PCA loading on variables with large variance.

PCA

Perform PCA, keep outcome in same object. runPCA calculates 50 PCs by default, you can change this number by specifying ncomponents option.

sce <- runPCA(sce,subset_row=hvgs)
sce
## class: SingleCellExperiment 
## dim: 28377 5500 
## metadata(1): Samples
## assays(2): counts logcounts
## rownames(28377): ENSG00000238009 ENSG00000239945 ... ENSG00000278633
##   ENSG00000278817
## rowData names(7): ID Symbol ... detected gene_sparsity
## colnames(5500): 1_AAACGGGCAGTTCATG-1 1_AAACGGGGTTCACCTC-1 ...
##   12_TTTGTCAAGTGGTAGC-1 12_TTTGTCAGTACAGCAG-1
## colData names(17): SampleName Barcode ... sizeFactor cell_sparsity
## reducedDimNames(1): PCA
## mainExpName: NULL
## altExpNames(0):

Percentage of variance explained by successive PCs

percent.var <- attr(reducedDim(sce), "percentVar")
plot(percent.var, xlab="PC", ylab="Variance explained (%)")

Display cells on a plot for the first 2 PCs, colouring by ‘Sample’ and setting size to match ‘total_features’.

The proximity of cells reflects the similarity of their expression profiles.

g <- plotPCA(sce,colour_by = "SampleName")         
g

One can also split the plot by sample.

g +  facet_wrap(~ sce$SampleName)

Or plot several PCs at once, using plotReducedDim():

plotReducedDim(sce, dimred="PCA", ncomponents=3, colour_by = "SampleName")

PCA Diagnostics

There is a large number of potential confounders, artifacts and biases in scRNA-seq data. One of the main challenges in analysing scRNA-seq data stems from the fact that it is difficult to carry out a true technical replication to distinguish biological and technical variability. Here we will continue to explore how experimental artifacts can be identified and removed.

The plot below shows, for each of the first 10 PCs, the variance explained by the ten variables in colData(sce) that are most strongly associated with the PCs.

colData(sce)$SampleGroup <- factor(colData(sce)$SampleGroup)

explanPc <- getExplanatoryPCs(sce,
    variables = c(
        "sum",
        "detected",
        "SampleGroup",
        "SampleName",
        "subsets_Mito_percent"
    )
)

plotExplanatoryPCs(explanPc/100) 

We can see that PC1 can be explained mostly by SampleName, SampleGroups, and Mitochondrial expression.

plotExplanatoryVariables(
    sce,
    variables = c(
        "sum",
        "detected",
        "SampleGroup",
        "SampleName",
        "subsets_Mito_percent"
    )
)
## Warning: Removed 23530 rows containing non-finite values (stat_density).

We can also compute the marginal R2 for each variable when fitting a linear model regressing expression values for each gene against just that variable, and display a density plot of the gene-wise marginal R2 values for the variables.

This analysis indicates that individual and subtype have the highest explanatory power for many genes, and we don’t see technical covariates having high correlations. If that were the case, we might need to repeat the normalization step while conditioning out for these covariates, or we would include them in downstream analysis.

Chosing the number of PCs

How many of the top PCs should we retain for downstream analyses? The choice of the number of PCs is a decision that is analogous to the choice of the number of HVGs to use. Using more PCs will retain more biological signal at the cost of including more noise that might mask said signal. Much like the choice of the number of HVGs, it is hard to determine whether an “optimal” choice exists for the number of PCs. Even if we put aside the technical variation that is almost always uninteresting, there is no straightforward way to automatically determine which aspects of biological variation are relevant; one analyst’s biological signal may be irrelevant noise to another analyst with a different scientific question.

Most practitioners will simply set the number of PCs to a “reasonable” but arbitrary value, typically ranging from 10 to 50. This is often satisfactory as the later PCs explain so little variance that their inclusion or omission has no major effect. For example, in our dataset, few PCs explain more than 1% of the variance in the entire dataset.

table(percent.var>1)
## 
## FALSE  TRUE 
##    41     9

Most commonly used strategies include: * selecting top X PCs (with X typically ranging from 10 to 50)

  • using the elbow point in the scree plot
percent.var <- attr(reducedDim(sce), "percentVar")
chosen.elbow <- findElbowPoint(percent.var)
chosen.elbow
## [1] 6
plot(percent.var, xlab="PC", ylab="Variance explained (%)")
abline(v=chosen.elbow, col="red")

  • using technical noise Here we aim to find PCs linked to biological variation. The assumption is that the biology drives most of the variance hence should be captured by the first PCs, while technical noise affects each gene independently, hence is captured by later PCs. Compute the sum of the technical component across genes used in the PCA, use it as the amount of variance not related to biology and that we should therefore remove. Later PCs are excluded until the amount of variance they account for matches that corresponding to the technical component.
var.fit <- metadata(dec.sce)
sce <- denoisePCA(sce, technical=var.fit$trend, assay.type="logcounts")
## Warning in .process_subset_for_pca(subset.row, x): 'subset.row=' is typically used to specify HVGs for PCA. If the use of
## all genes is intentional, suppress this message with 'subset.row=NULL'.
ncol(reducedDim(sce))
## [1] 7
  • using permutation (permute a subset of the data, rerun PCA, construct a null distribution of feature scores, and repeat)

t-SNE: t-Distributed Stochastic Neighbor Embedding

The Stochastic Neighbor Embedding (SNE) approach address two shortcomings of PCA that captures the global covariance structure with a linear combination of initial variables: by preserving the local structure allowing for non-linear projections. It uses two distributions of the pairwise similarities between data points: in the input data set and in the low-dimensional space.

SNE aims at preserving neighbourhoods. For each points, it computes probabilities of chosing each other point as its neighbour based on a Normal distribution depending on 1) the distance matrix and 2) the size of the neighbourhood (perplexity). SNE aims at finding a low-dimension space (eg 2D-plane) such that the similarity matrix deriving from it is as similar as possible as that from the high-dimension space. To address the fact that in low dimension, points are brought together, the similarity matrix in the low-dimension is allowed to follow a t-distribution.

Two characteristics matter:

  • perplexity, to indicate the relative importance of the local and global patterns in structure of the data set, usually use a value of 50,
  • stochasticity; running the analysis will produce a different map every time, unless the seed is set.

See misread-tsne.

Perplexity

Compute t-SNE with default perplexity, ie 50.

# runTSNE default perpexity if min(50, floor(ncol(object)/5))
sce <- runTSNE(sce, dimred="PCA",perplexity=50, rand_seed=123)

Plot t-SNE:

tsne50 <- plotTSNE(sce, colour_by="SampleName",size_by="sum") + ggtitle("Perplexity = 50")
tsne50

Split by sample type:

tsne50 + facet_wrap(~ sce$SampleGroup)

Compute t-SNE for several perplexity values:

tsne5.run <- runTSNE(sce, dimred="PCA", perplexity=5, rand_seed=123)
tsne5 <- plotTSNE(tsne5.run, colour_by="SampleName") + ggtitle("Perplexity = 5")

tsne500.run <- runTSNE(sce, dimred="PCA", perplexity=500, rand_seed=123)
tsne500 <- plotTSNE(tsne500.run, colour_by="SampleName") + ggtitle("Perplexity = 500")

Low perplexities will favor resolution of finer structure, possibly to the point that the visualization is compromised by random noise. Thus, it is advisable to test different perplexity values to ensure that the choice of perplexity does not drive the interpretation of the plot.

tsne5

tsne50

tsne500

UMAP

Another neighbour graph method. Similar to t-SNE, but that is determistic, faster and claims to preserve both local and global structures.

Compute UMAP.

set.seed(123)
sce <- runUMAP(sce, dimred="PCA")

Plot UMAP

Compared to t-SNE, the UMAP visualization tends to have more compact visual clusters with more empty space between them. It also attempts to preserve more of the global structure than t -SNE. From a practical perspective, UMAP is much faster than t-SNE, which may be an important consideration for large datasets. (Nonetheless, we have still run UMAP on the top PCs here for consistency.) UMAP also involves a series of randomization steps so setting the seed is critical.

Like t-SNE, UMAP has its own suite of hyperparameters that affect the visualization. Of these, the number of neighbors (n_neighbors) and the minimum distance between embedded points (min_dist) have the greatest effect on the granularity of the output. If these values are too low, random noise will be incorrectly treated as high-resolution structure, while values that are too high will discard fine structure altogether in favor of obtaining an accurate overview of the entire dataset. Again, it is a good idea to test a range of values for these parameters to ensure that they do not compromise any conclusions drawn from a UMAP plot.

sce.umap <- plotUMAP(sce, colour_by="SampleName", size_by="sum") + ggtitle("UMAP")
sce.umap

Split by sample:

sce.umap + facet_wrap(~ sce$SampleGroup)

Save SCE object:

# save sce object to which we have added dimensionality reduction slots:
saveRDS(sce, "~/Course_Materials/scRNAseq/Robjects/caron_postDeconv_5hCellPerSpl_dimRed.Rds")

Extra Material

Session information

sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C         LC_TIME=C           
##  [4] LC_COLLATE=C         LC_MONETARY=C        LC_MESSAGES=C       
##  [7] LC_PAPER=C           LC_NAME=C            LC_ADDRESS=C        
## [10] LC_TELEPHONE=C       LC_MEASUREMENT=C     LC_IDENTIFICATION=C 
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] PCAtools_2.4.0              ggrepel_0.9.1              
##  [3] scran_1.20.1                scater_1.20.1              
##  [5] ggplot2_3.3.5               scuttle_1.2.0              
##  [7] SingleCellExperiment_1.14.1 SummarizedExperiment_1.22.0
##  [9] Biobase_2.52.0              GenomicRanges_1.44.0       
## [11] GenomeInfoDb_1.28.1         IRanges_2.26.0             
## [13] S4Vectors_0.30.0            BiocGenerics_0.38.0        
## [15] MatrixGenerics_1.4.0        matrixStats_0.59.0         
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-7              RcppAnnoy_0.0.18         
##  [3] tools_4.1.0               bslib_0.2.5.1            
##  [5] utf8_1.2.1                R6_2.5.0                 
##  [7] irlba_2.3.3               vipor_0.4.5              
##  [9] uwot_0.1.10               DBI_1.1.1                
## [11] colorspace_2.0-2          withr_2.4.2              
## [13] tidyselect_1.1.1          gridExtra_2.3            
## [15] compiler_4.1.0            BiocNeighbors_1.10.0     
## [17] DelayedArray_0.18.0       labeling_0.4.2           
## [19] sass_0.4.0                scales_1.1.1             
## [21] stringr_1.4.0             digest_0.6.27            
## [23] rmarkdown_2.9             XVector_0.32.0           
## [25] pkgconfig_2.0.3           htmltools_0.5.1.1        
## [27] sparseMatrixStats_1.4.0   highr_0.9                
## [29] limma_3.48.1              rlang_0.4.11             
## [31] DelayedMatrixStats_1.14.0 farver_2.1.0             
## [33] jquerylib_0.1.4           generics_0.1.0           
## [35] jsonlite_1.7.2            BiocParallel_1.26.1      
## [37] dplyr_1.0.7               RCurl_1.98-1.3           
## [39] magrittr_2.0.1            BiocSingular_1.8.1       
## [41] GenomeInfoDbData_1.2.6    Matrix_1.3-4             
## [43] Rcpp_1.0.7                ggbeeswarm_0.6.0         
## [45] munsell_0.5.0             fansi_0.5.0              
## [47] viridis_0.6.1             lifecycle_1.0.0          
## [49] stringi_1.7.3             yaml_2.2.1               
## [51] edgeR_3.34.0              zlibbioc_1.38.0          
## [53] plyr_1.8.6                grid_4.1.0               
## [55] dqrng_0.3.0               crayon_1.4.1             
## [57] lattice_0.20-44           cowplot_1.1.1            
## [59] beachmat_2.8.0            locfit_1.5-9.4           
## [61] metapod_1.0.0             knitr_1.33               
## [63] pillar_1.6.1              igraph_1.2.6             
## [65] codetools_0.2-18          reshape2_1.4.4           
## [67] ScaledMatrix_1.0.0        glue_1.4.2               
## [69] evaluate_0.14             vctrs_0.3.8              
## [71] gtable_0.3.0              purrr_0.3.4              
## [73] assertthat_0.2.1          xfun_0.24                
## [75] rsvd_1.0.5                RSpectra_0.16-0          
## [77] viridisLite_0.4.0         tibble_3.1.2             
## [79] beeswarm_0.4.0            cluster_2.1.2            
## [81] bluster_1.2.1             statmod_1.4.36           
## [83] ellipsis_0.3.2

Extra Material

Seurat Analysis Seurat is another single cell analysis package and provides useful tools for analysing PCA loadings. We will convert our SCE object to a Seurat object to use these tools.

library(org.Hs.eg.db)
## Loading required package: AnnotationDbi
## 
symb <- mapIds(org.Hs.eg.db, keys=rownames(sce), keytype="ENSEMBL", column="SYMBOL")
## 'select()' returned 1:many mapping between keys and columns
# we want to use gene symbols instead of Ensembl gene IDs 
# this command will enable that
hvgs_symbols<-rowData(sce)[hvgs,"Symbol"]
rownames(sce)<-uniquifyFeatureNames(rownames(sce), symb)

# we convert our SCE object to a Seurat object 
library(Seurat)
## Attaching SeuratObject
## 
## Attaching package: 'Seurat'
## The following object is masked from 'package:SummarizedExperiment':
## 
##     Assays
seurat_obj<-as.Seurat(sce, counts="counts", data="logcounts")
## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')

## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')
## Warning: Keys should be one or more alphanumeric characters followed by an
## underscore, setting key from PC to PC_
## Warning: All keys should be one or more alphanumeric characters followed by an
## underscore '_', setting key to PC_
# scale the data for PCA for the highly variable genes 
seurat_obj <- ScaleData(seurat_obj, features = hvgs_symbols)
## Centering and scaling data matrix
# run PCA 
seurat_obj <- RunPCA(seurat_obj, features = hvgs_symbols, verbose = F)
## Warning: Cannot add objects with duplicate keys (offending key: PC_), setting
## key to 'pca_'
# PCA Plot 
DimPlot(seurat_obj,reduction = "pca")

VizDimLoadings(seurat_obj,dims=1:2)

DimHeatmap(seurat_obj, dims = 1, cells = 500, balanced = TRUE)

DimHeatmap(seurat_obj, dims = 1:6, cells = 500, balanced = TRUE)

<>