Acknowledgments: much of the material in this section hase been derived from the chapters on normalisation in the OSCA book and the Hemberg Group course materials.
library(scater)
library(scran)
library(sctransform)
library(tidyverse)
library(BiocParallel)
library(patchwork)
bpp <- MulticoreParam(7)
For the purposes of this demonstration we have generated a smaller data set in which there are only 500 cells per sample. This is so that the code can be run in a reasonable amount of time during the live teaching session. The data were first QC’d and filtered as described in the QC and exploratory analysis session. After this 500 cells were selected at random from each sample.
sce <- readRDS("R_objects/Caron_filtered.500.rds")
sce
## class: SingleCellExperiment
## dim: 29346 5500
## metadata(1): Samples
## assays(1): counts
## rownames(29346): ENSG00000243485 ENSG00000238009 ... ENSG00000275063
## ENSG00000278817
## rowData names(4): ID Symbol Type Chromosome
## colnames(5500): 1_CGACTTCGTCCAGTTA-1 1_AGAATAGCATACGCTA-1 ...
## 8_AGCGTCGAGATGTAAC-1 8_CATGACAAGATAGGAG-1
## colData names(10): Sample Barcode ... subsets_Mito_percent total
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
table(sce$SampleName)
##
## ETV6-RUNX1_1 ETV6-RUNX1_2 ETV6-RUNX1_3 ETV6-RUNX1_4 HHD_1 HHD_2
## 500 500 500 500 500 500
## PBMMC_1 PBMMC_2 PBMMC_3 PRE-T_1 PRE-T_2
## 500 500 500 500 500
oneSamTab <- colData(sce) %>%
as.data.frame() %>%
filter(SampleName == "PBMMC_1") %>%
dplyr::select(SampleName,Barcode, sum) %>%
mutate(cell_num = 1:n())
p_before_nom <- ggplot(data=oneSamTab, aes(x=cell_num, y=sum)) +
geom_bar(stat = 'identity') +
labs( x= 'Cell Index',
y='Cell UMI counts',
title = "PBMMC_1: Before Normalization" ) +
theme_classic() +
theme(
plot.title = element_text(hjust = 0.5, size=20, color = 'red')
)
p_before_nom
Above plot shows the UMI counts/cell (transcript molecules) for 500 cell from the PBMMC_1 sample
UMI counts fluctuate
We derive biological insights downstream by comparing cells against each other.
But the UMI counts differences makes it harder to compare cells.
Why total transcript molecules (UMI counts) detected between cells differ?
A normalization technique makes the UMI counts distribution uniform, so that each cell has similar counts.
Normalization is a critical step that corrects cell-to-cell technical differences.
By normalizing, downstream comparisons of relative expression between cells are valid.
The sparse nature of scRNA data makes normalization difficult, unlike bulk RNAseq data.
Broadly two classes
Typical normalization has two steps
Scaling methods typically generate normalised counts-per-million (CPM) or transcripts-per-million (TPM) values that address the effect of sequencing depth. These values however typically have a variance that increases with their mean (heteroscedasticity) while most statistical methods assume a stable variance, which does not vary with the mean (homoscedasticity). A widely used ‘variance stabilising transformation’ is the log transformation (often log2). This works well for highly expressed genes (as in bulk RNA-seq) but less so for sparse scRNA-seq data.
Because single-cell data tend to have a substantial number of low and zero counts, these bulk normalization methods may be problematic for single-cell data.
In order to avoid pooling cells with radically different transcriptomic profiles, the cells are first clustered based on gene expression. The pools are then formed exclusively with each cluster. Size factors are calculated within each cluster and are then scaled so they are comparable across clusters.
The table below show the number and size of clusters found:
set.seed(100)
clust <- quickCluster(sce, BPPARAM=bpp)
table(clust)
## clust
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 604 387 258 567 143 138 858 136 245 579 242 131 125 128 141 128 128 139 146 277
sce <- computePooledFactors(sce,
clusters = clust,
min.mean = 0.1,
BPPARAM = bpp)
deconv.sf <- sizeFactors(sce)
summary(deconv.sf)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0160 0.4163 0.7316 1.0000 1.3418 12.3957
Note: min.mean - A numeric scalar specifying the minimum (library size-adjusted) average count of genes to be used for normalization. This means large numbers of very lowly expressed genes will not bias the normalization.
Plot deconvolution size factors against library size factors:
lib.sf <- librarySizeFactors(sce)
data.frame(LibrarySizeFactors = lib.sf,
DeconvolutionSizeFactors = deconv.sf,
SampleGroup = sce$SampleGroup) %>%
ggplot(aes(x=LibrarySizeFactors, y=DeconvolutionSizeFactors)) +
geom_point(aes(col=SampleGroup)) +
geom_abline(slope = 1, intercept = 0)
For each cell, raw counts for genes are divided by the size factor
for that cell and log-transformed so downstream analyses focus on genes
with strong relative differences. We use
scater::logNormCounts()
.
sce <- logNormCounts(sce)
assayNames(sce)
## [1] "counts" "logcounts"
Normalised counts are much less variable across cells than raw counts
norm_counts <- logNormCounts(sce,transform='none' ) %>%
assay('normcounts') %>%
as.matrix() %>%
colSums()
norm_counts <- tibble(Barcode=names(norm_counts),
normCounts = log2(norm_counts)
)
norm_counts <- inner_join(norm_counts, oneSamTab, by='Barcode')
p_after_norm <- ggplot(data=norm_counts, aes(x=cell_num, y=normCounts)) +
geom_bar(stat = 'identity') +
labs( x= 'Cell Index',
y='Normalized Cell UMI counts',
title = "PBMMC_1:After Normalization" ) +
theme_classic() +
theme(
plot.title = element_text(hjust = 0.5, size=20, color = 'red')
)
p_before_nom + p_after_norm
Let’s separate out the scaling normalisation from the log transformation
What do the un-normalised data look like if we log them?
p_before_norm_log <- ggplot(data=oneSamTab, aes(x=cell_num, y=log2(sum))) +
geom_bar(stat = 'identity') +
labs( x= 'Cell Index',
y='Cell UMI counts',
title = "Logged raw counts" ) +
theme_classic() +
theme(
plot.title = element_text(hjust = 0.5, size=20, color = 'red')
)
p_before_norm_log + p_after_norm
Simply logging the sum of counts per cell reduces the variation a lot, but the scaling is required to do the job properly
The log transformation is meant to reduce the correlation between mean and variance for genes - has this worked?
We can look at the relationship between the mean gene expression and variance for raw UMI counts, scaled counts and scaled, logged counts
For raw counts:
# mean and variance for raw counts
mean <- rowMeans(assay(sce, "counts"))
var <- rowVars(assay(sce, "counts"))
# Scatter plot
plot(log(mean), log(var))
abline(a=1, b=1, col="red")
There is a strong linear relationship between mean and variance of UMI counts across genes
For scaled counts:
# Mean and variance for scaled counts
mean_scaled <- logNormCounts(sce,transform='none' ) %>%
assay('normcounts') %>%
rowMeans()
var_scaled <- logNormCounts(sce,transform='none' ) %>%
assay('normcounts') %>%
rowVars()
plot(log(mean_scaled), log(var_scaled))
abline(a=1, b=1, col="red")
The relationship is still there after scaling the counts
For scaled, log transformed counts:
# Mean and variance for scaled, log transformed counts
mean_norm <- rowMeans(assay(sce, "logcounts"))
var_norm <- rowVars(assay(sce, "logcounts"))
plot(mean_norm, var_norm)
abline(a=1, b=1, col="red")
We see that the log transformation removes a large part of the relationship between mean and variance for gene expression values
saveRDS(sce, "results/caron_normalised.rds")
Apply the deconvolution normalisation on a single sample: ETV6-RUNX1_1.
You will first load the a single cell experiment object containing the entire Caron data set. First it is necessary to select only cells that came from the sample ‘ETV6-RUNX1_1’. You can then apply the normalization by deconvolution by clustering the cells, computing size factors and using these to log-normalize the counts.
With scaling normalisation a correlation remains between the mean and
variation of expression (heteroskedasticity). This affects downstream
dimensionality reduction as the few main new dimensions are usually
correlated with library size. sctransform
addresses the
issue by regressing library size out of raw counts and providing
residuals to use as normalized and variance-stabilized expression values
in some downstream analyses, such as dimensionality reduction.
The sctransform
package is from the Seurat suite of
scRNAseq analysis packages. Rather than convert our Single Cell
Experiment object into a Seurat object and use the Seurat
package’s command SCTransform
, we will extract the counts
matrix from our SCE object and run the variance stabilising
transformation (VST) algorithm, using the sctranform
package’s vst
command, directly on the matrix. We can
extract the counts matrix - as a dgCMatrix object sparse matrix
- using the counts
function.
counts <- counts(sce)
class(counts)
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
In order to demonstrate the rationale behind the using the variance stabilising transformation, we will visually inspect various properties of our data. Our main interest is in the general trends not in individual outliers. Neither genes nor cells that stand out are important at this step; we focus on the global trends.
Gene attributes include for each gene:
gene_attr <- data.frame(mean = rowMeans(counts),
detection_rate = rowMeans(counts > 0),
var = rowVars(counts)) %>%
mutate(log_mean = log10(mean)) %>%
mutate(log_var = log10(var))
dim(gene_attr)
## [1] 29346 5
head(gene_attr)
## mean detection_rate var log_mean log_var
## ENSG00000243485 0.0001818182 0.0001818182 0.0001818182 -3.740363 -3.740363
## ENSG00000238009 0.0021818182 0.0021818182 0.0021774538 -2.661181 -2.662051
## ENSG00000239945 0.0001818182 0.0001818182 0.0001818182 -3.740363 -3.740363
## ENSG00000241860 0.0063636364 0.0061818182 0.0066879929 -2.196295 -2.174704
## ENSG00000237491 0.0403636364 0.0378181818 0.0441969945 -1.394010 -1.354607
## ENSG00000225880 0.0085454545 0.0081818182 0.0092013755 -2.068265 -2.036147
Attributes include for each cell:
cell_attr <- data.frame(n_umi = colSums(counts),
n_gene = colSums(counts > 0))
dim(cell_attr)
## [1] 5500 2
head(cell_attr)
## n_umi n_gene
## 1_CGACTTCGTCCAGTTA-1 4557 2024
## 1_AGAATAGCATACGCTA-1 7745 2695
## 1_TGACTAGAGAACTCGG-1 7181 2617
## 1_CTTAACTGTTATGCGT-1 1704 1081
## 1_CCCAGTTTCAAGCCTA-1 6914 2570
## 1_TACAGTGCACGACGAA-1 11510 4070
For the genes, on the log10 scale we can see that up to a mean UMI count of 0 the variance follows the line through the origin with slope one, i.e. variance and mean are roughly equal as expected under a Poisson model. However, genes with a higher average UMI count show overdispersion compared to Poisson.
ggplot(gene_attr, aes(log_mean, log_var)) +
geom_point(alpha=0.3, shape=16) +
geom_abline(intercept = 0, slope = 1, color='red')
In line with the previous plot, we see a lower than expected detection rate in the medium expression range. However, for the highly expressed genes, the rate is at or very close to 1.0 suggesting that there is no zero-inflation in the counts for those genes and that zero-inflation is a result of overdispersion, rather than an independent systematic bias.
x = seq(from = -3, to = 2, length.out = 1000)
poisson_model <- data.frame(log_mean = x,
detection_rate = 1 - dpois(0, lambda = 10^x))
ggplot(gene_attr, aes(log_mean, detection_rate)) +
geom_point(alpha=0.3, shape=16) +
geom_line(data=poisson_model, color='red') +
theme_gray(base_size = 8)
The plot below shows the relationship between the two cell attributes computed: library size (n_umi) and number of genes detected (n_gene).
ggplot(cell_attr, aes(n_umi, n_gene)) +
geom_point(alpha=0.3, shape=16) +
geom_density_2d(size = 0.3)
From the sctransform vignette: “Based on the observations above, which are not unique to this particular data set, we propose to model the expression of each gene as a negative binomial random variable with a mean that depends on other variables. Here the other variables can be used to model the differences in sequencing depth between cells and are used as independent variables in a regression model. In order to avoid overfitting, we will first fit model parameters per gene, and then use the relationship between gene mean and parameter values to fit parameters, thereby combining information across genes. Given the fitted model parameters, we transform each observed UMI count into a Pearson residual which can be interpreted as the number of standard deviations an observed count was away from the expected mean. If the model accurately describes the mean-variance relationship and the dependency of mean and latent factors, then the result should have mean zero and a stable variance across the range of expression.”
In short:
Assumptions:
Outcome:
We will now estimate model parameters and transform data.
The vst
function estimates model parameters and performs
the variance stabilizing transformation.
Here we use the log10 of the total UMI counts of a cell as variable for sequencing depth for each cell. After data transformation we plot the model parameters as a function of gene mean (geometric mean). We will set the following arguments:
umi
- The matrix of UMI counts with genes as rows and
cells as columnslatent_var
- The independent variables to regress out
as a character vectorreturn_gene_attr
- Make cell attributes part of the
outputreturn_cell_attr
- Calculate gene attributes and make
part of outputset.seed(44)
vst_out <- vst(umi = counts,
latent_var = c('log_umi'),
return_gene_attr = TRUE,
return_cell_attr = TRUE
)
We will generate some diagnostic plots in order to inspect the estimated and fitted model parameters.
By default parameters shown are:
plot_model_pars(vst_out)
We check the regression model used is the one the we intended:
vst_out$model_str
## [1] "y ~ log_umi"
We will now look at several genes in more detail by plotting observed UMI counts and comparing these to plots using the residuals from the modelling.
For each gene of interest, we will plot:
We will look at two genes: ‘RPL10’ and ‘HBB’:
## ID Symbol Type Chromosome
## ENSG00000244734 ENSG00000244734 HBB Gene Expression chr11
## ENSG00000147403 ENSG00000147403 RPL10 Gene Expression chrX
ensId <- rowData(sce) %>%
as.data.frame %>%
filter(Symbol %in% c('RPL10', 'HBB')) %>%
pull("ID")
plot_model(x = vst_out,
umi = counts,
goi = ensId,
plot_residual = TRUE)
The distribution of residual mean is centered around 0:
ggplot(vst_out$gene_attr, aes(x = residual_mean)) +
geom_histogram(binwidth=0.01)
The distribution of residual variance is centered around 1:
ggplot(vst_out$gene_attr, aes(residual_variance)) +
geom_histogram(binwidth=0.1) +
geom_vline(xintercept=1, color='red') +
xlim(0, 10)
Plotting the residual variance against the mean shows that after transformation there is no relationship between gene mean and variance.
ggplot(vst_out$gene_attr, aes(x = log10(gmean), y = residual_variance)) +
geom_point(alpha=0.3, shape=16)
Check genes with large residual variance. These genes would be markers of expected cell populations. Note how they represent a great range of mean UMI and detection rate values.
vst_out$gene_attr %>%
arrange(desc(residual_variance)) %>%
top_n(n = 10) %>%
mutate(across(where(is.numeric), round, 2)) %>%
rownames_to_column("ID") %>%
left_join(as.data.frame(rowData(sce))[,c("ID", "Symbol")], "ID")
## ID detection_rate gmean amean variance residual_mean
## 1 ENSG00000244734 0.67 3.49 203.68 651201.82 5.37
## 2 ENSG00000206172 0.46 1.58 55.13 46412.66 5.07
## 3 ENSG00000188536 0.51 1.96 77.43 91150.75 5.16
## 4 ENSG00000223609 0.21 0.51 7.31 1218.72 3.18
## 5 ENSG00000090382 0.11 0.25 3.38 434.14 2.23
## 6 ENSG00000206177 0.15 0.35 4.03 475.97 2.30
## 7 ENSG00000143546 0.13 0.24 2.72 321.30 1.85
## 8 ENSG00000163220 0.13 0.24 2.65 315.11 1.80
## 9 ENSG00000169877 0.15 0.36 3.66 362.73 2.14
## 10 ENSG00000133742 0.14 0.30 3.20 474.86 1.67
## residual_variance Symbol
## 1 368.61 HBB
## 2 362.61 HBA1
## 3 360.10 HBA2
## 4 213.69 HBD
## 5 171.25 LYZ
## 6 146.62 HBM
## 7 144.80 S100A8
## 8 135.42 S100A9
## 9 124.19 AHSP
## 10 105.70 CA1
In order to store the transformed values in our Single Cell object,
we need to add them as a new “assay”. The transformed values are kept as
a matrix in the y
object within vst_out
.
Note that, by default, genes that are expressed in fewer than 5 cells
are not used by vst
and results for these genes are not
returned, so to add vst_out$y
as an assay in our single
cell object we may need to subset the rows of our sce
object to match the rows of vst_out$y
. In our case, about
10,000 genes were expressed in less than 5 cells, so we will need to
subset our SCE object before adding the VST normalised counts.
keepGenes <- rownames(sce)%in%rownames(vst_out$y)
sce <- sce[keepGenes,]
vstMat <- as(vst_out$y[rownames(sce),], "dgCMatrix")
assay(sce, "sctrans_norm", withDimnames=FALSE) <- vstMat
Using a single sample - ETV6-RUNX1_1 - inspect the mean-variance relationship and apply sctransform to that data.
sessionInfo()
## R version 4.4.0 (2024-04-24)
## Platform: aarch64-apple-darwin20
## Running under: macOS 15.1.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: Europe/London
## tzcode source: internal
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] patchwork_1.2.0 BiocParallel_1.38.0
## [3] lubridate_1.9.3 forcats_1.0.0
## [5] stringr_1.5.1 dplyr_1.1.4
## [7] purrr_1.0.2 readr_2.1.5
## [9] tidyr_1.3.1 tibble_3.2.1
## [11] tidyverse_2.0.0 sctransform_0.4.1
## [13] scran_1.32.0 scater_1.32.1
## [15] ggplot2_3.5.1 scuttle_1.14.0
## [17] SingleCellExperiment_1.26.0 SummarizedExperiment_1.34.0
## [19] Biobase_2.64.0 GenomicRanges_1.56.0
## [21] GenomeInfoDb_1.40.1 IRanges_2.38.0
## [23] S4Vectors_0.42.0 BiocGenerics_0.50.0
## [25] MatrixGenerics_1.16.0 matrixStats_1.3.0
## [27] knitr_1.47
##
## loaded via a namespace (and not attached):
## [1] gridExtra_2.3 rlang_1.1.3
## [3] magrittr_2.0.3 compiler_4.4.0
## [5] DelayedMatrixStats_1.26.0 vctrs_0.6.5
## [7] reshape2_1.4.4 pkgconfig_2.0.3
## [9] crayon_1.5.2 fastmap_1.2.0
## [11] XVector_0.44.0 labeling_0.4.3
## [13] utf8_1.2.4 rmarkdown_2.27
## [15] tzdb_0.4.0 UCSC.utils_1.0.0
## [17] ggbeeswarm_0.7.2 xfun_0.44
## [19] bluster_1.14.0 zlibbioc_1.50.0
## [21] cachem_1.1.0 beachmat_2.20.0
## [23] jsonlite_1.8.8 highr_0.11
## [25] DelayedArray_0.30.1 irlba_2.3.5.1
## [27] parallel_4.4.0 cluster_2.1.6
## [29] R6_2.5.1 bslib_0.7.0
## [31] stringi_1.8.4 limma_3.60.2
## [33] parallelly_1.37.1 jquerylib_0.1.4
## [35] Rcpp_1.0.12 future.apply_1.11.2
## [37] timechange_0.3.0 Matrix_1.7-0
## [39] igraph_2.0.3 tidyselect_1.2.1
## [41] rstudioapi_0.16.0 abind_1.4-5
## [43] yaml_2.3.8 viridis_0.6.5
## [45] codetools_0.2-20 listenv_0.9.1
## [47] lattice_0.22-6 plyr_1.8.9
## [49] withr_3.0.0 evaluate_0.23
## [51] future_1.33.2 isoband_0.2.7
## [53] pillar_1.9.0 generics_0.1.3
## [55] hms_1.1.3 sparseMatrixStats_1.16.0
## [57] munsell_0.5.1 scales_1.3.0
## [59] globals_0.16.3 glue_1.7.0
## [61] metapod_1.12.0 tools_4.4.0
## [63] BiocNeighbors_1.22.0 ScaledMatrix_1.12.0
## [65] locfit_1.5-9.9 grid_4.4.0
## [67] edgeR_4.2.2 colorspace_2.1-0
## [69] GenomeInfoDbData_1.2.12 beeswarm_0.4.0
## [71] BiocSingular_1.20.0 vipor_0.4.7
## [73] cli_3.6.2 rsvd_1.0.5
## [75] fansi_1.0.6 S4Arrays_1.4.1
## [77] viridisLite_0.4.2 gtable_0.3.5
## [79] sass_0.4.9 digest_0.6.35
## [81] SparseArray_1.4.8 ggrepel_0.9.5
## [83] dqrng_0.4.1 farver_2.1.2
## [85] htmltools_0.5.8.1 lifecycle_1.0.4
## [87] httr_1.4.7 statmod_1.5.0
## [89] MASS_7.3-60.2