Acknowledgments: much of the material in this section hase been derived from the chapters on normalisation in the OSCA book and the Hemberg Group course materials.
#install.packages('irlba')
library(irlba)
library(knitr)
library(scater)
library(scran)
library(tidyverse)
library(BiocParallel)
library(patchwork)
bpp <- MulticoreParam(7)
For the purposes of this demonstration we have generated a smaller data set in which there are only 500 cells per sample. This is so that the code can be run in a reasonable amount of time during the live teaching session. The data were first QC’d and filtered as described in the QC and exploratory analysis session. After this 500 cells were selected at random from each sample.
sce <- readRDS("R_objects/Caron_filtered.500.rds")
sce
## class: SingleCellExperiment
## dim: 29346 5500
## metadata(1): Samples
## assays(1): counts
## rownames(29346): ENSG00000243485 ENSG00000238009 ... ENSG00000275063
## ENSG00000278817
## rowData names(4): ID Symbol Type Chromosome
## colnames(5500): 1_CGACTTCGTCCAGTTA-1 1_AGAATAGCATACGCTA-1 ...
## 8_AGCGTCGAGATGTAAC-1 8_CATGACAAGATAGGAG-1
## colData names(10): Sample Barcode ... subsets_Mito_percent total
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
table(sce$SampleName)
##
## ETV6-RUNX1_1 ETV6-RUNX1_2 ETV6-RUNX1_3 ETV6-RUNX1_4 HHD_1 HHD_2
## 500 500 500 500 500 500
## PBMMC_1 PBMMC_2 PBMMC_3 PRE-T_1 PRE-T_2
## 500 500 500 500 500
oneSamTab <- colData(sce) %>%
as.data.frame() %>%
filter(SampleName == "PBMMC_1") %>%
dplyr::select(SampleName,Barcode, sum) %>%
mutate(cell_num = 1:n())
p_before_nom <- ggplot(data=oneSamTab, aes(x=cell_num, y=sum)) +
geom_bar(stat = 'identity') +
labs( x= 'Cell Index',
y='Cell UMI counts',
title = "PBMMC_1: Before Normalization" ) +
theme_classic() +
theme(
plot.title = element_text(hjust = 0.5, size=20, color = 'red')
)
p_before_nom
Above plot shows the UMI counts/cell (transcript molecules) for 500 cell from the PBMMC_1 sample
UMI counts fluctuate
We derive biological insights downstream by comparing cells against each other.
But the UMI counts differences makes it harder to compare cells.
Why total transcript molecules (UMI counts) detected between cells differ?
A normalization technique makes the UMI counts distribution uniform, so that each cell has similar counts.
Normalization is a critical step that corrects cell-to-cell technical differences.
By normalizing, downstream comparisons of relative expression between cells are valid.
The sparse nature of scRNA data makes normalization difficult, unlike bulk RNAseq data.
Scaling methods typically generate normalised counts-per-million (CPM) or transcripts-per-million (TPM) values that address the effect of sequencing depth. These values however typically have a variance that increases with their mean (heteroscedasticity) while most statistical methods assume a stable variance, which does not vary with the mean (homoscedasticity). A widely used ‘variance stabilising transformation’ is the log transformation (often log2). This works well for highly expressed genes (as in bulk RNA-seq) but less so for sparse scRNA-seq data.
* DEseq, edgeR-TMM and Library size normalization initially developed for bulk RNAseq * Applying these methods on scRNAseq data systematically under or over estimate size factors. i.e systematically deviate from true size factors. * This deviation is the result of removing zeroes prior to normalization. * Therefore other normalization methods specific to scRNAseq data like deconvolution, sctransform etc. were proposed.
Because single-cell data tend to have a substantial number of low and zero counts, these bulk normalization methods may be problematic for single-cell data.
In order to avoid pooling cells with radically different transcriptomic profiles, the cells are first clustered based on gene expression. The pools are then formed exclusively with each cluster. Size factors are calculated within each cluster and are then scaled so they are comparable across clusters.
The table below show the number and size of clusters found:
set.seed(100)
clust <- quickCluster(sce, BPPARAM=bpp)
table(clust)
## clust
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 604 387 258 567 143 138 858 136 245 579 242 131 125 128 141 128 128 139 146 277
sce <- computePooledFactors(sce,
clusters = clust,
min.mean = 0.1,
BPPARAM = bpp)
deconv.sf <- sizeFactors(sce)
summary(deconv.sf)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0160 0.4163 0.7316 1.0000 1.3418 12.3957
Note: min.mean - A numeric scalar specifying the minimum (library size-adjusted) average count of genes to be used for normalization. This means large numbers of very lowly expressed genes will not bias the normalization.
Plot deconvolution size factors against library size factors:
lib.sf <- librarySizeFactors(sce)
data.frame(LibrarySizeFactors = lib.sf,
DeconvolutionSizeFactors = deconv.sf,
SampleGroup = sce$SampleGroup) %>%
ggplot(aes(x=LibrarySizeFactors, y=DeconvolutionSizeFactors)) +
geom_point(aes(col=SampleGroup)) +
geom_abline(slope = 1, intercept = 0)
For each cell, raw counts for genes are divided by the size factor
for that cell and log-transformed so downstream analyses focus on genes
with strong relative differences. We use
scater::logNormCounts()
.
sce <- logNormCounts(sce)
assayNames(sce)
## [1] "counts" "logcounts"
Normalised counts are much less variable across cells than raw counts
norm_counts <- logNormCounts(sce,transform='none' ) %>%
assay('normcounts') %>%
as.matrix() %>%
colSums()
norm_counts <- tibble(Barcode=names(norm_counts),
normCounts = log2(norm_counts)
)
norm_counts <- inner_join(norm_counts, oneSamTab, by='Barcode')
p_after_norm <- ggplot(data=norm_counts, aes(x=cell_num, y=normCounts)) +
geom_bar(stat = 'identity') +
labs( x= 'Cell Index',
y='Normalized Cell UMI counts',
title = "PBMMC_1:After Normalization" ) +
theme_classic() +
theme(
plot.title = element_text(hjust = 0.5, size=20, color = 'red')
)
p_before_nom + p_after_norm
Let’s separate out the scaling normalisation from the log transformation
What do the un-normalised data look like if we log them?
p_before_norm_log <- ggplot(data=oneSamTab, aes(x=cell_num, y=log2(sum))) +
geom_bar(stat = 'identity') +
labs( x= 'Cell Index',
y='Cell UMI counts',
title = "Logged raw counts" ) +
theme_classic() +
theme(
plot.title = element_text(hjust = 0.5, size=20, color = 'red')
)
p_before_norm_log + p_after_norm
Simply logging the sum of counts per cell reduces the variation a lot, but the scaling is required to do the job properly
The log transformation is meant to reduce the correlation between mean and variance for genes - has this worked?
We can look at the relationship between the mean gene expression and variance for raw UMI counts, scaled counts and scaled, logged counts
For raw counts:
# mean and variance for raw counts
mean <- rowMeans(assay(sce, "counts"))
var <- rowVars(assay(sce, "counts"))
# Scatter plot
plot(log(mean), log(var))
abline(a=1, b=1, col="red")
There is a strong linear relationship between mean and variance of UMI counts across genes
For scaled counts:
# Mean and variance for scaled counts
mean_scaled <- logNormCounts(sce,transform='none' ) %>%
assay('normcounts') %>%
rowMeans()
var_scaled <- logNormCounts(sce,transform='none' ) %>%
assay('normcounts') %>%
rowVars()
plot(log(mean_scaled), log(var_scaled))
abline(a=1, b=1, col="red")
The relationship is still there after scaling the counts
For scaled, log transformed counts:
# Mean and variance for scaled, log transformed counts
mean_norm <- rowMeans(assay(sce, "logcounts"))
var_norm <- rowVars(assay(sce, "logcounts"))
plot(mean_norm, var_norm)
abline(a=1, b=1, col="red")
We see that the log transformation removes a large part of the relationship between mean and variance for gene expression values
saveRDS(sce, "results/caron_normalised.rds")
Apply the deconvolution normalisation on a single sample: ETV6-RUNX1_1.
You will first load the a single cell experiment object containing the entire Caron data set. First it is necessary to select only cells that came from the sample ‘ETV6-RUNX1_1’. You can then apply the normalization by deconvolution by clustering the cells, computing size factors and using these to log-normalize the counts.
sessionInfo()
## R version 4.3.2 (2023-10-31)
## Platform: x86_64-apple-darwin20 (64-bit)
## Running under: macOS Sonoma 14.3.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: Europe/London
## tzcode source: internal
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] patchwork_1.2.0 BiocParallel_1.36.0
## [3] lubridate_1.9.3 forcats_1.0.0
## [5] stringr_1.5.1 dplyr_1.1.4
## [7] purrr_1.0.2 readr_2.1.5
## [9] tidyr_1.3.1 tibble_3.2.1
## [11] tidyverse_2.0.0 scran_1.30.2
## [13] scater_1.30.1 ggplot2_3.5.0
## [15] scuttle_1.12.0 SingleCellExperiment_1.24.0
## [17] SummarizedExperiment_1.32.0 Biobase_2.62.0
## [19] GenomicRanges_1.54.1 GenomeInfoDb_1.38.7
## [21] IRanges_2.36.0 S4Vectors_0.40.2
## [23] BiocGenerics_0.48.1 MatrixGenerics_1.14.0
## [25] matrixStats_1.2.0 knitr_1.45
## [27] irlba_2.3.5.1 Matrix_1.6-5
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-7 gridExtra_2.3
## [3] rlang_1.1.3 magrittr_2.0.3
## [5] compiler_4.3.2 DelayedMatrixStats_1.24.0
## [7] vctrs_0.6.5 pkgconfig_2.0.3
## [9] crayon_1.5.2 fastmap_1.1.1
## [11] XVector_0.42.0 labeling_0.4.3
## [13] utf8_1.2.4 rmarkdown_2.26
## [15] tzdb_0.4.0 ggbeeswarm_0.7.2
## [17] xfun_0.42 bluster_1.12.0
## [19] zlibbioc_1.48.0 cachem_1.0.8
## [21] beachmat_2.18.1 jsonlite_1.8.8
## [23] highr_0.10 DelayedArray_0.28.0
## [25] parallel_4.3.2 cluster_2.1.6
## [27] R6_2.5.1 bslib_0.6.1
## [29] stringi_1.8.3 limma_3.58.1
## [31] jquerylib_0.1.4 Rcpp_1.0.12
## [33] igraph_2.0.2 timechange_0.3.0
## [35] tidyselect_1.2.1 rstudioapi_0.15.0
## [37] abind_1.4-5 yaml_2.3.8
## [39] viridis_0.6.5 codetools_0.2-19
## [41] lattice_0.22-5 withr_3.0.0
## [43] evaluate_0.23 pillar_1.9.0
## [45] generics_0.1.3 RCurl_1.98-1.14
## [47] hms_1.1.3 sparseMatrixStats_1.14.0
## [49] munsell_0.5.0 scales_1.3.0
## [51] glue_1.7.0 metapod_1.10.1
## [53] tools_4.3.2 BiocNeighbors_1.20.2
## [55] ScaledMatrix_1.10.0 locfit_1.5-9.9
## [57] grid_4.3.2 edgeR_4.0.16
## [59] colorspace_2.1-0 GenomeInfoDbData_1.2.11
## [61] beeswarm_0.4.0 BiocSingular_1.18.0
## [63] vipor_0.4.7 cli_3.6.2
## [65] rsvd_1.0.5 fansi_1.0.6
## [67] S4Arrays_1.2.1 viridisLite_0.4.2
## [69] gtable_0.3.4 sass_0.4.8
## [71] digest_0.6.35 SparseArray_1.2.4
## [73] ggrepel_0.9.5 dqrng_0.3.2
## [75] farver_2.1.1 htmltools_0.5.7
## [77] lifecycle_1.0.4 statmod_1.5.0