1 Normalisation

Acknowledgments: much of the material in this section hase been derived from the chapters on normalisation in the OSCA book and the Hemberg Group course materials.

2 Load packages

#install.packages('irlba')
library(irlba)
library(knitr)
library(scater)
library(scran)
library(tidyverse)
library(BiocParallel)
library(patchwork)
bpp <- MulticoreParam(7)

2.1 Load data object

For the purposes of this demonstration we have generated a smaller data set in which there are only 500 cells per sample. This is so that the code can be run in a reasonable amount of time during the live teaching session. The data were first QC’d and filtered as described in the QC and exploratory analysis session. After this 500 cells were selected at random from each sample.

sce <- readRDS("R_objects/Caron_filtered.500.rds")
sce
## class: SingleCellExperiment 
## dim: 29346 5500 
## metadata(1): Samples
## assays(1): counts
## rownames(29346): ENSG00000243485 ENSG00000238009 ... ENSG00000275063
##   ENSG00000278817
## rowData names(4): ID Symbol Type Chromosome
## colnames(5500): 1_CGACTTCGTCCAGTTA-1 1_AGAATAGCATACGCTA-1 ...
##   8_AGCGTCGAGATGTAAC-1 8_CATGACAAGATAGGAG-1
## colData names(10): Sample Barcode ... subsets_Mito_percent total
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
table(sce$SampleName)
## 
## ETV6-RUNX1_1 ETV6-RUNX1_2 ETV6-RUNX1_3 ETV6-RUNX1_4        HHD_1        HHD_2 
##          500          500          500          500          500          500 
##      PBMMC_1      PBMMC_2      PBMMC_3      PRE-T_1      PRE-T_2 
##          500          500          500          500          500

2.2 Learning objectives

  • Understand why normalisation is required
  • Understand concepts of deconvolution normalisation methods

2.3 Why normalise?

oneSamTab <- colData(sce) %>% 
  as.data.frame() %>% 
  filter(SampleName == "PBMMC_1") %>% 
  dplyr::select(SampleName,Barcode, sum) %>% 
  mutate(cell_num = 1:n())

p_before_nom <- ggplot(data=oneSamTab, aes(x=cell_num, y=sum)) +
  geom_bar(stat = 'identity') +
  labs( x= 'Cell Index',
        y='Cell UMI counts',
        title = "PBMMC_1: Before Normalization" ) +
  theme_classic() +
  theme(
    plot.title = element_text(hjust = 0.5, size=20, color = 'red')
  )

p_before_nom

  • Above plot shows the UMI counts/cell (transcript molecules) for 500 cell from the PBMMC_1 sample

  • UMI counts fluctuate

  • We derive biological insights downstream by comparing cells against each other.

  • But the UMI counts differences makes it harder to compare cells.

  • Why total transcript molecules (UMI counts) detected between cells differ?

    • Biological:
      • Cell sub type differences, like size and transcription activity etc.
    • Technical: scRNA data is inherently noisy
      • Low mRNA content per cell
      • cell-to-cell differences in mRNA capture efficiency
      • Variable sequencing depth
      • PCR amplification efficiency
  • A normalization technique makes the UMI counts distribution uniform, so that each cell has similar counts.

  • Normalization is a critical step that corrects cell-to-cell technical differences.

  • By normalizing, downstream comparisons of relative expression between cells are valid.

2.4 Normalization strategies

The sparse nature of scRNA data makes normalization difficult, unlike bulk RNAseq data.

  • Broadly two classes
    1. Spike-in methods
    • Uses pike-in controls for normalisation
    • Not available for droplet based scRNA techniques like 10x.
    1. Non-spike-in methods:
    • Using available counts data for normalization
    • DEseq2
    • edgeR - TMM
    • Library size normalization
    • deconvolution
    • sctransform
  • Typical normalization has two steps
    1. scaling
    • Estimate size or scaling or normalization factor: computation of a cell-specific ‘scaling’ or ‘size’ factor or “normalization factor” that represents the relative bias in that cell and division of all counts for the cell by that factor to remove that bias. Assumption: any cell specific bias will affect genes the same way.
    • Scale the data by dividing the count for each gene with the appropriate size factor for that cell
    1. Transformation
    • log2
    • Square root transformation
    • Pearson residuals (eg. sctransform)

Scaling methods typically generate normalised counts-per-million (CPM) or transcripts-per-million (TPM) values that address the effect of sequencing depth. These values however typically have a variance that increases with their mean (heteroscedasticity) while most statistical methods assume a stable variance, which does not vary with the mean (homoscedasticity). A widely used ‘variance stabilising transformation’ is the log transformation (often log2). This works well for highly expressed genes (as in bulk RNA-seq) but less so for sparse scRNA-seq data.

DESeq,edgeR and Library size normalizations * DEseq, edgeR-TMM and Library size normalization initially developed for bulk RNAseq * Applying these methods on scRNAseq data systematically under or over estimate size factors. i.e systematically deviate from true size factors. * This deviation is the result of removing zeroes prior to normalization. * Therefore other normalization methods specific to scRNAseq data like deconvolution, sctransform etc. were proposed.

2.5 Deconvolution

Because single-cell data tend to have a substantial number of low and zero counts, these bulk normalization methods may be problematic for single-cell data.

  • Deconvolution aims to normalize expression values based on summed values from pools of cells.
  • Since cell summation results in fewer zeros, the ensuing normalization is less susceptible to errors than existing methods.
  • The estimated size factors are only relevant to the pools of cells, even though normalization accuracy has improved.
  • Each pool’s size factor is deconvolved into its constituent cells’ size factors.

In order to avoid pooling cells with radically different transcriptomic profiles, the cells are first clustered based on gene expression. The pools are then formed exclusively with each cluster. Size factors are calculated within each cluster and are then scaled so they are comparable across clusters.

2.5.1 Cluster cells

The table below show the number and size of clusters found:

set.seed(100)
clust <- quickCluster(sce, BPPARAM=bpp)
table(clust)
## clust
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
## 604 387 258 567 143 138 858 136 245 579 242 131 125 128 141 128 128 139 146 277

2.5.2 Compute size factors

sce <- computePooledFactors(sce,
             clusters = clust,
             min.mean = 0.1,
             BPPARAM = bpp)
deconv.sf <- sizeFactors(sce)
summary(deconv.sf)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0160  0.4163  0.7316  1.0000  1.3418 12.3957

Note: min.mean - A numeric scalar specifying the minimum (library size-adjusted) average count of genes to be used for normalization. This means large numbers of very lowly expressed genes will not bias the normalization.

Plot deconvolution size factors against library size factors:

lib.sf <- librarySizeFactors(sce)
data.frame(LibrarySizeFactors = lib.sf, 
           DeconvolutionSizeFactors = deconv.sf,
                 SampleGroup = sce$SampleGroup) %>%
  ggplot(aes(x=LibrarySizeFactors, y=DeconvolutionSizeFactors)) +
      geom_point(aes(col=SampleGroup)) +
      geom_abline(slope = 1, intercept = 0)

2.5.3 Apply size factors

For each cell, raw counts for genes are divided by the size factor for that cell and log-transformed so downstream analyses focus on genes with strong relative differences. We use scater::logNormCounts().

sce <- logNormCounts(sce)
assayNames(sce)
## [1] "counts"    "logcounts"

2.5.4 Explore the effect of normalisation

Normalised counts are much less variable across cells than raw counts

norm_counts <- logNormCounts(sce,transform='none' ) %>% 
  assay('normcounts') %>% 
  as.matrix() %>% 
  colSums()
norm_counts <- tibble(Barcode=names(norm_counts),
                      normCounts = log2(norm_counts)
                      )
norm_counts <- inner_join(norm_counts, oneSamTab, by='Barcode')


p_after_norm <- ggplot(data=norm_counts, aes(x=cell_num, y=normCounts)) +
  geom_bar(stat = 'identity') +
  labs( x= 'Cell Index',
        y='Normalized Cell UMI counts',
        title = "PBMMC_1:After Normalization" ) +
  theme_classic() +
  theme(
    plot.title = element_text(hjust = 0.5, size=20, color = 'red')
  )

p_before_nom + p_after_norm

Let’s separate out the scaling normalisation from the log transformation

What do the un-normalised data look like if we log them?

p_before_norm_log <- ggplot(data=oneSamTab, aes(x=cell_num, y=log2(sum))) +
  geom_bar(stat = 'identity') +
  labs( x= 'Cell Index',
        y='Cell UMI counts',
        title = "Logged raw counts" ) +
  theme_classic() +
  theme(
    plot.title = element_text(hjust = 0.5, size=20, color = 'red')
  )

p_before_norm_log + p_after_norm

Simply logging the sum of counts per cell reduces the variation a lot, but the scaling is required to do the job properly

The log transformation is meant to reduce the correlation between mean and variance for genes - has this worked?

We can look at the relationship between the mean gene expression and variance for raw UMI counts, scaled counts and scaled, logged counts

For raw counts:

# mean and variance for raw counts
mean <- rowMeans(assay(sce, "counts"))
var <- rowVars(assay(sce, "counts"))

# Scatter plot
plot(log(mean), log(var))
abline(a=1, b=1, col="red")

There is a strong linear relationship between mean and variance of UMI counts across genes

For scaled counts:

# Mean and variance for scaled counts
mean_scaled <- logNormCounts(sce,transform='none' ) %>% 
  assay('normcounts') %>% 
  rowMeans()
var_scaled <- logNormCounts(sce,transform='none' ) %>% 
  assay('normcounts') %>% 
  rowVars()

plot(log(mean_scaled), log(var_scaled))
abline(a=1, b=1, col="red")

The relationship is still there after scaling the counts

For scaled, log transformed counts:

# Mean and variance for scaled, log transformed counts
mean_norm <- rowMeans(assay(sce, "logcounts"))
var_norm <- rowVars(assay(sce, "logcounts"))

plot(mean_norm, var_norm)
abline(a=1, b=1, col="red")

We see that the log transformation removes a large part of the relationship between mean and variance for gene expression values

2.6 Save the normalised object

saveRDS(sce, "results/caron_normalised.rds")

2.7 Exercise

Apply the deconvolution normalisation on a single sample: ETV6-RUNX1_1.

You will first load the a single cell experiment object containing the entire Caron data set. First it is necessary to select only cells that came from the sample ‘ETV6-RUNX1_1’. You can then apply the normalization by deconvolution by clustering the cells, computing size factors and using these to log-normalize the counts.

3 Session information

sessionInfo()
## R version 4.3.2 (2023-10-31)
## Platform: x86_64-apple-darwin20 (64-bit)
## Running under: macOS Sonoma 14.3.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Europe/London
## tzcode source: internal
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] patchwork_1.2.0             BiocParallel_1.36.0        
##  [3] lubridate_1.9.3             forcats_1.0.0              
##  [5] stringr_1.5.1               dplyr_1.1.4                
##  [7] purrr_1.0.2                 readr_2.1.5                
##  [9] tidyr_1.3.1                 tibble_3.2.1               
## [11] tidyverse_2.0.0             scran_1.30.2               
## [13] scater_1.30.1               ggplot2_3.5.0              
## [15] scuttle_1.12.0              SingleCellExperiment_1.24.0
## [17] SummarizedExperiment_1.32.0 Biobase_2.62.0             
## [19] GenomicRanges_1.54.1        GenomeInfoDb_1.38.7        
## [21] IRanges_2.36.0              S4Vectors_0.40.2           
## [23] BiocGenerics_0.48.1         MatrixGenerics_1.14.0      
## [25] matrixStats_1.2.0           knitr_1.45                 
## [27] irlba_2.3.5.1               Matrix_1.6-5               
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-7              gridExtra_2.3            
##  [3] rlang_1.1.3               magrittr_2.0.3           
##  [5] compiler_4.3.2            DelayedMatrixStats_1.24.0
##  [7] vctrs_0.6.5               pkgconfig_2.0.3          
##  [9] crayon_1.5.2              fastmap_1.1.1            
## [11] XVector_0.42.0            labeling_0.4.3           
## [13] utf8_1.2.4                rmarkdown_2.26           
## [15] tzdb_0.4.0                ggbeeswarm_0.7.2         
## [17] xfun_0.42                 bluster_1.12.0           
## [19] zlibbioc_1.48.0           cachem_1.0.8             
## [21] beachmat_2.18.1           jsonlite_1.8.8           
## [23] highr_0.10                DelayedArray_0.28.0      
## [25] parallel_4.3.2            cluster_2.1.6            
## [27] R6_2.5.1                  bslib_0.6.1              
## [29] stringi_1.8.3             limma_3.58.1             
## [31] jquerylib_0.1.4           Rcpp_1.0.12              
## [33] igraph_2.0.2              timechange_0.3.0         
## [35] tidyselect_1.2.1          rstudioapi_0.15.0        
## [37] abind_1.4-5               yaml_2.3.8               
## [39] viridis_0.6.5             codetools_0.2-19         
## [41] lattice_0.22-5            withr_3.0.0              
## [43] evaluate_0.23             pillar_1.9.0             
## [45] generics_0.1.3            RCurl_1.98-1.14          
## [47] hms_1.1.3                 sparseMatrixStats_1.14.0 
## [49] munsell_0.5.0             scales_1.3.0             
## [51] glue_1.7.0                metapod_1.10.1           
## [53] tools_4.3.2               BiocNeighbors_1.20.2     
## [55] ScaledMatrix_1.10.0       locfit_1.5-9.9           
## [57] grid_4.3.2                edgeR_4.0.16             
## [59] colorspace_2.1-0          GenomeInfoDbData_1.2.11  
## [61] beeswarm_0.4.0            BiocSingular_1.18.0      
## [63] vipor_0.4.7               cli_3.6.2                
## [65] rsvd_1.0.5                fansi_1.0.6              
## [67] S4Arrays_1.2.1            viridisLite_0.4.2        
## [69] gtable_0.3.4              sass_0.4.8               
## [71] digest_0.6.35             SparseArray_1.2.4        
## [73] ggrepel_0.9.5             dqrng_0.3.2              
## [75] farver_2.1.1              htmltools_0.5.7          
## [77] lifecycle_1.0.4           statmod_1.5.0