1 Normalisation - Caron set

Sources: chapters on normalisation in the OSCA book and the Hemberg group materials.

1.1 Learning objectives

Understand why normalisation is required
Understand concepts of deconvolution normalisation method

1.2 Why normalise?

Systematic differences in sequencing coverage between libraries occur because of low input material, differences in cDNA capture and PCR amplification. Normalisation removes such differences so that differences between cells are not technical but biological, allowing meaningful comparison of expression profiles between cells.

library(scater)
library(scran)
library(scuttle)
library(tidyverse)
library(BiocSingular)
library(BiocParallel)
library(glue)
library(patchwork)

# prepare 'cluster' for multicore processing
# use 7 workers
bpp <- MulticoreParam(7)

1.3 Load object

We will load the R object created after QC and check its content (class, dimensions, assays, …)

# Read object in:
sce <- readRDS("../Robjects/Caron_filtered_genes.rds")
sce

## class: SingleCellExperiment 
## dim: 28377 46571 
## metadata(1): Samples
## assays(1): counts
## rownames(28377): ENSG00000238009 ENSG00000239945 ... ENSG00000278633
##   ENSG00000278817
## rowData names(4): ID Symbol Type Chromosome
## colnames(46571): 1_AAACCTGAGACTTTCG-1 1_AAACCTGGTCTTCAAG-1 ...
##   12_TTTGTCATCAGTTGAC-1 12_TTTGTCATCTCGTTTA-1
## colData names(6): Sample Barcode ... detected total
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):

# rename column(s) if need be
# we need 'SampleName'
dd <- colData(sce) %>%
  data.frame() %>%
  rename(SampleName=Sample) %>%
  DataFrame()

colData(sce) <- dd 

## add SampleGroup information
colData(sce)$SampleGroup <- str_replace(colData(sce)$SampleName,"_.*","")

We can also count the number of cells for each sample:

colData(sce) %>%
  # colData() returns a DFrame
  # that we need to convert to a data.frame for parsing
  data.frame() %>%
  # group by some columns only: SampleName, SampleId, SampleGroup
  # (could do with SampleName only but we would miss SampleId, SampleGroup later)
  group_by(SampleName, SampleId, SampleGroup) %>%
  # count cells for each group
  summarise(nbCells=n()) %>%
  # display output table
  DT::datatable(rownames = FALSE,
                options = list(dom="tpl", pageLength = 15))

For analyses to run within the time allocated to sessions, we will subsample cells down to 500 per sample:

# number of cells to keep
nbCells <- 500

# have new list of cell barcodes for each sample
vec.bc <- colData(sce) %>%
    data.frame() %>%
    dplyr::filter(!SampleId == "SRR9264351") %>%
    group_by(SampleName) %>%
    sample_n(nbCells) %>%
    pull(Barcode)

# subsetting a SCE (SingleCellExperiment) object requires indices not names
# so find index for each cell we will keep:
tmpInd <- which(colData(sce)$Barcode %in% vec.bc) # mind QC metrics will be wrong
# subset cells from the existing SCE object:
sce <- sce[,tmpInd] # this update 'sce', e.g. its assays, but not the cell meta data.

colData(sce) %>%
  data.frame() %>%
  dplyr::select(SampleName, SampleId, SampleGroup) %>%
  group_by(SampleName, SampleId, SampleGroup) %>%
  summarise(nbCells=n()) %>%
  DT::datatable(rownames = FALSE,
                options = list(dom="tpl", pageLength = 15, nrows=20))

Update per-gene QC metrics.

# for each gene in each cell: is it expressed?
exprLogic <- counts(sce) > 0
# count cells where gene is expressed,
# and ask if the number of cells is gt 5
detectedGenes <- rowSums(exprLogic) > 5
# count genes in each class, not-detected and detected
table(detectedGenes)

## detectedGenes
## FALSE  TRUE 
##  5003 23374

# remove these genes:
sce <- sce[detectedGenes,] # removes genes but does not update QC metrics.

# update cell QC metrics
sce$sum <- NULL
sce$detected <- NULL
sce$total <- NULL
sce <- addPerCellQC(sce)

# update gene QC metrics
sce <- addPerFeatureQC(sce, BPPARAM = bpp)

We write the R object to ‘caron_filtered_5hCell.Rds’.

# Write object to file
saveRDS(sce, "../Robjects/caron_filtered_5hCell.Rds")

1.4 Scaling normalization

In scaling normalization, the “normalization factor” is an estimate of the library size relative to the other cells. Steps usually include: computation of a cell-specific ‘scaling’ or ‘size’ factor that represents the relative bias in that cell and division of all counts for the cell by that factor to remove that bias. Assumption: any cell specific bias will affect genes the same way.

1.4.1 Library size normalization

For each cell, the library size factor is proportional to the library size such that the average size factor across cell is one.

Compute size factors:

lib.sf <- librarySizeFactors(sce)
summary(lib.sf)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1141  0.4570  0.7399  1.0000  1.2784 14.0165

Size factor distribution: wide range, typical of scRNA-seq data.

dd <- data.frame("log10libSf"=log10(lib.sf))
ggplot(dd, aes(x=log10libSf)) + 
  geom_histogram(bins=50)

Assumption: absence of compositional bias; differential expression between two cells is balanced: upregulation in some genes is accompanied by downregulation of other genes.

This normalisation due to unaccounted-for composition bias affects the size of the log fold change measured between clusters, but less so the clustering itself. It is thus sufficient to identify clusters and top marker genes.

1.4.2 Deconvolution

Composition bias occurs when differential expression between two samples or here cells is not balanced. For a fixed library size, identical in both cells, upregulation of one gene in a cell will means fewer UMIs can be assigned to other genes, which would then appear down regulated. Even if library sizes are allowed to differ, with that for the cell with upregulation being higher, scaling normalisation will reduce normalised counts. Non-upregulated would therefore also appear downregulated.

Given the sparsity of scRNA-seq data, the methods are not appropriate.

The method below increases read counts by pooling cells into groups, computing size factors within each of these groups and scaling them so they are comparable across clusters. This process is repeated many times, changing pools each time to collect several size factors for each cell, from which is derived a single value for that cell.

knitr::include_graphics("../Images/scran_Fig3.png", auto_pdf = TRUE)

Clusters of cells are first identified to help form sensible pools of cells. Scaling factors are then computed.

1.4.2.1 Cluster cells

The table below show the number and size of clusters found:

set.seed(100) # clusters with PCA from irlba with approximation
clust <- quickCluster(sce, BPPARAM=bpp) # slow with all cells.
table(clust)

## clust
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 1500 3566  523 4353 1228 1087 1452 1407  220 1470 1479 1940  106  888 3580  910 
##   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31   32 
## 2148 3079 2865  317 1273  312 3547 1366  191  988  146  452  503  666  378  490 
##   33   34   35   36   37 
## 1100  514  122  121  284

1.4.2.2 Compute size factors

sce <- computePooledFactors(sce,
             clusters = clust,
             min.mean = 0.1,
             BPPARAM = bpp)
deconv.sf <- sizeFactors(sce)
summary(deconv.sf)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.01483  0.41906  0.74237  1.00000  1.31331 14.27850

Plot deconvolution size factors against library size factors:

deconvDf <- data.frame(lib.sf, deconv.sf,
            "source_name" = sce$SampleGroup)

# colour by sample type
sp <- ggplot(deconvDf, aes(x=lib.sf, y=deconv.sf, col=source_name)) +
  geom_point()
sp

1.4.2.3 Apply size factors

For each cell, raw counts for genes are divided by the size factor for that cell and log-transformed so downstream analyses focus on genes with strong relative differences. We use scater::logNormCounts().

sce <- logNormCounts(sce) # adds logcounts
# check list of assays stored:
print(assayNames(sce))

## [1] "counts"    "logcounts"

1.4.2.4 Save object

# write to file
saveRDS(sce, "../Robjects/caron_postDeconv_5hCellPerSpl.Rds")

1.5 Exercise

Exercise: apply the deconvolution normalisation on a single sample: ETV6-RUNX1_1 (aka GSM3872434).

You first load the same object we loaded earlier, then select cells for SampleName ‘ETV6-RUNX1_1’. You will then cluster cells, compute and apply size factors.

1.6 Session information

sessionInfo()

## R version 4.1.1 (2021-08-10)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] patchwork_1.1.1             glue_1.6.2                 
##  [3] BiocParallel_1.28.3         BiocSingular_1.10.0        
##  [5] forcats_0.5.1               stringr_1.4.0              
##  [7] dplyr_1.0.8                 purrr_0.3.4                
##  [9] readr_2.1.2                 tidyr_1.2.0                
## [11] tibble_3.1.6                tidyverse_1.3.1            
## [13] scran_1.22.1                scater_1.22.0              
## [15] ggplot2_3.3.5               scuttle_1.4.0              
## [17] SingleCellExperiment_1.16.0 SummarizedExperiment_1.24.0
## [19] Biobase_2.54.0              GenomicRanges_1.46.1       
## [21] GenomeInfoDb_1.30.1         IRanges_2.28.0             
## [23] S4Vectors_0.32.4            BiocGenerics_0.40.0        
## [25] MatrixGenerics_1.6.0        matrixStats_0.61.0         
## [27] knitr_1.38                  rmarkdown_2.13             
## 
## loaded via a namespace (and not attached):
##  [1] ggbeeswarm_0.6.0          colorspace_2.0-3         
##  [3] ellipsis_0.3.2            bluster_1.4.0            
##  [5] XVector_0.34.0            BiocNeighbors_1.12.0     
##  [7] fs_1.5.2                  rstudioapi_0.13          
##  [9] farver_2.1.0              ggrepel_0.9.1            
## [11] DT_0.22                   fansi_1.0.3              
## [13] lubridate_1.8.0           xml2_1.3.3               
## [15] sparseMatrixStats_1.6.0   jsonlite_1.8.0           
## [17] broom_0.7.12              cluster_2.1.3            
## [19] dbplyr_2.1.1              png_0.1-7                
## [21] compiler_4.1.1            httr_1.4.2               
## [23] dqrng_0.3.0               backports_1.4.1          
## [25] assertthat_0.2.1          Matrix_1.4-1             
## [27] fastmap_1.1.0             limma_3.50.1             
## [29] cli_3.2.0                 htmltools_0.5.2          
## [31] tools_4.1.1               rsvd_1.0.5               
## [33] igraph_1.3.0              gtable_0.3.0             
## [35] GenomeInfoDbData_1.2.7    Rcpp_1.0.8.3             
## [37] cellranger_1.1.0          jquerylib_0.1.4          
## [39] vctrs_0.4.0               crosstalk_1.2.0          
## [41] DelayedMatrixStats_1.16.0 xfun_0.30                
## [43] beachmat_2.10.0           rvest_1.0.2              
## [45] lifecycle_1.0.1           irlba_2.3.5              
## [47] statmod_1.4.36            edgeR_3.36.0             
## [49] zlibbioc_1.40.0           scales_1.1.1             
## [51] hms_1.1.1                 parallel_4.1.1           
## [53] yaml_2.3.5                gridExtra_2.3            
## [55] sass_0.4.1                stringi_1.7.6            
## [57] highr_0.9                 ScaledMatrix_1.2.0       
## [59] rlang_1.0.2               pkgconfig_2.0.3          
## [61] bitops_1.0-7              evaluate_0.15            
## [63] lattice_0.20-45           labeling_0.4.2           
## [65] htmlwidgets_1.5.4         tidyselect_1.1.2         
## [67] magrittr_2.0.3            R6_2.5.1                 
## [69] generics_0.1.2            metapod_1.2.0            
## [71] DelayedArray_0.20.0       DBI_1.1.2                
## [73] pillar_1.7.0              haven_2.4.3              
## [75] withr_2.5.0               RCurl_1.98-1.6           
## [77] modelr_0.1.8              crayon_1.5.1             
## [79] utf8_1.2.2                tzdb_0.3.0               
## [81] viridis_0.6.2             locfit_1.5-9.5           
## [83] grid_4.1.1                readxl_1.4.0             
## [85] reprex_2.0.1              digest_0.6.29            
## [87] munsell_0.5.0             beeswarm_0.4.0           
## [89] viridisLite_0.4.0         vipor_0.4.5              
## [91] bslib_0.3.1

Introduction to single-cell RNA-seq analysis

Normalisation - Caron data set - 500 cells per sample

Stephane Ballereau/ Kamal Kishore