Sources: chapters on normalisation in the OSCA book and the Hemberg group materials.
Systematic differences in sequencing coverage between libraries arise from low input material and from differences in cDNA capture and PCR amplification efficiency. Normalisation removes such differences so that the differences that remain between cells are biological rather than technical, allowing meaningful comparison of expression profiles between cells. Normalisation and batch correction have different aims: normalisation addresses technical differences only, while batch correction considers both technical and biological differences.
We will load the R object created after QC and check its content (class, dimensions, assays, …):
# Read object in:
sce <- readRDS("../Robjects/Caron_filtered.rds")
sce
## class: SingleCellExperiment
## dim: 28377 46571
## metadata(1): Samples
## assays(1): counts
## rownames(28377): ENSG00000238009 ENSG00000239945 ... ENSG00000278633
## ENSG00000278817
## rowData names(4): ID Symbol Type Chromosome
## colnames(46571): 1_AAACCTGAGACTTTCG-1 1_AAACCTGGTCTTCAAG-1 ...
## 12_TTTGTCATCAGTTGAC-1 12_TTTGTCATCTCGTTTA-1
## colData names(14): Sample Barcode ... low_n_features high_Mito_percent
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
# rename column(s) if need be
# we need a 'SampleName' column (dplyr::rename; the tidyverse is assumed loaded)
dd <- colData(sce) %>%
  data.frame() %>%
  rename(SampleName=Sample) %>%
  DataFrame()
colData(sce) <- dd
We can also count the number of cells for each sample:
colData(sce) %>%
  # colData() returns a DFrame
  # that we need to convert to a data.frame for parsing
  data.frame() %>%
  # group by some columns only: SampleName, SampleId, SampleGroup
  # (could do with SampleName only but we would miss SampleId, SampleGroup later)
  group_by(SampleName, SampleId, SampleGroup) %>%
  # count cells for each group
  summarise(nbCells=n()) %>%
  # display output table
  DT::datatable(rownames = FALSE,
                options = list(dom="tpl", pageLength = 15))
For analyses to run within the time allocated to sessions, we will subsample cells down to 500 per sample:
# number of cells to keep
nbCells <- 500
# draw a new list of cell barcodes for each sample
# (sampling is random; set a seed beforehand for reproducibility)
vec.bc <- colData(sce) %>%
  data.frame() %>%
  dplyr::filter(SampleId != "SRR9264351") %>%
  group_by(SampleName) %>%
  sample_n(nbCells) %>%
  pull(Barcode)
# barcodes are stored in colData, while the SCE colnames carry a sample prefix,
# so find the index of each cell we will keep:
tmpInd <- which(colData(sce)$Barcode %in% vec.bc)
# subset cells from the existing SCE object:
sce <- sce[,tmpInd] # this subsets assays and cell metadata, but QC metrics are now stale
colData(sce) %>%
  data.frame() %>%
  dplyr::select(SampleName, SampleId, SampleGroup) %>%
  group_by(SampleName, SampleId, SampleGroup) %>%
  summarise(nbCells=n()) %>%
  DT::datatable(rownames = FALSE,
                options = list(dom="tpl", pageLength = 15))
We next remove genes detected in too few cells and update the per-gene and per-cell QC metrics.
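The chunk below (and several later ones) uses a BiocParallel parameter called bpp, assumed to have been defined earlier in the course; a minimal sketch of such a setup:
library(BiocParallel)
# assumed setup: a multicore backend with a handful of workers
# (on Windows, SnowParam() would be an alternative)
bpp <- MulticoreParam(workers = 7)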
# for each gene in each cell: is it expressed?
exprLogic <- counts(sce) > 0
# count cells where each gene is expressed,
# and ask whether that number of cells is greater than 5
detectedGenes <- rowSums(exprLogic) > 5
# count genes in each class, not-detected and detected
table(detectedGenes)
## detectedGenes
## FALSE TRUE
## 10675 17702
# remove these genes:
sce <- sce[detectedGenes,] # removes genes but does not update QC metrics.
# update cell QC metrics
sce$sum <- NULL
sce$detected <- NULL
sce$total <- NULL
sce <- addPerCellQC(sce)
# update gene QC metrics
sce <- addPerFeatureQC(sce, BPPARAM = bpp)
We write the R object to ‘caron_postQc_5hCellPerSpl.Rds’.
# Write object to file
saveRDS(sce, "../Robjects/caron_postQc_5hCellPerSpl.Rds")
In scaling normalisation, the “normalisation factor” is an estimate of the library size relative to the other cells. Steps usually include: computation of a cell-specific ‘scaling’ or ‘size’ factor that represents the relative bias in that cell, and division of all counts for the cell by that factor to remove that bias. Assumption: any cell-specific bias affects all genes equally.
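As a minimal sketch, using library size factors as one possible estimate, scaling normalisation divides each cell’s counts by its size factor:
library(scuttle)
# one possible size-factor estimate: library size factors
sf <- librarySizeFactors(sce)
# divide each cell's (column's) counts by its size factor
norm_counts <- t(t(counts(sce)) / sf)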
Scaling methods typically generate normalised counts-per-million (CPM) or transcripts-per-million (TPM) values that address the effect of sequencing depth. These values, however, typically have a variance that increases with their mean (heteroscedasticity), while most statistical methods assume a stable variance that does not vary with the mean (homoscedasticity). A widely used ‘variance stabilising transformation’ is the log transformation (often log2). This works fine for highly expressed genes (as in bulk RNA-seq) but less so for sparse scRNA-seq data.
Convert raw counts to counts-per-million (CPM) for each cell by dividing counts by the library size then multiplying by 1,000,000. Mind that this does not address the compositional bias caused by highly expressed genes that are also differentially expressed between cells. In scuttle, CPMs are computed essentially as follows:
calc_cpm <- function (expr_mat, spikes = NULL)
{
    # library size per cell, excluding spike-in rows if any are given
    mat <- if (is.null(spikes)) expr_mat else expr_mat[-spikes, ]
    norm_factor <- colSums(mat)
    # scale each cell's counts to counts-per-million
    t(t(expr_mat)/norm_factor) * 10^6
}
We will use scuttle’s calculateCPM() function.
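For instance (a sketch; the assay name “cpm” is our own choice):
library(scuttle)
# compute CPM values from the counts assay and store them as a new assay
assay(sce, "cpm") <- calculateCPM(sce)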
For each gene, compute the geometric mean across cells. For each cell, compute for each gene the ratio of its expression to that geometric mean, and derive the cell’s size factor as the median ratio across genes. This is not suitable for sparse scRNA-seq data, as the geometric mean is computed on non-zero values only. The method is also known as ‘Relative Log Expression’ (RLE) in edgeR and scater.
Example code:
calc_sf <- function (expr_mat, spikes = NULL)
{
    # matrix without spike-in rows, if any were given
    mat <- if (is.null(spikes)) expr_mat else expr_mat[-spikes, ]
    # per-gene geometric mean across cells
    geomeans <- exp(rowMeans(log(mat)))
    # size factor: median ratio of counts to geometric mean,
    # over genes with a finite, positive geometric mean
    SF <- function(cnts) {
        median((cnts/geomeans)[is.finite(geomeans) & geomeans > 0])
    }
    norm_factor <- apply(mat, 2, SF)
    t(t(expr_mat)/norm_factor)
}
To compute the weighted trimmed mean of M-values (TMM), a given cell is chosen as a reference to use in the computation for the other cells. The M-values are gene-wise log2 fold changes between cells. Trimming entails the removal of the top and bottom 30% of values. The size factor is computed as a weighted average of the M-values for the remaining genes, with weights given by inverse variances. This method assumes that most genes are not differentially expressed, but the 40% of genes left after trimming may include many zero counts.
sizeFactors(sce) <- edgeR::calcNormFactors(counts(sce), method = "TMM")
For each cell, the library size factor is proportional to the library size, scaled such that the average size factor across cells is one.
Advantage: normalised counts are on the same scale as the initial counts.
Compute size factors:
lib.sf <- librarySizeFactors(sce)
summary(lib.sf)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1141 0.4370 0.7413 1.0000 1.3005 8.6839
Size factor distribution: wide range, typical of scRNA-seq data.
dd <- data.frame("log10libSf"=log10(lib.sf))
ggplot(dd, aes(x=log10libSf)) +
  geom_histogram(bins=50)
Assumption: absence of compositional bias, i.e. differential expression between two cells is balanced, with upregulation of some genes matched by downregulation of others. In practice this balance is often not observed.
Inaccurate normalisation due to unaccounted-for composition bias mostly affects the size of the log fold changes measured between clusters, and much less the clustering itself. Library size normalisation is therefore usually sufficient to identify clusters and top marker genes.
Composition bias occurs when differential expression between two samples, or here cells, is not balanced. For a fixed library size, identical in both cells, upregulation of one gene in a cell means fewer UMIs can be assigned to the other genes, which would then appear downregulated. Even if library sizes are allowed to differ, with that of the cell with upregulation being higher, scaling normalisation will shrink its normalised counts, so the non-upregulated genes would also appear downregulated.
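A toy example with hypothetical counts may help: two cells with equal library sizes, where only one gene is truly upregulated in the second cell:
# hypothetical counts: 4 genes, 2 cells with equal library sizes (100 UMIs)
cellA <- c(geneA = 25, geneB = 25, geneC = 25, geneD = 25)
cellB <- c(geneA = 70, geneB = 10, geneC = 10, geneD = 10)
# after library size scaling the depths match, yet genes B-D appear
# downregulated in cellB although only geneA truly changed:
# its upregulation 'stole' UMIs from the other genes
rbind(cellA, cellB) / c(sum(cellA), sum(cellB)) * 100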
For bulk RNA-seq, composition bias is removed by assuming that most genes are not differentially expressed between samples, so that any systematic difference in counts for the non-DE genes reflects the bias and can be used to compute size factors. Given the sparsity of scRNA-seq data, these bulk methods are not directly applicable.
The deconvolution method below increases read counts by pooling cells into groups, computing size factors for each pool, and scaling them so they are comparable across clusters. The process is repeated many times, changing the pools each time, so as to collect several size factors for each cell, from which a single value is derived for that cell.
knitr::include_graphics("../Images/scran_Fig3.png", auto_pdf = TRUE)
Clusters of cells are first identified to help form sensible pools of cells; scaling factors are then computed within each cluster. The table below shows the number and size of the clusters found:
set.seed(100) # clusters with PCA from irlba with approximation
clust <- quickCluster(sce, BPPARAM=bpp) # slow with all cells.
table(clust)
## clust
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## 600 371 913 255 168 234 243 307 434 386 394 123 142 171 243 205 169 142
sce <- computePooledFactors(sce,
                            clusters = clust,
                            min.mean = 0.1,
                            BPPARAM = bpp)
deconv.sf <- sizeFactors(sce)
summary(deconv.sf)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01664 0.39553 0.73914 1.00000 1.32157 9.23472
# min.mean
# A numeric scalar specifying the minimum (library size-adjusted) average count of genes to be used for normalization.
Plot deconvolution size factors against library size factors:
sce <- addPerFeatureQC(sce, BPPARAM = bpp) # PATCH
deconvDf <- data.frame(lib.sf, deconv.sf,
                       "source_name" = sce$SampleGroup,
                       "sum" = sce$sum,
                       "mito_content" = sce$subsets_Mito_percent)
# colour by sample type
sp <- ggplot(deconvDf, aes(x=lib.sf, y=deconv.sf, col=source_name)) +
  geom_point()
sp
For each cell, raw counts are divided by the size factor for that cell and log-transformed, so that downstream analyses focus on genes with strong relative differences. We use scater::logNormCounts().
sce <- logNormCounts(sce) # adds logcounts
# check list of assays stored:
print(assayNames(sce))
## [1] "counts" "logcounts"
# write to file
saveRDS(sce, "../Robjects/caron_postDeconv_5hCellPerSpl.Rds")
Exercise: apply the deconvolution normalisation on a single sample: ETV6-RUNX1_1 (aka GSM3872434).
You first load the same object we loaded earlier, then select cells for SampleName ‘ETV6-RUNX1_1’. You will then cluster cells, compute and apply size factors.
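A minimal sketch of one possible solution (note that the freshly loaded object may still have the sample column named Sample rather than SampleName):
# load the post-QC object again and keep one sample
sce.one <- readRDS("../Robjects/Caron_filtered.rds")
sce.one <- sce.one[, sce.one$Sample == "ETV6-RUNX1_1"]
# cluster cells, then compute and apply size factors
set.seed(100)
clust.one <- quickCluster(sce.one)
sce.one <- computePooledFactors(sce.one, clusters = clust.one, min.mean = 0.1)
sce.one <- logNormCounts(sce.one)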
With scaling normalisation, a correlation remains between the mean and the variation of expression (heteroscedasticity). This affects downstream dimensionality reduction, as the first few dimensions are usually correlated with library size. sctransform addresses the issue by regressing library size out of the raw counts and providing residuals to use as normalised and variance-stabilised expression values in some downstream analyses, such as dimensionality reduction. We will follow the sctransform vignette.
We will first obtain the raw counts matrix:
# keep raw counts in a 'counts' variable:
counts <- counts(sce)
# check the class of the object
# expect a 'dgCMatrix': Compressed, sparse, column-oriented numeric matrices
# the “standard” class for sparse numeric matrices in the Matrix package
print(class(counts))
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
# check the dimensions of the object
print(dim(counts))
## [1] 17702 5500
# name columns (cells) with barcodes
colnames(counts) <- colData(sce)$Barcode
We will now calculate some properties of the data and visually inspect them. Our main interest is in the general trends, not in individual outliers: genes or cells that stand out are not important at this step.
Gene attributes include, for each gene: the mean expression, the detection rate (proportion of cells with a non-zero count) and the variance. Cell attributes include, for each cell: the total UMI count (n_umi, the library size) and the number of genes detected (n_gene).
# gene attributes:
# prepare a data frame named 'gene_attr' to keep gene attributes
# (for sparse matrices, rowVars() comes from MatrixGenerics/sparseMatrixStats)
gene_attr <- data.frame(mean = rowMeans(counts),
                        detection_rate = rowMeans(counts > 0),
                        var = rowVars(counts))
gene_attr$log_mean <- log10(gene_attr$mean)
gene_attr$log_var <- log10(gene_attr$var)
# name rows of the 'gene_attr' data frame:
rownames(gene_attr) <- rownames(counts)
# cell attributes:
cell_attr <- data.frame(n_umi = colSums(counts),
                        n_gene = colSums(counts > 0))
rownames(cell_attr) <- colnames(counts)
dim(gene_attr)
## [1] 17702 5
head(gene_attr)
dim(cell_attr)
## [1] 5500 2
head(cell_attr)
For the genes, on the log10 scale we can see that for low mean UMI counts the variance follows the line through the origin with slope one, i.e. variance and mean are roughly equal, as expected under a Poisson model. However, genes with a higher average UMI count show overdispersion compared to Poisson.
ggplot(gene_attr, aes(log_mean, log_var)) +
  geom_point(alpha=0.3, shape=16) +
  geom_density_2d(size = 0.3) +
  geom_abline(intercept = 0, slope = 1, color='red')
In line with the previous plot, we see a lower than expected detection rate in the medium expression range. However, for the highly expressed genes, the rate is at or very close to 1.0 suggesting that there is no zero-inflation in the counts for those genes and that zero-inflation is a result of overdispersion, rather than an independent systematic bias.
# add the expected detection rate under Poisson model
x = seq(from = -3, to = 2, length.out = 1000)
poisson_model <- data.frame(log_mean = x,
                            detection_rate = 1 - dpois(0, lambda = 10^x))
ggplot(gene_attr, aes(log_mean, detection_rate)) +
  geom_point(alpha=0.3, shape=16) +
  geom_line(data=poisson_model, color='red') +
  theme_gray(base_size = 8)
The plot below shows the relationship between the two cell attributes computed: library size (n_umi) and number of genes detected (n_gene).
ggplot(cell_attr, aes(n_umi, n_gene)) +
  geom_point(alpha=0.3, shape=16) +
  geom_density_2d(size = 0.3)
“Based on the observations above, which are not unique to this particular data set, we propose to model the expression of each gene as a negative binomial random variable with a mean that depends on other variables. Here the other variables can be used to model the differences in sequencing depth between cells and are used as independent variables in a regression model. In order to avoid overfitting, we will first fit model parameters per gene, and then use the relationship between gene mean and parameter values to fit parameters, thereby combining information across genes. Given the fitted model parameters, we transform each observed UMI count into a Pearson residual which can be interpreted as the number of standard deviations an observed count was away from the expected mean. If the model accurately describes the mean-variance relationship and the dependency of mean and latent factors, then the result should have mean zero and a stable variance across the range of expression.” sctransform vignette.
In short: the expression of each gene is modelled as a negative binomial whose mean depends on sequencing depth, with per-gene parameters regularised by pooling information across genes of similar abundance.
Assumptions: the negative binomial model accurately captures the mean-variance relationship and the dependence of counts on sequencing depth.
Outcome: Pearson residuals with mean zero and a stable variance across the range of expression, usable as variance-stabilised expression values.
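For reference, a sketch of the Pearson residual for a negative binomial model with fitted mean mu and dispersion theta (the NB variance being mu + mu^2/theta):
# Pearson residual: (observed - expected) / expected standard deviation,
# with the negative binomial variance mu + mu^2 / theta
pearson_residual <- function(x, mu, theta) {
    (x - mu) / sqrt(mu + mu^2 / theta)
}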
We will now estimate model parameters and transform data.
The vst() function estimates model parameters and performs the variance stabilising transformation. Here we use the log10 of the total UMI count of each cell as the variable representing sequencing depth. After the transformation we plot the model parameters as a function of gene mean (geometric mean).
print(dim(counts))
## [1] 17702 5500
# We use the Future API for parallel processing;
# set parameters here
future::plan(strategy = 'multicore', workers = 7)
options(future.globals.maxSize = 10 * 1024 ^ 3)
# transform counts:
set.seed(44) # for reproducibility
vst_out <- sctransform::vst(
  counts, # A matrix of UMI counts with genes as rows and cells as columns
  latent_var = c('log_umi'), # The independent variables to regress out, as a character vector
  return_gene_attr = TRUE, # Calculate gene attributes and include them in the output
  return_cell_attr = TRUE, # Keep the cell attributes as part of the output
  verbosity = 0 # An integer specifying what to show (0: nothing, 1: messages, 2: + progress bar)
)
# diagnostic plots: estimated and fitted model parameters
# by default parameters shown are:
# - intercept
# - latent variables, here log_umi
# - overdispersion factor (od_factor)
sctransform::plot_model_pars(
  vst_out,      # The output of a vst run
  verbosity = 1 # Messages only, no progress bar
)
Inspect model:
print(vst_out$model_str)
## [1] "y ~ log_umi"
We will look at several genes in more detail by plotting the observed UMI counts and model.
For each gene of interest, the plots show the observed UMI counts against the latent variable (here log_umi), the fitted model mean, and, with plot_residual = TRUE, the Pearson residuals.
We will look at two genes: ‘RPL10’ and ‘HBB’:
rowData(sce) %>%
  as.data.frame %>%
  dplyr::filter(Symbol %in% c('RPL10', 'HBB'))
ensId <- rowData(sce) %>%
  as.data.frame %>%
  dplyr::filter(Symbol %in% c('RPL10', 'HBB')) %>%
  pull("ID")
sctransform::plot_model(
  vst_out, # The output of a vst run
  counts,  # UMI count matrix
  ensId,   # Vector of genes to plot
  plot_residual = TRUE
)
The distribution of residual means is centered around 0:
ggplot(vst_out$gene_attr, aes(residual_mean)) +
  geom_histogram(binwidth=0.01)
The distribution of residual variance is centered around 1:
ggplot(vst_out$gene_attr, aes(residual_variance)) +
  geom_histogram(binwidth=0.1) +
  geom_vline(xintercept=1, color='red') +
  xlim(0, 10)
The following plot shows the residual variance against the gene mean: after transformation there is no remaining relationship between mean and variance.
ggplot(vst_out$gene_attr,
       aes(log10(gmean), residual_variance)) +
  geom_point(alpha=0.3, shape=16) +
  geom_density_2d(size = 0.3)
Check the genes with the largest residual variance. These are likely markers of expected cell populations. Note how they span a wide range of mean UMI and detection rate values.
dd <- vst_out$gene_attr %>%
  arrange(-residual_variance) %>%
  slice_head(n = 10) %>%
  mutate(across(where(is.numeric), ~ round(.x, 2)))
dd %>% tibble::rownames_to_column("ID") %>%
  left_join(as.data.frame(rowData(sce))[,c("ID", "Symbol")], "ID") %>%
  DT::datatable(rownames = FALSE)
Check transformed values:
print(dim(vst_out$y))
## [1] 17702 5500
vst_out$y[1:10,1:5]
## AAACGGGCAGTTCATG-1 AAACGGGGTTCACCTC-1 AAAGATGAGCGATGAC-1
## ENSG00000241860 -0.07892483 -0.08374454 -0.04408308
## ENSG00000237491 -0.18356914 -0.19572285 -0.09614831
## ENSG00000228794 -0.28473010 -0.30490626 -0.14212128
## ENSG00000225880 -0.09629389 -0.10225772 -0.05322317
## ENSG00000230368 -0.16560055 -0.17623320 -0.08866591
## ENSG00000230699 -0.05545929 -0.05852999 -0.03291700
## ENSG00000241180 -0.04399818 -0.04641039 -0.02628544
## ENSG00000188976 -0.53647019 0.79763316 -0.27543984
## ENSG00000187961 -0.08289801 -0.08799867 -0.04606763
## ENSG00000188290 -0.50146670 -0.53425834 -0.25682417
## AAAGATGCAGCCAATT-1 AAAGTAGCAATGCCAT-1
## ENSG00000241860 -0.07143280 -0.06993862
## ENSG00000237491 -0.16463296 -0.16085511
## ENSG00000228794 -0.25340328 -0.24717362
## ENSG00000225880 -0.08702076 -0.08517135
## ENSG00000230368 -0.14902109 -0.14571029
## ENSG00000230699 -0.05067514 -0.04971864
## ENSG00000241180 24.13478705 -0.03948979
## ENSG00000188976 -0.48149417 -0.47038188
## ENSG00000187961 -0.07497026 -0.07338945
## ENSG00000188290 -0.44955076 -0.43908874
Check SCE object:
sce
## class: SingleCellExperiment
## dim: 17702 5500
## metadata(1): Samples
## assays(2): counts logcounts
## rownames(17702): ENSG00000241860 ENSG00000237491 ... ENSG00000275063
## ENSG00000278817
## rowData names(9): ID Symbol ... detected gene_sparsity
## colnames(5500): 1_AAACGGGCAGTTCATG-1 1_AAACGGGGTTCACCTC-1 ...
## 12_TTTGTCAAGTGGTAGC-1 12_TTTGTCAGTACAGCAG-1
## colData names(16): SampleName Barcode ... sizeFactor cell_sparsity
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
print(assayNames(sce))
## [1] "counts" "logcounts"
Genes that are expressed in fewer than 5 cells are not used and not returned, so to add vst_out$y as an assay we need to remove the missing genes.
# genes expressed in fewer than 5 cells were not used by vst() and are absent
# from vst_out$y, so we drop them from the SCE before adding the assay
# https://github.com/ChristophH/sctransform/issues/27
geneOverlap <- rownames(sce) %in% rownames(vst_out$y)
if(!all(geneOverlap))
{
  table(rownames(sce) %in% rownames(vst_out$y))
  tmpInd <- which(rownames(sce) %in% rownames(vst_out$y))
  sce <- sce[tmpInd,]
  assayNames(sce)
}
sce
## class: SingleCellExperiment
## dim: 17702 5500
## metadata(1): Samples
## assays(2): counts logcounts
## rownames(17702): ENSG00000241860 ENSG00000237491 ... ENSG00000275063
## ENSG00000278817
## rowData names(9): ID Symbol ... detected gene_sparsity
## colnames(5500): 1_AAACGGGCAGTTCATG-1 1_AAACGGGGTTCACCTC-1 ...
## 12_TTTGTCAAGTGGTAGC-1 12_TTTGTCAGTACAGCAG-1
## colData names(16): SampleName Barcode ... sizeFactor cell_sparsity
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
vstMat <- as(vst_out$y[rownames(sce),], "dgCMatrix")
# when the 10X data were read in, a sample prefix was added to the sce colnames,
# so the vstMat colnames (bare barcodes) do not match the sce colnames.
# We therefore do not copy vstMat's colnames when adding it as an assay,
# but must first check that the barcodes are in the same order
# in sce and vstMat.
all(colnames(vstMat) == sce$Barcode)
## [1] TRUE
all(rownames(vstMat) == rownames(sce))
## [1] TRUE
assay(sce, "sctrans_norm", withDimnames=FALSE) <- vstMat
# write to file
saveRDS(sce, "../Robjects/caron_postSct_5hCellPerSpl.Rds")
Exercise: apply the sctransform normalisation on a single sample: ETV6-RUNX1_1 (aka GSM3872434).
In exercise 1, you have made a new SCE object with cells for SampleName ‘ETV6-RUNX1_1’. You will now inspect the mean-variance relationship and apply sctransform to that data.
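A minimal sketch of one possible approach, assuming the sce.one object from exercise 1 (the object name is hypothetical):
# raw counts for the single sample
counts.one <- counts(sce.one)
# inspect the mean-variance relationship as above, then transform
vst_out.one <- sctransform::vst(
  counts.one,
  latent_var = c('log_umi'),
  return_gene_attr = TRUE,
  return_cell_attr = TRUE,
  verbosity = 0
)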
knit_exit()