1 Normalisation

Acknowledgments: much of the material in this section hase been derived from the chapters on normalisation in the OSCA book and the Hemberg Group course materials.

library(scater)
library(scran)
library(sctransform)
library(tidyverse)
library(BiocParallel)
library(patchwork)
bpp <- MulticoreParam(7)

1.1 Load data object

For the purposes of this demonstration we have generated a smaller data set in which there are only 500 cells per sample. This is so that the code can be run in a reasonable amount of time during the live teaching session. The data were first QC’d and filtered as described in the QC and exploratory analysis session. After this 500 cells were selected at random from each sample.

sce <- readRDS("R_objects/Caron_filtered.500.rds")
sce

## class: SingleCellExperiment 
## dim: 33102 5500 
## metadata(1): Samples
## assays(1): counts
## rownames(33102): ENSG00000243485 ENSG00000238009 ... ENSG00000276017
##   ENSG00000278817
## rowData names(4): ID Symbol Type Chromosome
## colnames(5500): 1_CGACTTCGTCCAGTTA-1 1_AGAATAGCATACGCTA-1 ...
##   8_AGCGTCGAGATGTAAC-1 8_CATGACAAGATAGGAG-1
## colData names(10): Sample Barcode ... subsets_Mito_percent total
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):

table(sce$SampleName)

## 
## ETV6-RUNX1_1 ETV6-RUNX1_2 ETV6-RUNX1_3 ETV6-RUNX1_4        HHD_1        HHD_2 
##          500          500          500          500          500          500 
##      PBMMC_1      PBMMC_2      PBMMC_3      PRE-T_1      PRE-T_2 
##          500          500          500          500          500

1.2 Learning objectives

Understand why normalisation is required
Understand concepts of two normalisation methods
- deconvolution
- sctransform

1.3 Why normalise?

oneSamTab <- colData(sce) %>% 
  as.data.frame() %>% 
  filter(SampleName == "PBMMC_1") %>% 
  dplyr::select(SampleName,Barcode, sum) %>% 
  mutate(cell_num = 1:n())

p_before_nom <- ggplot(data=oneSamTab, aes(x=cell_num, y=sum)) +
  geom_bar(stat = 'identity') +
  labs( x= 'Cell Index',
        y='Cell UMI counts',
        title = "PBMMC_1: Before Normalization" ) +
  theme_classic() +
  theme(
    plot.title = element_text(hjust = 0.5, size=20, color = 'red')
  )

p_before_nom

Above plot shows the UMI counts/cell (transcript molecules) for 500 cell from the PBMMC_1 sample
UMI counts fluctuate
We derive biological insights downstream by comparing cells against each other.
But the UMI counts differences makes it harder to compare cells.
Why total transcript molecules (UMI counts) detected between cells differ?
- Biological:
  - Cell sub type differences, like size and transcription activity etc.
- Technical: scRNA data is inherently noisy
  - Low mRNA content per cell
  - cell-to-cell differences in mRNA capture efficiency
  - Variable sequencing depth
  - PCR amplification efficiency
A normalization technique makes the UMI counts distribution uniform, so that each cell has similar counts.
Normalization is a critical step that corrects cell-to-cell technical differences.
By normalizing, downstream comparisons of relative expression between cells are valid.

1.4 Normalization strategies

The sparse nature of scRNA data makes normalization difficult, unlike bulk RNAseq data.

Broadly two classes
1. Spike-in methods
- Uses pike-in controls for normalisation
- Not available for droplet based scRNA techniques like 10x.
1. Non-spike-in methods:
- Using available counts data for normalization
- DEseq2
- edgeR - TMM
- Library size normalization
- deconvolution
- sctransform
Typical normalization has two steps
1. scaling
- Estimate size or scaling or normalization factor: computation of a cell-specific ‘scaling’ or ‘size’ factor or “normalization factor” that represents the relative bias in that cell and division of all counts for the cell by that factor to remove that bias. Assumption: any cell specific bias will affect genes the same way.
- Scale the data by dividing the count for each gene with the appropriate size factor for that cell
1. Transformation
- log2
- Square root transformation
- Pearson residuals (eg. sctransform)

Scaling methods typically generate normalised counts-per-million (CPM) or transcripts-per-million (TPM) values that address the effect of sequencing depth. These values however typically have a variance that increases with their mean (heteroscedasticity) while most statistical methods assume a stable variance, which does not vary with the mean (homoscedasticity). A widely used ‘variance stabilising transformation’ is the log transformation (often log2). This works well for highly expressed genes (as in bulk RNA-seq) but less so for sparse scRNA-seq data.

* DEseq, edgeR-TMM and Library size normalization initially developed for bulk RNAseq * Applying these methods on scRNAseq data systematically under or over estimate size factors. i.e systematically deviate from true size factors. * This deviation is the result of removing zeroes prior to normalization. * Therefore other normalization methods specific to scRNAseq data like deconvolution, sctransform etc. were proposed.

1.5 Deconvolution

Because single-cell data tend to have a substantial number of low and zero counts, these bulk normalization methods may be problematic for single-cell data.

Deconvolution aims to normalize expression values based on summed values from pools of cells.
Since cell summation results in fewer zeros, the ensuing normalization is less susceptible to errors than existing methods.
The estimated size factors are only relevant to the pools of cells, even though normalization accuracy has improved.
Each pool’s size factor is deconvolved into its constituent cells’ size factors.

In order to avoid pooling cells with radically different transcriptomic profiles, the cells are first clustered based on gene expression. The pools are then formed exclusively with each cluster. Size factors are calculated within each cluster and are then scaled so they are comparable across clusters.

1.5.1 Cluster cells

The table below show the number and size of clusters found:

set.seed(100)
clust <- quickCluster(sce, BPPARAM=bpp)
table(clust)

## clust
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
## 604 387 258 567 143 138 858 136 245 579 242 131 125 128 141 128 128 139 146 277

1.5.2 Compute size factors

sce <- computePooledFactors(sce,
             clusters = clust,
             min.mean = 0.1,
             BPPARAM = bpp)
deconv.sf <- sizeFactors(sce)
summary(deconv.sf)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0160  0.4163  0.7316  1.0000  1.3418 12.3957

Note: min.mean - A numeric scalar specifying the minimum (library size-adjusted) average count of genes to be used for normalization. This means large numbers of very lowly expressed genes will not bias the normalization.

Plot deconvolution size factors against library size factors:

lib.sf <- librarySizeFactors(sce)
data.frame(LibrarySizeFactors = lib.sf, 
           DeconvolutionSizeFactors = deconv.sf,
                 SampleGroup = sce$SampleGroup) %>%
  ggplot(aes(x=LibrarySizeFactors, y=DeconvolutionSizeFactors)) +
      geom_point(aes(col=SampleGroup)) +
      geom_abline(slope = 1, intercept = 0)

1.5.3 Apply size factors

For each cell, raw counts for genes are divided by the size factor for that cell and log-transformed so downstream analyses focus on genes with strong relative differences. We use scater::logNormCounts().

sce <- logNormCounts(sce)
assayNames(sce)

## [1] "counts"    "logcounts"

norm_counts <- logNormCounts(sce,transform='none' ) %>% 
  assay('normcounts') %>% 
  as.matrix() %>% 
  colSums()
norm_counts <- tibble(Barcode=names(norm_counts),
                      normCounts = log2(norm_counts)
                      )
norm_counts <- inner_join(norm_counts, oneSamTab, by='Barcode')


p_after_norm <- ggplot(data=norm_counts, aes(x=cell_num, y=normCounts)) +
  geom_bar(stat = 'identity') +
  labs( x= 'Cell Index',
        y='Normalized Cell UMI counts',
        title = "PBMMC_1:After Normalization" ) +
  theme_classic() +
  theme(
    plot.title = element_text(hjust = 0.5, size=20, color = 'red')
  )

p_before_nom + p_after_norm

1.6 Save the normalised object

saveRDS(sce, "results/caron_normalised.rds")

1.7 Exercise

Apply the deconvolution normalisation on a single sample: ETV6-RUNX1_1.

You will first load the a single cell experiment object containing the entire Caron data set. First it is necessary to select only cells that came from the sample ‘ETV6-RUNX1_1’. You can then apply the normalization by deconvolution by clustering the cells, computing size factors and using these to log-normalize the counts.

2 sctransform: Variant Stabilising Transformation

With scaling normalisation a correlation remains between the mean and variation of expression (heteroskedasticity). This affects downstream dimensionality reduction as the few main new dimensions are usually correlated with library size. sctransform addresses the issue by regressing library size out of raw counts and providing residuals to use as normalized and variance-stabilized expression values in some downstream analyses, such as dimensionality reduction.

“Effect of scaling normalization”

The sctransform package is from the Seurat suite of scRNAseq analysis packages. Rather than convert our Single Cell Experiment object into a Seurat object and use the Seurat package’s command SCTransform, we will extract the counts matrix from our SCE object and run the variance stabilising transformation (VST) algorithm, using the sctranform package’s vst command, directly on the matrix. We can extract the counts matrix - as a dgCMatrix object sparse matrix - using the counts function.

counts <- counts(sce)
class(counts)

## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"

2.1 Rationale

In order to demonstrate the rationale behind the using the variance stabilising transformation, we will visually inspect various properties of our data. Our main interest is in the general trends not in individual outliers. Neither genes nor cells that stand out are important at this step; we focus on the global trends.

2.1.1 Derive gene and cell attributes from the UMI matrix

2.1.1.1 Gene attributes

Gene attributes include for each gene:

mean UMI count across cells
number of cells where the gene is detected
variance of UMI counts across cells
the mean and variance above on the log10 scale

gene_attr <- data.frame(mean = rowMeans(counts), 
                        detection_rate = rowMeans(counts > 0),
                        var = rowVars(counts)) %>% 
  mutate(log_mean = log10(mean)) %>% 
  mutate(log_var = log10(var))

dim(gene_attr)

## [1] 33102     5

head(gene_attr)

##                         mean detection_rate          var  log_mean   log_var
## ENSG00000243485 0.0001818182   0.0001818182 0.0001818182 -3.740363 -3.740363
## ENSG00000238009 0.0021818182   0.0021818182 0.0021774538 -2.661181 -2.662051
## ENSG00000239945 0.0001818182   0.0001818182 0.0001818182 -3.740363 -3.740363
## ENSG00000241860 0.0063636364   0.0061818182 0.0066879929 -2.196295 -2.174704
## ENSG00000286448 0.0000000000   0.0000000000 0.0000000000      -Inf      -Inf
## ENSG00000235146 0.0000000000   0.0000000000 0.0000000000      -Inf      -Inf

2.1.1.2 Cell attributes

Attributes include for each cell:

total UMI count across genes (library size)
number of genes detected (with at least 1 UMI)

cell_attr <- data.frame(n_umi = colSums(counts),
                        n_gene = colSums(counts > 0))

dim(cell_attr)

## [1] 5500    2

head(cell_attr)

##                      n_umi n_gene
## 1_CGACTTCGTCCAGTTA-1  4557   2024
## 1_AGAATAGCATACGCTA-1  7745   2695
## 1_TGACTAGAGAACTCGG-1  7181   2617
## 1_CTTAACTGTTATGCGT-1  1704   1081
## 1_CCCAGTTTCAAGCCTA-1  6914   2570
## 1_TACAGTGCACGACGAA-1 11510   4070

2.1.2 Mean-variance relationship

For the genes, on the log10 scale we can see that up to a mean UMI count of 0 the variance follows the line through the origin with slope one, i.e. variance and mean are roughly equal as expected under a Poisson model. However, genes with a higher average UMI count show overdispersion compared to Poisson.

ggplot(gene_attr, aes(log_mean, log_var)) + 
  geom_point(alpha=0.3, shape=16) +
  geom_abline(intercept = 0, slope = 1, color='red')

2.1.3 Mean-detection-rate relationship

In line with the previous plot, we see a lower than expected detection rate in the medium expression range. However, for the highly expressed genes, the rate is at or very close to 1.0 suggesting that there is no zero-inflation in the counts for those genes and that zero-inflation is a result of overdispersion, rather than an independent systematic bias.

x = seq(from = -3, to = 2, length.out = 1000)
poisson_model <- data.frame(log_mean = x,
                detection_rate = 1 - dpois(0, lambda = 10^x))

ggplot(gene_attr, aes(log_mean, detection_rate)) + 
  geom_point(alpha=0.3, shape=16) + 
  geom_line(data=poisson_model, color='red') +
  theme_gray(base_size = 8)

2.1.4 Cell attributes

The plot below shows the relationship between the two cell attributes computed: library size (n_umi) and number of genes detected (n_gene).

ggplot(cell_attr, aes(n_umi, n_gene)) + 
  geom_point(alpha=0.3, shape=16) + 
  geom_density_2d(size = 0.3)

2.2 Method

From the sctransform vignette: “Based on the observations above, which are not unique to this particular data set, we propose to model the expression of each gene as a negative binomial random variable with a mean that depends on other variables. Here the other variables can be used to model the differences in sequencing depth between cells and are used as independent variables in a regression model. In order to avoid overfitting, we will first fit model parameters per gene, and then use the relationship between gene mean and parameter values to fit parameters, thereby combining information across genes. Given the fitted model parameters, we transform each observed UMI count into a Pearson residual which can be interpreted as the number of standard deviations an observed count was away from the expected mean. If the model accurately describes the mean-variance relationship and the dependency of mean and latent factors, then the result should have mean zero and a stable variance across the range of expression.”

In short:

expression of a gene is modeled by a negative binomial random variable with a mean that depends on library size
library size is used as the independent variable in a regression model
the model is fit for each gene, then combined data across genes is used to fit parameters
convert UMI counts to residuals akin to the number of standard deviations away from the expected mean.

Assumptions:

accurate model of the mean-variance relationship
accurate model of the dependency of mean and latent factors

Outcome:

the mean of the transformed data (residuals) is zero
stable variance across expression range

2.3 Application

2.3.1 Estimation and transformation

We will now estimate model parameters and transform data.

The vst function estimates model parameters and performs the variance stabilizing transformation.

Here we use the log10 of the total UMI counts of a cell as variable for sequencing depth for each cell. After data transformation we plot the model parameters as a function of gene mean (geometric mean). We will set the following arguments:

umi - The matrix of UMI counts with genes as rows and cells as columns
latent_var - The independent variables to regress out as a character vector
return_gene_attr - Make cell attributes part of the output
return_cell_attr - Calculate gene attributes and make part of output

set.seed(44)
vst_out <- vst(umi = counts,
               latent_var = c('log_umi'),
               return_gene_attr = TRUE,
               return_cell_attr = TRUE
               
  )

2.3.2 Parameter plots

We will generate some diagnostic plots in order to inspect the estimated and fitted model parameters.

By default parameters shown are:

intercept
latent variables, here log_umi
overdispersion factor (od_factor)

plot_model_pars(vst_out)

We check the regression model used is the one the we intended:

vst_out$model_str

## [1] "y ~ log_umi"

We will now look at several genes in more detail by plotting observed UMI counts and comparing these to plots using the residuals from the modelling.

For each gene of interest, we will plot:

the observed cell attribute (UMI counts) against the latent variable (library size) (by default), with the fitted model as a pink line showing the expected UMI counts given the model and a shaded region spanning one standard deviation from the expected value.
the residuals against the latent variable in the same way.

We will look at two genes: ‘RPL10’ and ‘HBB’:

##                              ID Symbol            Type Chromosome
## ENSG00000244734 ENSG00000244734    HBB Gene Expression      chr11
## ENSG00000147403 ENSG00000147403  RPL10 Gene Expression       chrX

ensId <- rowData(sce) %>%
    as.data.frame %>%
    filter(Symbol %in% c('RPL10', 'HBB')) %>%
  pull("ID")

plot_model(x = vst_out,
           umi = counts,
           goi = ensId,
           plot_residual = TRUE)

2.3.3 Overall properties of transformed data

The distribution of residual mean is centered around 0:

ggplot(vst_out$gene_attr, aes(x = residual_mean)) +
    geom_histogram(binwidth=0.01)

The distribution of residual variance is centered around 1:

ggplot(vst_out$gene_attr, aes(residual_variance)) +
    geom_histogram(binwidth=0.1) +
    geom_vline(xintercept=1, color='red') +
    xlim(0, 10)

Plotting the residual variance against the mean shows that after transformation there is no relationship between gene mean and variance.

ggplot(vst_out$gene_attr, aes(x = log10(gmean), y = residual_variance)) +
       geom_point(alpha=0.3, shape=16)

Check genes with large residual variance. These genes would be markers of expected cell populations. Note how they represent a great range of mean UMI and detection rate values.

vst_out$gene_attr %>%
  arrange(desc(residual_variance)) %>% 
    top_n(n = 10) %>%
    mutate(across(where(is.numeric), round, 2)) %>% 
  rownames_to_column("ID") %>%
  left_join(as.data.frame(rowData(sce))[,c("ID", "Symbol")], "ID")

##                 ID detection_rate gmean  amean  variance residual_mean
## 1  ENSG00000244734           0.67  3.49 203.68 651201.82          5.37
## 2  ENSG00000206172           0.46  1.58  55.13  46412.66          5.06
## 3  ENSG00000188536           0.51  1.96  77.43  91150.75          5.15
## 4  ENSG00000223609           0.21  0.51   7.31   1218.72          3.16
## 5  ENSG00000090382           0.11  0.25   3.38    434.14          2.21
## 6  ENSG00000206177           0.15  0.35   4.03    475.97          2.26
## 7  ENSG00000143546           0.13  0.24   2.72    321.30          1.83
## 8  ENSG00000163220           0.13  0.24   2.65    315.11          1.77
## 9  ENSG00000169877           0.15  0.36   3.66    362.73          2.09
## 10 ENSG00000133742           0.14  0.30   3.20    474.86          1.63
##    residual_variance Symbol
## 1             368.64    HBB
## 2             362.54   HBA1
## 3             360.03   HBA2
## 4             211.97    HBD
## 5             169.10    LYZ
## 6             143.38    HBM
## 7             143.29 S100A8
## 8             133.37 S100A9
## 9             120.70   AHSP
## 10            103.17    CA1

2.4 Storage of the VST transformed data in the SCE object

In order to store the transformed values in our Single Cell object, we need to add them as a new “assay”. The transformed values are kept as a matrix in the y object within vst_out.

Note that, by default, genes that are expressed in fewer than 5 cells are not used by vst and results for these genes are not returned, so to add vst_out$y as an assay in our single cell object we may need to subset the rows of our sce object to match the rows of vst_out$y. In our case, about 10,000 genes were expressed in less than 5 cells, so we will need to subset our SCE object before adding the VST normalised counts.

keepGenes <- rownames(sce)%in%rownames(vst_out$y)
sce <- sce[keepGenes,]
vstMat <- as(vst_out$y[rownames(sce),], "dgCMatrix")

assay(sce, "sctrans_norm", withDimnames=FALSE) <- vstMat

2.5 Exercise

Using a single sample - ETV6-RUNX1_1 - inspect the mean-variance relationship and apply sctransform to that data.

3 Session information

sessionInfo()

## R version 4.2.0 (2022-04-22)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
##  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
##  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] patchwork_1.1.2             BiocParallel_1.30.3        
##  [3] forcats_0.5.2               stringr_1.4.1              
##  [5] dplyr_1.0.10                purrr_0.3.4                
##  [7] readr_2.1.2                 tidyr_1.2.0                
##  [9] tibble_3.1.8                tidyverse_1.3.2            
## [11] sctransform_0.3.4           scran_1.24.0               
## [13] scater_1.24.0               ggplot2_3.3.6              
## [15] scuttle_1.6.3               SingleCellExperiment_1.18.0
## [17] SummarizedExperiment_1.26.1 Biobase_2.56.0             
## [19] GenomicRanges_1.48.0        GenomeInfoDb_1.32.3        
## [21] IRanges_2.30.1              S4Vectors_0.34.0           
## [23] BiocGenerics_0.42.0         MatrixGenerics_1.8.1       
## [25] matrixStats_0.62.0          knitr_1.40                 
## 
## loaded via a namespace (and not attached):
##   [1] googledrive_2.0.0         ggbeeswarm_0.6.0         
##   [3] colorspace_2.0-3          ellipsis_0.3.2           
##   [5] bluster_1.6.0             XVector_0.36.0           
##   [7] BiocNeighbors_1.14.0      fs_1.5.2                 
##   [9] rstudioapi_0.14           farver_2.1.1             
##  [11] listenv_0.8.0             ggrepel_0.9.1            
##  [13] fansi_1.0.3               lubridate_1.8.0          
##  [15] xml2_1.3.3                codetools_0.2-18         
##  [17] sparseMatrixStats_1.8.0   cachem_1.0.6             
##  [19] jsonlite_1.8.0            broom_1.0.1              
##  [21] cluster_2.1.3             dbplyr_2.2.1             
##  [23] compiler_4.2.0            httr_1.4.4               
##  [25] dqrng_0.3.0               backports_1.4.1          
##  [27] assertthat_0.2.1          Matrix_1.4-1             
##  [29] fastmap_1.1.0             gargle_1.2.0             
##  [31] limma_3.52.2              cli_3.3.0                
##  [33] BiocSingular_1.12.0       htmltools_0.5.3          
##  [35] tools_4.2.0               rsvd_1.0.5               
##  [37] igraph_1.3.4              gtable_0.3.1             
##  [39] glue_1.6.2                GenomeInfoDbData_1.2.8   
##  [41] reshape2_1.4.4            Rcpp_1.0.9               
##  [43] cellranger_1.1.0          jquerylib_0.1.4          
##  [45] vctrs_0.4.1               DelayedMatrixStats_1.18.0
##  [47] xfun_0.32                 globals_0.16.1           
##  [49] rvest_1.0.3               beachmat_2.12.0          
##  [51] lifecycle_1.0.1           irlba_2.3.5              
##  [53] statmod_1.4.37            googlesheets4_1.0.1      
##  [55] future_1.27.0             edgeR_3.38.4             
##  [57] zlibbioc_1.42.0           MASS_7.3-57              
##  [59] scales_1.2.1              hms_1.1.2                
##  [61] parallel_4.2.0            yaml_2.3.5               
##  [63] gridExtra_2.3             sass_0.4.2               
##  [65] stringi_1.7.8             highr_0.9                
##  [67] ScaledMatrix_1.4.0        rlang_1.0.5              
##  [69] pkgconfig_2.0.3           bitops_1.0-7             
##  [71] evaluate_0.16             lattice_0.20-45          
##  [73] labeling_0.4.2            tidyselect_1.1.2         
##  [75] parallelly_1.32.1         plyr_1.8.7               
##  [77] magrittr_2.0.3            R6_2.5.1                 
##  [79] generics_0.1.3            metapod_1.4.0            
##  [81] DelayedArray_0.22.0       DBI_1.1.3                
##  [83] pillar_1.8.1              haven_2.5.1              
##  [85] withr_2.5.0               RCurl_1.98-1.8           
##  [87] future.apply_1.9.0        crayon_1.5.1             
##  [89] modelr_0.1.9              utf8_1.2.2               
##  [91] tzdb_0.3.0                rmarkdown_2.16           
##  [93] viridis_0.6.2             isoband_0.2.5            
##  [95] readxl_1.4.1              locfit_1.5-9.6           
##  [97] grid_4.2.0                reprex_2.0.2             
##  [99] digest_0.6.29             munsell_0.5.0            
## [101] beeswarm_0.4.0            viridisLite_0.4.1        
## [103] vipor_0.4.5               bslib_0.4.0

Introduction to single-cell RNA-seq analysis

Normalisation