In the demonstration we ran the normalisation using just 500 cells per sample. For this practical you will carry out normalisation, but this time using the all of the cells in one sample - ETV6-RUNX_1. Some of the commands will take a little longer to run this time.
library(scater)
library(scran)
library(tidyverse)
library(BiocParallel)
bpp <- MulticoreParam(7)
First we will load the full filtered data set. This data was previously filtered to remove poor quality cells.
sce <- readRDS("R_objects/Caron_filtered.full.rds")
Now we need to extract just the cells for ETV6-RUNX_1.
etvr1 <- which(sce$SampleName=="ETV6-RUNX1_1")
sce <- sce[, etvr1]
Now that we have reduced the number of cells in the data set, it may be that there are genes in the object that have not been detected in any of the remaining cells. Filter the object to remove any genes that have not been detected.
Any gene that has not been detected will have total UMI count across
all cells of zero. You can use rowSums
on the
counts
matrix of the single cell object to determine the
total counts for each each gene. Then keep only rows with total counts
greater than 0.
detected_genes <- rowSums(counts(sce)) > 0
sce <- sce[detected_genes,]
Q1. How many cells and genes are there in the ETV6_RUNX1_1 data set?
dim(sce)
## [1] 26248 2773
or
sce
## class: SingleCellExperiment
## dim: 26248 2773
## metadata(1): Samples
## assays(1): counts
## rownames(26248): ENSG00000238009 ENSG00000241860 ... ENSG00000276345
## ENSG00000278817
## rowData names(4): ID Symbol Type Chromosome
## colnames(2773): 1_AAACCTGAGACTTTCG-1 1_AAACCTGGTCTTCAAG-1 ...
## 1_TTTGTCAAGGACGAAA-1 1_TTTGTCAGTTCGGCAC-1
## colData names(10): Sample Barcode ... subsets_Mito_percent total
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
There are 2773 cells remaining and 26248 genes detected.
Q2. In how many cells has the gene ORC1 been detected with at least 1 UMI?
You will need to search the rowData
of the single cell
experiment object to determine which row of the counts
matrix contains the data for the gene ORC1.
You can either use the row name (in this case the Ensembl ID) or the row number to retrieve the counts for ORC1 from the counts matrix. You can then determine how many of these are greater than 0.
orc1_row <- which(rowData(sce)$Symbol == "ORC1")
sum(counts(sce)[orc1_row, ] > 0)
## [1] 164
Now normalise the data set using the deconvolution method. You will need to
These are the commands you will need to use for each step:
quickCluster
computePooledFactors
logNormCounts
assayNames
set.seed(100)
clust <- quickCluster(sce, BPPARAM = bpp)
sce <- computePooledFactors(sce,
clusters = clust,
min.mean = 0.1,
BPPARAM = bpp)
sce <- logNormCounts(sce)
assayNames(sce)
## [1] "counts" "logcounts"
As in the demonstration, we will extract the counts matrix from our
SCE object and run the variance stabilising transformation (VST)
algorithm, using the sctranform
package’s vst
command, directly on the matrix.
counts <- counts(sce)
Now use the vst
function on the counts matrix to
estimate model parameters and perform the variance stabilizing
transformation. Call the newly created object vst_out
.
You will need to provide 4 arguments:
umi
- The matrix of UMI counts with genes as rows and
cells as columnslatent_var
- The independent variables to regress out
as a character vectorreturn_gene_attr
- set to ‘TRUE’return_cell_attr
- set to ‘TRUE’set.seed(44)
vst_out <- vst(umi = counts,
latent_var = 'log_umi',
return_gene_attr = TRUE,
return_cell_attr = TRUE
)
Check that the model used is the one we want (regressing out the effect of library size).
print(vst_out$model_str)
## [1] "y ~ log_umi"
Use the plot_model
function to inspect the genes ‘RPL10’
and ‘FTL’ to see if the fitted relationship between cell total UMI and
gene expression, and to check the residuals.
First you will need to retrieve the Ensembl IDs for each of the genes
from the rowData
of the original sce object.
The plot_model
function requires four arguments:
x
- the output of vst
umi
- the counts matrix that vst
was run
ongoi
- the IDs of the genes
of interest - these should match the
rownames of the original counts matrix. In our case these should be
Ensembl IDSplot_residual
- set this to TRUE
ensId <- rowData(sce) %>%
as.data.frame %>%
filter(Symbol %in% c('RPL10', 'FTL')) %>%
pull("ID")
plot_model(x = vst_out,
umi = counts,
goi = ensId,
plot_residual = TRUE)
The distribution of residual mean should be centered around 0.
ggplot(vst_out$gene_attr, aes(residual_mean)) +
geom_histogram(binwidth=0.01)
The distribution of residual variance should be centered around 1.
ggplot(vst_out$gene_attr, aes(residual_variance)) +
geom_histogram(binwidth=0.1) +
geom_vline(xintercept=1, color='red') +
xlim(0, 10)
Finally, check that the relationship between variance and mean has been removed.
ggplot(vst_out$gene_attr,
aes(log10(gmean), residual_variance)) +
geom_point(alpha=0.3, shape=16)