suppressMessages(library(ggplot2))
suppressMessages(library(scater))
suppressMessages(library(scran))
projDir <- "/mnt/scratcha/bioinformatics/baller01/20200511_FernandesM_ME_crukBiSs2020"
outDirBit <- "AnaWiSce/Attempt1"
fontsize <- theme(axis.text=element_text(size=12), axis.title=element_text(size=16))
We will load the R file keeping the SCE object with the normalised counts for 500 cells per sample.
setName <- "caron"
setSuf <- "_5hCellPerSpl"
# Read object in:
tmpFn <- sprintf("%s/%s/Robjects/%s_sce_nz_postDeconv%s_dimRed.Rds", projDir, outDirBit, setName, setSuf)
print(tmpFn)
[1] "/mnt/scratcha/bioinformatics/baller01/20200511_FernandesM_ME_crukBiSs2020/AnaWiSce/Attempt1/Robjects/caron_sce_nz_postDeconv_5hCellPerSpl_dimRed.Rds"
if(!file.exists(tmpFn))
{
knitr::knit_exit()
}
sce <- readRDS(tmpFn)
sce
class: SingleCellExperiment
dim: 18372 5500
metadata(0):
assays(2): counts logcounts
rownames(18372): ENSG00000238009 ENSG00000237491 ... ENSG00000275063
ENSG00000271254
rowData names(11): ensembl_gene_id external_gene_name ... detected
gene_sparsity
colnames: NULL
colData names(20): Sample Barcode ... cell_sparsity sizeFactor
reducedDimNames(3): PCA TSNE UMAP
altExpNames(0):
head(rowData(sce))
DataFrame with 6 rows and 11 columns
ensembl_gene_id external_gene_name chromosome_name
<character> <character> <character>
ENSG00000238009 ENSG00000238009 AL627309.1 1
ENSG00000237491 ENSG00000237491 LINC01409 1
ENSG00000225880 ENSG00000225880 LINC00115 1
ENSG00000230368 ENSG00000230368 FAM41C 1
ENSG00000230699 ENSG00000230699 AL645608.2 1
ENSG00000188976 ENSG00000188976 NOC2L 1
start_position end_position strandNum Symbol
<integer> <integer> <integer> <character>
ENSG00000238009 89295 133723 -1 AL627309.1
ENSG00000237491 778747 810065 1 AL669831.5
ENSG00000225880 826206 827522 -1 LINC00115
ENSG00000230368 868071 876903 -1 FAM41C
ENSG00000230699 911435 914948 1 AL645608.3
ENSG00000188976 944203 959309 -1 NOC2L
Type mean detected gene_sparsity
<character> <numeric> <numeric> <numeric>
ENSG00000238009 Gene Expression 0.000815739 0.0815739 0.999256
ENSG00000237491 Gene Expression 0.033200190 3.1966535 0.979330
ENSG00000225880 Gene Expression 0.013530022 1.3200511 0.985855
ENSG00000230368 Gene Expression 0.019421026 1.8757987 0.981826
ENSG00000230699 Gene Expression 0.000642947 0.0638929 0.998270
ENSG00000188976 Gene Expression 0.168154822 13.2828888 0.831308
#any(duplicated(rowData(sce)$ensembl_gene_id))
# some function(s) used below complain about 'strand' already being used in row data,
# so rename that column now:
colnames(rowData(sce))[colnames(rowData(sce)) == "strand"] <- "strandNum"
scRNASeq measures the expression of thousands of genes in each cell. The biological question asked in a study will most often relate to a fraction of these genes only, linked for example to differences between cell types, drivers of differentiation, or response to perturbation.
Most high-throughput molecular data include variation created by the assay itself, not biology, i.e. technical noise, for example caused by sampling during RNA capture and library preparation. In scRNASeq, this technical noise will result in most genes being detected at different levels. This noise may hinder the detection of the biological signal.
Let’s identify Highly Variables Genes (HVGs) with the aim to find those underlying the heterogeneity observed across cells.
Some assays allow the inclusion of known molecules in a known amount covering a wide range, from low to high abundance: spike-ins. The technical noise is assessed based on the amount of spike-ins used, the corresponding read counts obtained and their variation across cells. The variance in expression can then be decomposed into the biolgical and technical components.
UMI-based assays do not (yet?) allow spike-ins. But one can still identify HVGs, that is genes with the highest biological component. Assuming that expression does not vary across cells for most genes, the total variance for these genes mainly reflects technical noise. The latter can thus be assessed by fitting a trend to the variance in expression. The fitted value will be the estimate of the technical component.
Let’s fit a trend to the variance, using modelGeneVar().
dec.sce <- modelGeneVar(sce)
Let’s plot variance against mean of expression (log scale) and the mean-dependent trend fitted to the variance:
# Visualizing the fit:
var.fit <- metadata(dec.sce)
plot(var.fit$mean, var.fit$var, xlab="Mean of log-expression",
ylab="Variance of log-expression")
curve(var.fit$trend(x), col="dodgerblue", add=TRUE, lwd=2)
In the absence of spike-in data, one can attempt to create a trend by making some distributional assumptions about the noise. For example, UMI counts typically exhibit near-Poisson variation if we only consider technical noise from library preparation and sequencing. This can be used to construct a mean-variance trend in the log-counts with the modelGeneVarByPoisson() function. Note the increased residuals of the high-abundance genes, which can be interpreted as the amount of biological variation that was assumed to be “uninteresting” when fitting the gene-based trend above.
set.seed(0010101)
dec.pois.sce <- modelGeneVarByPoisson(sce)
dec.pois.sce <- dec.pois.sce[order(dec.pois.sce$bio, decreasing=TRUE),]
head(dec.pois.sce)
DataFrame with 6 rows and 6 columns
mean total tech bio p.value FDR
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
ENSG00000244734 2.487368 11.39855 0.913036 10.48551 0 0
ENSG00000188536 1.848790 9.16065 1.020994 8.13966 0 0
ENSG00000206172 1.585734 8.27191 1.007091 7.26482 0 0
ENSG00000019582 2.929866 4.40937 0.781826 3.62755 0 0
ENSG00000223609 0.770139 4.24159 0.691198 3.55040 0 0
ENSG00000204287 2.404310 3.75521 0.934058 2.82115 0 0
Plot:
plot(dec.pois.sce$mean, dec.pois.sce$total, pch=16, xlab="Mean of log-expression",
ylab="Variance of log-expression")
curve(metadata(dec.pois.sce)$trend(x), col="dodgerblue", add=TRUE)
Interestingly, trends based purely on technical noise tend to yield large biological components for highly-expressed genes. This often includes so-called “house-keeping” genes coding for essential cellular components such as ribosomal proteins, which are considered uninteresting for characterizing cellular heterogeneity. These observations suggest that a more accurate noise model does not necessarily yield a better ranking of HVGs, though one should keep an open mind - house-keeping genes are regularly DE in a variety of conditions (Glare et al. 2002; Nazari, Parham, and Maleki 2015; Guimaraes and Zavolan 2016), and the fact that they have large biological components indicates that there is strong variation across cells that may not be completely irrelevant.
Identify the top 20 HVGs by sorting genes in decreasing order of biological component.
# order genes by decreasing order of biological component
o <- order(dec.sce$bio, decreasing=TRUE)
# check top and bottom of sorted table
head(dec.sce[o,])
DataFrame with 6 rows and 6 columns
mean total tech bio p.value
<numeric> <numeric> <numeric> <numeric> <numeric>
ENSG00000244734 2.487368 11.39855 1.463827 9.93472 0.00000e+00
ENSG00000188536 1.848790 9.16065 1.362805 7.79785 0.00000e+00
ENSG00000206172 1.585734 8.27191 1.280827 6.99108 2.64879e-304
ENSG00000223609 0.770139 4.24159 0.787929 3.45367 4.21240e-197
ENSG00000019582 2.929866 4.40937 1.404771 3.00460 1.32522e-48
ENSG00000206177 0.572339 3.18102 0.616104 2.56491 4.89490e-178
FDR
<numeric>
ENSG00000244734 0.00000e+00
ENSG00000188536 0.00000e+00
ENSG00000206172 1.61691e-300
ENSG00000223609 1.92854e-193
ENSG00000019582 7.35416e-46
ENSG00000206177 1.49401e-174
tail(dec.sce[o,])
DataFrame with 6 rows and 6 columns
mean total tech bio p.value FDR
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
ENSG00000232112 1.74745 1.047294 1.33227 -0.284978 0.927926 1
ENSG00000115268 3.91154 0.978745 1.26887 -0.290126 0.940762 1
ENSG00000140264 2.21495 1.128841 1.45721 -0.328371 0.938052 1
ENSG00000173812 3.13508 1.033705 1.37027 -0.336568 0.953237 1
ENSG00000174748 3.90858 0.924135 1.26938 -0.345243 0.968348 1
ENSG00000149806 2.95350 1.007274 1.40060 -0.393324 0.972410 1
# choose the top 20 genes with the highest biological component
chosen.genes.index <- o[1:20]
Show the top 20 HVGs on the plot displaying the variance against the mean expression:
plot(var.fit$mean, var.fit$var)
curve(var.fit$trend(x), col="red", lwd=2, add=TRUE)
points(var.fit$mean[chosen.genes.index], var.fit$var[chosen.genes.index], col="orange")
Rather than choosing a fixed number of top genes, one may define ‘HVGs’ as genes with a positive biological component, ie whose variance is higher than the fitted value for the corresponding mean expression.
Select and show these ‘HVGs’ on the plot displaying the variance against the mean expression:
hvgBool <- dec.sce$bio > 0
table(hvgBool)
hvgBool
FALSE TRUE
8406 9966
hvg.index <- which(hvgBool)
plot(var.fit$mean, var.fit$var)
curve(var.fit$trend(x), col="red", lwd=2, add=TRUE)
points(var.fit$mean[hvg.index], var.fit$var[hvg.index], col="orange")
Save objects to file
tmpFn <- sprintf("%s/%s/Robjects/%s_sce_nz_postDeconv%s_featSel.Rds", projDir, outDirBit, setName, setSuf)
saveRDS(list("dec.sce"=dec.sce,"hvg.index"=hvg.index), tmpFn)
HVGs may be driven by outlier cells. So let’s plot the distribution of expression values for the genes with the largest biological components.
First, get gene names to replace ensembl IDs on plot.
# the count matrix rows are named with ensembl gene IDs. Let's label gene with their name instead:
# row indices of genes in rowData(sce)
tmpInd <- which(rowData(sce)$ensembl_gene_id %in% rownames(dec.sce)[chosen.genes.index])
# check:
rowData(sce)[tmpInd,c("ensembl_gene_id","external_gene_name")]
DataFrame with 20 rows and 2 columns
ensembl_gene_id external_gene_name
<character> <character>
ENSG00000163220 ENSG00000163220 S100A9
ENSG00000143546 ENSG00000143546 S100A8
ENSG00000211592 ENSG00000211592 IGKC
ENSG00000170180 ENSG00000170180 GYPA
ENSG00000019582 ENSG00000019582 CD74
... ... ...
ENSG00000206177 ENSG00000206177 HBM
ENSG00000188536 ENSG00000188536 HBA2
ENSG00000206172 ENSG00000206172 HBA1
ENSG00000169877 ENSG00000169877 AHSP
ENSG00000090013 ENSG00000090013 BLVRB
# store names:
tmpName <- rowData(sce)[tmpInd,"external_gene_name"]
# the gene name may not be known, so keep the ensembl gene ID in that case:
tmpName[tmpName==""] <- rowData(sce)[tmpInd,"ensembl_gene_id"][tmpName==""]
tmpName[is.na(tmpName)] <- rowData(sce)[tmpInd,"ensembl_gene_id"][is.na(tmpName)]
rm(tmpInd)
Now show a violin plot for each gene, using plotExpression() and label genes with their name:
g <- plotExpression(sce, rownames(dec.sce)[chosen.genes.index],
point_alpha=0.05, jitter="jitter") + fontsize
g <- g + scale_x_discrete(breaks=rownames(dec.sce)[chosen.genes.index],
labels=tmpName)
g
Another few genes:
chosen.genes.index <- o[21:40]
tmpInd <- which(rowData(sce)$ensembl_gene_id %in% rownames(dec.sce)[chosen.genes.index])
# check:
rowData(sce)[tmpInd,c("ensembl_gene_id","external_gene_name")]
DataFrame with 20 rows and 2 columns
ensembl_gene_id external_gene_name
<character> <character>
ENSG00000117632 ENSG00000117632 STMN1
ENSG00000177606 ENSG00000177606 JUN
ENSG00000196154 ENSG00000196154 S100A4
ENSG00000145335 ENSG00000145335 SNCA
ENSG00000250361 ENSG00000250361 GYPB
... ... ...
ENSG00000013306 ENSG00000013306 SLC25A39
ENSG00000007312 ENSG00000007312 CD79B
ENSG00000171223 ENSG00000171223 JUNB
ENSG00000167815 ENSG00000167815 PRDX2
ENSG00000169575 ENSG00000169575 VPREB1
# store names:
tmpName <- rowData(sce)[tmpInd,"external_gene_name"]
# the gene name may not be known, so keep the ensembl gene ID in that case:
tmpName[tmpName==""] <- rowData(sce)[tmpInd,"ensembl_gene_id"][tmpName==""]
tmpName[is.na(tmpName)] <- rowData(sce)[tmpInd,"ensembl_gene_id"][is.na(tmpName)]
rm(tmpInd)
g <- plotExpression(sce, rownames(dec.sce)[chosen.genes.index],
point_alpha=0.05, jitter="jitter") + fontsize
g <- g + scale_x_discrete(breaks=rownames(dec.sce)[chosen.genes.index],
labels=tmpName)
g
Challenge: Show violin plots for the 20 genes with the lowest biological component. How do they compare to the those for HVGs chosen above?
chosen.genes.index.tmp <- order(dec.sce$bio, decreasing=FALSE)[1:20]
tmpInd <- (which(rowData(sce)$ensembl_gene_id %in% rownames(dec.sce)[chosen.genes.index.tmp]))
# check:
rowData(sce)[tmpInd,c("ensembl_gene_id","external_gene_name")]
DataFrame with 20 rows and 2 columns
ensembl_gene_id external_gene_name
<character> <character>
ENSG00000142676 ENSG00000142676 RPL11
ENSG00000174748 ENSG00000174748 RPL15
ENSG00000232112 ENSG00000232112 TMA7
ENSG00000104529 ENSG00000104529 EEF1D
ENSG00000161016 ENSG00000161016 RPL8
... ... ...
ENSG00000105640 ENSG00000105640 RPL18A
ENSG00000221983 ENSG00000221983 UBA52
ENSG00000063177 ENSG00000063177 RPL18
ENSG00000142534 ENSG00000142534 RPS11
ENSG00000170889 ENSG00000170889 RPS9
# store names:
tmpName <- rowData(sce)[tmpInd,"external_gene_name"]
# the gene name may not be known, so keep the ensembl gene ID in that case:
tmpName[tmpName==""] <- rowData(sce)[tmpInd,"ensembl_gene_id"][tmpName==""]
tmpName[is.na(tmpName)] <- rowData(sce)[tmpInd,"ensembl_gene_id"][is.na(tmpName)]
rm(tmpInd)
g <- plotExpression(sce, rownames(dec.sce)[chosen.genes.index.tmp],
point_alpha=0.05, jitter="jitter") + fontsize
g <- g + scale_x_discrete(breaks=rownames(dec.sce)[chosen.genes.index.tmp],
labels=tmpName)
g
rm(chosen.genes.index.tmp)