projDir <- "/mnt/scratcha/bioinformatics/baller01/20200511_FernandesM_ME_crukBiSs2020"
outDirBit <- "AnaWiSce/Attempt1"
In part 1 we gathered the data, aligned reads, checked quality, and normalised read counts. We will now identify genes to focus on, use visualisation to explore the data, collapse the data set, cluster cells by their expression profile and identify genes that best characterise these cell populations. These main steps are shown below [@ANDREWS2018114].
We’ll first explain dimensionality reduction for visualisation, using Principal Component Analysis, t-SNE and UMAP.
In a single cell RNA-seq (scRNASeq) data set, each cell is described by the expression level of thoushands of genes.
The total number of genes measured is referred to as dimensionality. Each gene measured is one dimension in the space characterising the data set. Many genes will little vary across cells and thus be uninformative when comparing cells. Also, because some genes will have correlated expression patterns, some information is redundant. Moreover, we can represent data in three dimensions, not more. So reducing the number of useful dimensions is necessary.
The data set: a matrix with one row per sample and one variable per column. Here samples are cells and each variable is the normalised read count for a given gene.
The space: each cell is associated to a point in a multi-dimensional space where each gene is a dimension.
The aim: to find a new set of variables defining a space with fewer dimensions while losing as little information as possible.
Out of a set of variables (read counts), PCA defines new variables called Principal Components (PCs) that best capture the variability observed amongst samples (cells), see [@field2012discovering] for example.
The number of variables does not change. Only the fraction of variance captured by each variable differs. The first PC explains the highest proportion of variance possible (bound by prperties of PCA). The second PC explains the highest proportion of variance not explained by the first PC. PCs each explain a decreasing amount of variance not explained by the previous ones. Each PC is a dimension in the new space.
The total amount of variance explained by the first few PCs is usually such that excluding remaining PCs, ie dimensions, loses little information. The stronger the correlation between the initial variables, the stronger the reduction in dimensionality. PCs to keep can be chosen as those capturing at least as much as the average variance per initial variable or using a scree plot, see below.
PCs are linear combinations of the initial variables. PCs represent the same amount of information as the initial set and enable its restoration. The data is not altered. We only look at it in a different way.
About the mapping function from the old to the new space:
Two transformations of the data are necessary:
Here we will make a simple data set of 100 samples and 2 variables, perform PCA and visualise on the initial plane the data set and PCs [@pca_blog_Patcher2014].
library(ggplot2)
fontsize <- theme(axis.text=element_text(size=12), axis.title=element_text(size=16))
Let’s make and plot a data set.
set.seed(123) #sets the seed for random number generation.
x <- 1:100 #creates a vector x with numbers from 1 to 100
ex <- rnorm(100, 0, 30) #100 normally distributed rand. nos. w/ mean=0, s.d.=30
ey <- rnorm(100, 0, 30) # " "
y <- 30 + 2 * x #sets y to be a vector that is a linear function of x
x_obs <- x + ex #adds "noise" to x
y_obs <- y + ey #adds "noise" to y
P <- cbind(x_obs,y_obs) #places points in matrix
plot(P,asp=1,col=1) #plot points
points(mean(x_obs),mean(y_obs),col=3, pch=19) #show center
Center the data and compute covariance matrix.
M <- cbind(x_obs - mean(x_obs), y_obs - mean(y_obs)) #centered matrix
MCov <- cov(M) #creates covariance matrix
Compute the principal axes, ie eigenvectors and corresponding eigenvalues.
An eigenvector is a direction and an eigenvalue is a number measuring the spread of the data in that direction. The eigenvector with the highest eigenvalue is the first principal component.
The eigenvectors of the covariance matrix provide the principal axes, and the eigenvalues quantify the fraction of variance explained in each component.
eigenValues <- eigen(MCov)$values #compute eigenvalues
eigenVectors <- eigen(MCov)$vectors #compute eigenvectors
# or use 'singular value decomposition' of the matrix
d <- svd(M)$d #the singular values
v <- svd(M)$v #the right singular vectors
Let’s plot the principal axes.
First PC:
# PC 1:
plot(P,asp=1,col=1) #plot points
points(mean(x_obs),mean(y_obs),col=3, pch=19) #show center
lines(x_obs,eigenVectors[2,1]/eigenVectors[1,1]*M[x]+mean(y_obs),col=8)
Second PC:
plot(P,asp=1,col=1) #plot points
points(mean(x_obs),mean(y_obs),col=3, pch=19) #show center
# PC 1:
lines(x_obs,eigenVectors[2,1]/eigenVectors[1,1]*M[x]+mean(y_obs),col=8)
# PC 2:
lines(x_obs,eigenVectors[2,2]/eigenVectors[1,2]*M[x]+mean(y_obs),col=8)
Add the projections of the points onto the first PC:
plot(P,asp=1,col=1) #plot points
points(mean(x_obs),mean(y_obs),col=3, pch=19) #show center
# PC 1:
lines(x_obs,eigenVectors[2,1]/eigenVectors[1,1]*M[x]+mean(y_obs),col=8)
# PC 2:
lines(x_obs,eigenVectors[2,2]/eigenVectors[1,2]*M[x]+mean(y_obs),col=8)
# add projecions:
trans <- (M%*%v[,1])%*%v[,1] #compute projections of points
P_proj <- scale(trans, center=-cbind(mean(x_obs),mean(y_obs)), scale=FALSE)
points(P_proj, col=4,pch=19,cex=0.5) #plot projections
segments(x_obs,y_obs,P_proj[,1],P_proj[,2],col=4,lty=2) #connect to points
Compute PCs with prcomp().
pca_res <- prcomp(M)
summary(pca_res)
Importance of components:
PC1 PC2
Standard deviation 73.827 28.279
Proportion of Variance 0.872 0.128
Cumulative Proportion 0.872 1.000
var_explained <- pca_res$sdev^2/sum(pca_res$sdev^2)
var_explained
[1] 0.8720537 0.1279463
Check amount of variance captured by PCs on a scree plot.
# Show scree plot:
plot(pca_res)
Plot with ggplot.
df_pc <- data.frame(pca_res$x)
g <- ggplot(df_pc, aes(PC1, PC2)) +
geom_point(size=2) + # draw points
labs(title="PCA",
subtitle="With principal components PC1 and PC2 as X and Y axis") +
coord_cartesian(xlim = 1.2 * c(min(df_pc$PC1), max(df_pc$PC1)),
ylim = 1.2 * c(min(df_pc$PC2), max(df_pc$PC2)))
g <- g + geom_hline(yintercept=0)
g <- g + geom_vline(xintercept=0)
g
Or use ggfortify autoplot().
# ggfortify
library(ggfortify)
g <- autoplot(pca_res)
g <- g + geom_hline(yintercept=0)
g <- g + geom_vline(xintercept=0)
g
Going from 2D to 3D (figure from [@nlpcaPlot]):
library(scater) # for QC and plots
We will load the R file keeping the SCE object with the normalised counts for 500 cells per sample.
setName <- "caron"
setSuf <- "_5hCellPerSpl"
# Read object in:
tmpFn <- sprintf("%s/%s/Robjects/%s_sce_nz_postDeconv%s.Rds", projDir, outDirBit, setName, setSuf)
print(tmpFn)
[1] "/mnt/scratcha/bioinformatics/baller01/20200511_FernandesM_ME_crukBiSs2020/AnaWiSce/Attempt1/Robjects/caron_sce_nz_postDeconv_5hCellPerSpl.Rds"
if(!file.exists(tmpFn))
{
knitr::knit_exit()
}
sce <- readRDS(tmpFn)
sce
class: SingleCellExperiment
dim: 18372 5500
metadata(0):
assays(2): counts logcounts
rownames(18372): ENSG00000238009 ENSG00000237491 ... ENSG00000275063
ENSG00000271254
rowData names(11): ensembl_gene_id external_gene_name ... detected
gene_sparsity
colnames: NULL
colData names(20): Sample Barcode ... cell_sparsity sizeFactor
reducedDimNames(0):
altExpNames(0):
head(rowData(sce))
DataFrame with 6 rows and 11 columns
ensembl_gene_id external_gene_name chromosome_name
<character> <character> <character>
ENSG00000238009 ENSG00000238009 AL627309.1 1
ENSG00000237491 ENSG00000237491 LINC01409 1
ENSG00000225880 ENSG00000225880 LINC00115 1
ENSG00000230368 ENSG00000230368 FAM41C 1
ENSG00000230699 ENSG00000230699 AL645608.2 1
ENSG00000188976 ENSG00000188976 NOC2L 1
start_position end_position strand Symbol
<integer> <integer> <integer> <character>
ENSG00000238009 89295 133723 -1 AL627309.1
ENSG00000237491 778747 810065 1 AL669831.5
ENSG00000225880 826206 827522 -1 LINC00115
ENSG00000230368 868071 876903 -1 FAM41C
ENSG00000230699 911435 914948 1 AL645608.3
ENSG00000188976 944203 959309 -1 NOC2L
Type mean detected gene_sparsity
<character> <numeric> <numeric> <numeric>
ENSG00000238009 Gene Expression 0.000815739 0.0815739 0.999256
ENSG00000237491 Gene Expression 0.033200190 3.1966535 0.979330
ENSG00000225880 Gene Expression 0.013530022 1.3200511 0.985855
ENSG00000230368 Gene Expression 0.019421026 1.8757987 0.981826
ENSG00000230699 Gene Expression 0.000642947 0.0638929 0.998270
ENSG00000188976 Gene Expression 0.168154822 13.2828888 0.831308
#any(duplicated(rowData(nz.sce)$ensembl_gene_id))
# some function(s) used below complain about 'strand' already being used in row data,
# so rename that column now:
colnames(rowData(sce))[colnames(rowData(sce)) == "strand"] <- "strandNum"
Perform PCA, keep outcome in same object.
nbPcToComp <- 50
# compute PCA:
#sce <- runPCA(sce, ncomponents = nbPcToComp, method = "irlba")
sce <- runPCA(sce, ncomponents = nbPcToComp)
Display scree plot.
# with reducedDim
sce.pca <- reducedDim(sce, "PCA")
attributes(sce.pca)$percentVar
[1] 15.2999156 9.3354845 4.5969248 3.7569827 2.2441693 1.8559946
[7] 1.4766071 1.3096737 1.1098978 0.9504496 0.8836664 0.7173686
[13] 0.6430277 0.6102293 0.5772604 0.4536958 0.4116736 0.3936074
[19] 0.3782046 0.3523188 0.3268835 0.3112065 0.2935900 0.2818532
[25] 0.2781618 0.2690670 0.2621160 0.2521110 0.2453925 0.2417972
[31] 0.2382387 0.2329139 0.2317321 0.2285999 0.2248257 0.2241382
[37] 0.2231744 0.2198212 0.2189487 0.2175669 0.2165470 0.2152846
[43] 0.2138223 0.2115537 0.2089565 0.2077544 0.2067207 0.2042463
[49] 0.2040119 0.2030388
barplot(attributes(sce.pca)$percentVar,
main=sprintf("Scree plot for the %s first PCs", nbPcToComp),
names.arg=1:nbPcToComp,
cex.names = 0.8)
Display cells on a plot for the first 2 PCs, colouring by ‘Sample’ and setting size to match ‘total_features’.
The proximity of cells reflects the similarity of their expression profiles.
g <- plotPCA(sce,
colour_by = "Sample.Name",
size_by = "sum"
)
g
One can also split the plot by sample.
g <- g + facet_wrap(sce$source_name ~ .)
g
Or plot several PCs at once, using plotReducedDim():
plotReducedDim(sce, dimred="PCA", ncomponents=3,
colour_by = "Sample.Name") + fontsize
The PCA plot above shows cells as symbols whose size depends on the total number of features or library size. It suggests there may be a correlation between PCs and these variables. Let’s check:
r2mat <- getExplanatoryPCs(sce)
Warning in .get_variance_explained(assay(x, exprs_values), variables =
variables, : ignoring 'setName' with fewer than 2 unique levels
Warning in .get_variance_explained(assay(x, exprs_values), variables =
variables, : ignoring 'discard' with fewer than 2 unique levels
r2mat
Sample Barcode Run Sample.Name source_name block sum
PC1 32.38640 100 32.38640 32.38640 8.4986462 8.4986462 0.58110841
PC2 47.80688 100 47.80688 47.80688 35.3418361 35.3418361 1.85421786
PC3 32.75978 100 32.75978 32.75978 14.6767195 14.6767195 44.34756096
PC4 55.99486 100 55.99486 55.99486 38.9799200 38.9799200 3.03070349
PC5 28.24749 100 28.24749 28.24749 7.8050559 7.8050559 3.72489579
PC6 41.88543 100 41.88543 41.88543 31.4250588 31.4250588 0.42052937
PC7 67.52140 100 67.52140 67.52140 55.9836036 55.9836036 0.90308302
PC8 34.71980 100 34.71980 34.71980 17.0250718 17.0250718 2.59669739
PC9 14.28924 100 14.28924 14.28924 0.7097667 0.7097667 0.78226768
PC10 40.42089 100 40.42089 40.42089 4.9130312 4.9130312 0.08101985
detected percent_top_50 percent_top_100 percent_top_200 percent_top_500
PC1 5.10606516 37.12952734 20.24995118 14.44253398 7.54488853
PC2 2.70714393 44.10224023 58.18308601 57.72742188 29.01344442
PC3 49.60355537 2.22943741 2.66854556 6.19758430 25.85336598
PC4 1.34946212 1.24120434 2.19054145 1.19146786 0.99305621
PC5 6.15721522 1.90417550 2.45830122 3.37462685 6.50580537
PC6 1.37903530 0.07333369 0.29860955 1.26890679 5.51767508
PC7 0.09688156 2.60892250 2.64570737 2.31569588 1.57043418
PC8 2.84075090 0.66151948 0.83007858 0.90919973 1.15000522
PC9 0.70593085 0.07163045 0.04900552 0.08439033 0.00211341
PC10 0.04951289 0.12988712 0.19737690 0.32858007 0.31489508
subsets_Mito_sum subsets_Mito_detected subsets_Mito_percent total
PC1 6.031782854 16.1920172 9.8272620 0.58110841
PC2 0.374532007 2.8225720 10.5752699 1.85421786
PC3 25.627700690 14.4055366 0.3414742 44.34756096
PC4 6.966131262 3.7827424 2.7006483 3.03070349
PC5 5.373659131 9.9353426 4.7529521 3.72489579
PC6 0.320199729 6.0125147 1.3606729 0.42052937
PC7 3.216854677 1.7328138 6.3300506 0.90308302
PC8 0.001606446 1.0963393 12.4357818 2.59669739
PC9 8.263537293 14.6306191 31.2286456 0.78226768
PC10 0.114705832 0.2138113 0.5220398 0.08101985
setName discard cell_sparsity sizeFactor
PC1 NA NA 5.10580877 4.625288747
PC2 NA NA 2.70789745 1.199340088
PC3 NA NA 49.60164087 42.776596791
PC4 NA NA 1.35174775 0.438052834
PC5 NA NA 6.15802883 4.108075966
PC6 NA NA 1.37842626 0.365211036
PC7 NA NA 0.09681848 0.158464104
PC8 NA NA 2.84093898 2.989803463
PC9 NA NA 0.70638354 1.077646110
PC10 NA NA 0.04948949 0.004907248
dat <- cbind(colData(sce)[,c("Sample.Name",
"source_name",
"sum",
"detected",
"percent_top_200",
"subsets_Mito_percent")],
reducedDim(sce,"PCA"))
dat <- data.frame(dat)
dat$sum <- log2(dat$sum)
ggplot(dat, aes(x=sum, y=PC1, shape=source_name, col=Sample.Name)) +
geom_point() +
geom_smooth(method=lm, inherit.aes = FALSE, aes(x=sum, y=PC1))
`geom_smooth()` using formula 'y ~ x'
ggplot(dat, aes(x=percent_top_200, y=PC2, shape=source_name, col=Sample.Name)) +
geom_point() +
geom_smooth(method=lm, inherit.aes = FALSE, aes(x=percent_top_200, y=PC2))
`geom_smooth()` using formula 'y ~ x'
ggplot(dat, aes(x=detected, y=PC3, shape=source_name, col=Sample.Name)) +
geom_point() +
geom_smooth(method=lm, inherit.aes = FALSE, aes(x=detected, y=PC3))
`geom_smooth()` using formula 'y ~ x'
ggplot(dat, aes(x=subsets_Mito_percent, y=PC2, shape=source_name, col=Sample.Name)) +
geom_point() +
geom_smooth(method=lm, inherit.aes = FALSE, aes(x=subsets_Mito_percent, y=PC2))
`geom_smooth()` using formula 'y ~ x'
ggplot(dat, aes(x=source_name, y=PC7, shape=source_name, col=Sample.Name)) +
geom_boxplot()
The Stochastic Neighbor Embedding (SNE) approach address two shortcomings of PCA that captures the global covariance structure with a linear combination of initial variables: by preserving the local structure allowing for non-linear projections. It uses two distributions of the pairwise similarities between data points: in the input data set and in the low-dimensional space.
SNE aims at preserving neighbourhoods. For each points, it computes probabilities of chosing each other point as its neighbour based on a Normal distribution depending on 1) the distance matrix and 2) the size of the neighbourhood (perplexity). SNE aims at finding a low-dimension space (eg 2D-plane) such that the similarity matrix deriving from it is as similar as possible as that from the high-dimension space. To address the fact that in low dimension, points are brought together, the similarity matrix in the low-dimension is allowed to follow a t-distribution.
Two characteristics matter:
See misread-tsne.
Compute t-SNE with default perplexity, ie 50.
# runTSNE default perpexity if min(50, floor(ncol(object)/5))
sce <- runTSNE(sce, dimred="PCA", perplexity=50, rand_seed=123)
Plot t-SNE:
tsne50 <- plotTSNE(sce,
colour_by="Sample.Name",
size_by="sum") +
fontsize +
ggtitle("Perplexity = 50")
tsne50
Compute t-SNE for several perplexity values:
tsne5.run <- runTSNE(sce, use_dimred="PCA", perplexity=5, rand_seed=123)
tsne5 <- plotTSNE(tsne5.run, colour_by="Sample.Name") + fontsize + ggtitle("Perplexity = 5")
#tsne200.run <- runTSNE(sce, use_dimred="PCA", perplexity=200, rand_seed=123)
#tsne200 <- plotTSNE(tsne200.run, colour_by="Sample.Name") + fontsize + ggtitle("Perplexity = 200")
tsne500.run <- runTSNE(sce, use_dimred="PCA", perplexity=500, rand_seed=123)
tsne500 <- plotTSNE(tsne500.run, colour_by="Sample.Name") + fontsize + ggtitle("Perplexity = 500")
#tsne1000.run <- runTSNE(sce, use_dimred="PCA", perplexity=1000, rand_seed=123)
#tsne1000 <- plotTSNE(tsne1000.run, colour_by="Sample.Name") + fontsize + ggtitle("Perplexity = 1000")
tsne5
#tsne50
#tsne200
tsne500
Use a different seed with the same perplexity 50.
tsne50.b <- runTSNE(sce, use_dimred="PCA", perplexity=50, rand_seed=456)
tsne50.b <- plotTSNE(tsne50.b,
colour_by="Sample.Name",
size_by="sum") +
fontsize +
ggtitle("Perplexity = 50, seed 456")
tsne50.b
Another neighbour graph method. Similar to t-SNE, but that is determistic, faster and claims to preserve both local and global structures.
Compute UMAP.
set.seed(123)
sce <- runUMAP(sce, dimred="PCA")
Plot UMAP:
sce.umap <- plotUMAP(sce,
colour_by="Sample.Name",
size_by="sum") +
fontsize +
ggtitle("UMAP")
sce.umap
Save SCE object:
tmpFn <- sprintf("%s/%s/Robjects/%s_sce_nz_postDeconv%s_dimRed.Rds", projDir, outDirBit, setName, setSuf)
saveRDS(sce, tmpFn)