1 Identification of cell populations

1.1 Preamble

In part 1 we gathered the data, aligned reads, checked quality, and normalised read counts. We will now identify genes to focus on, use visualisation to explore the data, collapse the data set, cluster cells by their expression profile and identify genes that best characterise these cell populations. These main steps are shown below [1].

This practical explains how to identify cell populations using R and draws on several sources [2–4].

We’ll first explain dimension reduction using Principal Component Analysis.

1.2 Principal Component Analysis

In a single cell RNA-seq (scRNASeq) data set, each cell is described by the expression level of thoushands of genes.

The total number of genes measured is referred to as dimensionality. Each gene measured is one dimension in the space characterising the data set. Many genes will little vary across cells and thus be uninformative when comparing cells. Also, because some genes will have correlated expression patterns, some information is redundant. Moreover, we can represent data in three dimensions, not more. So reducing the number of useful dimensions is necessary.

1.2.1 Description

The data set: a matrix with one row per sample and one variable per column. Here samples are cells and each variable is the normalised read count for a given gene.

The space: each cell is associated to a point in a multi-dimensional space where each gene is a dimension.

The aim: to find a new set of variables defining a space with fewer dimensions while losing as little information as possible.

Out of a set of variables (read counts), PCA defines new variables called Principal Components (PCs) that best capture the variability observed amongst samples (cells), see [5] for example.

The number of variables does not change. Only the fraction of variance captured by each variable differs. The first PC explains the highest proportion of variance possible (bound by prperties of PCA). The second PC explains the highest proportion of variance not explained by the first PC. PCs each explain a decreasing amount of variance not explained by the previous ones. Each PC is a dimension in the new space.

The total amount of variance explained by the first few PCs is usually such that excluding remaining PCs, ie dimensions, loses little information. The stronger the correlation between the initial variables, the stronger the reduction in dimensionality. PCs to keep can be chosen as those capturing at least as much as the average variance per initial variable or using a scree plot, see below.

PCs are linear combinations of the initial variables. PCs represent the same amount of information as the initial set and enable its restoration. The data is not altered. We only look at it in a different way.

About the mapping function from the old to the new space:

  • it is linear
  • it is inverse, to restore the original space
  • it relies on orthogonal PCs so that the total variance remains the same.

Two transformations of the data are necessary:

  • center the data so that the sample mean for each column is 0 so the covariance matrix of the intial matrix takes a simple form
  • scale variance to 1, ie standardize, to avoid PCA loading on variables with large variance.

1.2.2 Example

Here we will make a simple data set of 100 samples and 2 variables, perform PCA and visualise on the initial plane the data set and PCs [6].

library(ggplot2)

Let’s make and plot a data set.

set.seed(123)            #sets the seed for random number generation.
 x <- 1:100              #creates a vector x with numbers from 1 to 100
 ex <- rnorm(100, 0, 30) #100 normally distributed rand. nos. w/ mean=0, s.d.=30
 ey <- rnorm(100, 0, 30) # " " 
 y <- 30 + 2 * x         #sets y to be a vector that is a linear function of x
 x_obs <- x + ex         #adds "noise" to x
 y_obs <- y + ey         #adds "noise" to y
 P <- cbind(x_obs,y_obs) #places points in matrix
 plot(P,asp=1,col=1) #plot points
 points(mean(x_obs),mean(y_obs),col=3, pch=19) #show center

Center the data and compute covariance matrix.

M <- cbind(x_obs - mean(x_obs), y_obs - mean(y_obs)) #centered matrix
MCov <- cov(M)          #creates covariance matrix

Compute the principal axes, ie eigenvectors and corresponding eigenvalues.

An eigenvector is a direction and an eigenvalue is a number measuring the spread of the data in that direction. The eigenvector with the highest eigenvalue is the first principal component.

The eigenvectors of the covariance matrix provide the principal axes, and the eigenvalues quantify the fraction of variance explained in each component.

eigenValues <- eigen(MCov)$values       #compute eigenvalues
eigenVectors <- eigen(MCov)$vectors     #compute eigenvectors

# or use 'singular value decomposition' of the matrix
d <- svd(M)$d          #the singular values
v <- svd(M)$v          #the right singular vectors

Let’s plot the principal axes.

First PC:

# PC 1:
 plot(P,asp=1,col=1) #plot points
 points(mean(x_obs),mean(y_obs),col=3, pch=19) #show center
lines(x_obs,eigenVectors[2,1]/eigenVectors[1,1]*M[x]+mean(y_obs),col=8)

Second PC:

 plot(P,asp=1,col=1) #plot points
 points(mean(x_obs),mean(y_obs),col=3, pch=19) #show center
# PC 1:
lines(x_obs,eigenVectors[2,1]/eigenVectors[1,1]*M[x]+mean(y_obs),col=8)
# PC 2:
lines(x_obs,eigenVectors[2,2]/eigenVectors[1,2]*M[x]+mean(y_obs),col=8)

Add the projections of the points onto the first PC:

plot(P,asp=1,col=1) #plot points
points(mean(x_obs),mean(y_obs),col=3, pch=19) #show center
# PC 1:
lines(x_obs,eigenVectors[2,1]/eigenVectors[1,1]*M[x]+mean(y_obs),col=8)
# PC 2:
lines(x_obs,eigenVectors[2,2]/eigenVectors[1,2]*M[x]+mean(y_obs),col=8)
# add projecions:
trans <- (M%*%v[,1])%*%v[,1] #compute projections of points
P_proj <- scale(trans, center=-cbind(mean(x_obs),mean(y_obs)), scale=FALSE) 
points(P_proj, col=4,pch=19,cex=0.5) #plot projections
segments(x_obs,y_obs,P_proj[,1],P_proj[,2],col=4,lty=2) #connect to points

Could use prcomp():

Compute PCs with prcomp().

pca_res <- prcomp(M)

Check amount of variance captured by PCs on a scree plot.

# Show scree plot:
plot(pca_res)

# (calls screeplot())

Plot with ggplot.

df_pc <- data.frame(pca_res$x)
g <- ggplot(df_pc, aes(PC1, PC2)) + 
  geom_point(size=2) +   # draw points
  labs(title="PCA", 
       subtitle="With principal components PC1 and PC2 as X and Y axis") + 
  coord_cartesian(xlim = 1.2 * c(min(df_pc$PC1), max(df_pc$PC1)), 
                  ylim = 1.2 * c(min(df_pc$PC2), max(df_pc$PC2)))
g <- g + geom_hline(yintercept=0)
g <- g + geom_vline(xintercept=0)
g

Or use ggfortify autoplot().

# ggfortify
library(ggfortify)
g <- autoplot(pca_res)
g <- g + geom_hline(yintercept=0)
g <- g + geom_vline(xintercept=0)
g

Going from 2D to 3D (figure from [7]):

Now let’s analyse our data set.

1.3 Load packages

library(scater) # for QC and plots
library(scran) # for normalisation
library(dynamicTreeCut)
library(cluster)
library(broom)
library(tibble)
library(dplyr)
library(tidyr)
library(purrr)
library(pheatmap)
library(RColorBrewer)
library(viridis)

Set font size for plots.

fontsize <- theme(axis.text=element_text(size=12), axis.title=element_text(size=16))

1.4 Load normalised counts

The R object keeping the normalised counts obtained at the end of part 1 was written to a file for you: Tcells_SCE.Rds. Let’s load this file.

# dir
inpDir <- "/Users/baller01/MyProjectsSvn/SvnRepoForScRnaSeq/20180516_FernandesM_ME_crukBioinfSumSch2018/trunk/ScRnaSeqIntroPart2"
dataSubDir <- "Data/PopId_allCells"

#inpDir <- "/home/participant/Course_Materials/SinglecellToUse/HumanBreastTCells"
#dataSubDir <- "GRCh38"
# file
rObjFile <- "Tcells_SCE.Rds"

# check dir exist:
if(! dir.exists(inpDir))
{ stop(sprintf("Cannot find dir inpDir '%s'", inpDir)) }
if(! dir.exists(file.path(inpDir, dataSubDir)))
{ stop(sprintf("Cannot find dir dataSubDir '%s'", file.path(inpDir, dataSubDir))) }

# check file exists:
tmpFileName <- file.path(inpDir, dataSubDir, rObjFile)
if(! file.exists(tmpFileName))
{ stop(sprintf("Cannot find dir tmpFileName '%s'", tmpFileName)) }

setwd(file.path(inpDir, dataSubDir))

# load file:
# Remember name of object saved in the file, or make up a new one
nz.sce <- readRDS(tmpFileName)

# check:
nz.sce
## class: SingleCellExperiment 
## dim: 13010 6312 
## metadata(1): log.exprs.offset
## assays(2): counts logcounts
## rownames(13010): ENSG00000279457 ENSG00000228463 ...
##   ENSG00000278384 ENSG00000276345
## rowData names(16): ensembl_gene_id external_gene_name ...
##   n_cells_counts pct_dropout_counts
## colnames: NULL
## colData names(46): Sample Barcode ... total_features_Mito
##   log10_total_features_Mito
## reducedDimNames(0):
## spikeNames(0):
# features data:
head(rowData(nz.sce))
## DataFrame with 6 rows and 16 columns
##   ensembl_gene_id external_gene_name chromosome_name start_position
##       <character>        <character>     <character>      <integer>
## 1 ENSG00000279457         FO538757.1               1         185217
## 2 ENSG00000228463         AP006222.1               1         257864
## 3 ENSG00000237491         AL669831.5               1         778770
## 4 ENSG00000225880          LINC00115               1         826206
## 5 ENSG00000230368             FAM41C               1         868071
## 6 ENSG00000187634             SAMD11               1         923928
##   end_position    strand is_feature_control is_feature_control_Mito
##      <integer> <integer>          <logical>               <logical>
## 1       195411        -1              FALSE                   FALSE
## 2       359681        -1              FALSE                   FALSE
## 3       810060         1              FALSE                   FALSE
## 4       827522        -1              FALSE                   FALSE
## 5       876903        -1              FALSE                   FALSE
## 6       944581         1              FALSE                   FALSE
##           mean_counts   log10_mean_counts n_cells_by_counts
##             <numeric>           <numeric>         <integer>
## 1  0.0805124960455552  0.0336297938312443               477
## 2  0.0447643150901613  0.0190183304469851               266
## 3  0.0085416007592534   0.003693816884495                54
## 4  0.0147105346409364 0.00634216920738639                92
## 5  0.0101233786776337 0.00437442269986513                64
## 6 0.00300537804492249 0.00130326168340058                17
##   pct_dropout_by_counts total_counts log10_total_counts n_cells_counts
##               <numeric>    <numeric>          <numeric>      <integer>
## 1      92.4549193293262          509   2.70757017609794            477
## 2      95.7924707371085          283   2.45331834004704            266
## 3      99.1458399240747           54   1.74036268949424             54
## 4      98.5447643150902           93    1.9731278535997             92
## 5      98.9876621322366           64   1.81291335664286             64
## 6      99.7310977538753           19   1.30102999566398             17
##   pct_dropout_counts
##            <numeric>
## 1   92.4549193293262
## 2   95.7924707371085
## 3   99.1458399240747
## 4   98.5447643150902
## 5   98.9876621322366
## 6   99.7310977538753
#any(duplicated(rowData(nz.sce)$ensembl_gene_id))
# some function(s) used below complain about 'strand' already being used in row data,
# so rename that column now:
colnames(rowData(nz.sce))[colnames(rowData(nz.sce)) == "strand"] <- "strandNum"

# have sample name Tils20 for Tils20_1 and Tils20_2
colData(nz.sce)$Sample2 <- gsub("_[12]", "", colData(nz.sce)$Sample)

1.5 Data exploration with dimensionality reduction

1.5.1 PCA

Perform PCA, keep outcome in new object.

nbPcToComp <- 50
# compute PCA:
nz.sce <- runPCA(nz.sce, ncomponents = nbPcToComp, method = "irlba")

Display scree plot.

# with reducedDim
nz.sce.pca <- reducedDim(nz.sce, "PCA")
attributes(nz.sce.pca)$percentVar
##  [1] 0.083928796 0.031545000 0.021043351 0.014596215 0.012132267
##  [6] 0.010249901 0.008696996 0.007895337 0.006566651 0.005837878
## [11] 0.005523357 0.004976257 0.004545624 0.004399465 0.004248950
## [16] 0.004018964 0.003745983 0.003460966 0.003349146 0.003313608
## [21] 0.003265357 0.003205853 0.003122210 0.003088661 0.003044352
## [26] 0.002976101 0.002927976 0.002906630 0.002871550 0.002858724
## [31] 0.002842479 0.002837611 0.002800708 0.002781571 0.002765381
## [36] 0.002744475 0.002728944 0.002717324 0.002704822 0.002693545
## [41] 0.002689588 0.002676053 0.002669631 0.002650391 0.002641537
## [46] 0.002626736 0.002614868 0.002606981 0.002600771 0.002582161
barplot(attributes(nz.sce.pca)$percentVar,
        main=sprintf("Scree plot for the %s first PCs", nbPcToComp),
        names.arg=1:nbPcToComp,
        cex.names = 0.8)

Display cells on a plot for the first 2 PCs, colouring by ‘Sample’ and setting size to match ‘total_features’.

The proximity of cells reflects the similarity of their expression profiles.

# plot PCA with plotPCA():
g <- plotPCA(nz.sce)
#sce3 <- runPCA(nz.sce, ncomponents = 10, method = "prcomp")
#plotPCA(sce3)
g <- plotPCA(nz.sce,
        colour_by = "Sample",
        size_by = "total_features"
)         
g

#require(knitr); knit_exit()

Any observation?

One can also split the plot, say by sample.

g <- g +  facet_grid(nz.sce$Sample ~ .)
g

Or plot several PCs at once, using plotReducedDim():

plotReducedDim(nz.sce, use_dimred="PCA", ncomponents=3, 
        colour_by = "Sample",
        size_by = "total_features") + fontsize

1.5.2 Correlation between PCs and the total number of features detected

The PCA plot above shows cells as symbols whose size depends on the total number of features or library size. It suggests there may be a correlation between PCs and these variables. Let’s check:

g <- plotQC(
    nz.sce,
    type = "find-pcs",
    exprs_values = "logcounts",
    variable = "total_features"
)
g

These plots show that PC2 and PC1 correlate with the number of detected genes. This correlation is often observed.

Challenge: Check correlation of PCs with library size. Was the outcome expected?

g <- plotQC(
    nz.sce,
    type = "find-pcs",
    exprs_values = "logcounts",
    variable = "total_counts"
)
g

1.5.3 t-SNE

PCA represents relationships in the high-dimensional space linearly, while t-SNE allows non-linear relationships and thus usually separates cells from diverse populations better.

t-SNE stands for “T-distributed stochastic neighbor embedding”. It is a stochastic method to visualise large high dimensional datasets by preserving local structure amongst cells.

Two characteristics matter:

  • perplexity, to indicate the relative importance of the local and global patterns in structure of the data set, usually use a value of 50,
  • stochasticity; running the analysis will produce a different map every time, unless the seed is set.

See misread-tsne.

1.5.3.1 Perplexity

Compute t-SNE with default perplexity, ie 50.

# runTSNE default perpexity if min(50, floor(ncol(object)/5))
nz.sce <- runTSNE(nz.sce, use_dimred="PCA", perplexity=50, rand_seed=123)

Plot t-SNE:

tsne50 <- plotTSNE(nz.sce,
           colour_by="Sample",
           size_by="total_features") + 
         fontsize + 
         ggtitle("Perplexity = 50")
tsne50

Split by sample:

g <- tsne50 + facet_grid(. ~ nz.sce$Sample2)
g

Compute t-SNE for several perplexity values:

tsne5.run <- runTSNE(nz.sce, use_dimred="PCA", perplexity=5, rand_seed=123)
tsne5 <- plotTSNE(tsne5.run, colour_by="Sample") + fontsize + ggtitle("Perplexity = 5")

tsne1000.run <- runTSNE(nz.sce, use_dimred="PCA", perplexity=1000, rand_seed=123)
tsne1000 <- plotTSNE(tsne1000.run, colour_by="Sample") + fontsize + ggtitle("Perplexity = 1000")
#multiplot(tsne5, tsne50, tsne1000, cols=1)
tsne50.1 <- plotTSNE(nz.sce, colour_by="Sample") + fontsize + ggtitle("Perplexity = 50")
tsne5

tsne50.1

tsne1000

Challenge: t-SNE is a stochastic method. Change the seed with ‘rand_seed=’, compute and plot t-SNE. Try that a few times.

tsne50.run500 <- runTSNE(nz.sce, use_dimred="PCA", perplexity=50, rand_seed=500)
tsne50.500 <- plotTSNE(tsne50.run500, colour_by="Sample") +  fontsize + ggtitle("Perplexity = 50, seed 500")
#multiplot(tsne50, tsne50.500, cols=1)
tsne50.1

tsne50.500

1.5.4 Other methods

Several other dimensionality reduction techniques could also be used, e.g., multidimensional scaling, diffusion maps [1].

1.6 Feature selection

scRNASeq measures the expression of thousands of genes in each cell. The biological question asked in a study will most often relates to a fraction of these genes only, linked for example to differences between cell types, drivers of differentiation, or response to perturbation.

Most high-throughput molecular data include variation created by the assay itself, not biology, i.e. technical noise, for example caused by sampling during RNA capture and library preparation. In scRNASeq, this technical noise will result in most genes being detected at different levels. This noise may hinder the detection of the biological signal.

Let’s identify Highly Variables Genes (HVGs) with the aim to find those underlying the heterogeneity observed across cells.

1.6.1 Modelling and removing technical noise

Some assays allow the inclusion of known molecules in a known amount covering a wide range, from low to high abundance: spike-ins. The technical noise is assessed based on the amount of spike-ins used, the corresponding read counts obtained and their variation across cells. The variance in expression can then be decomposed into the biolgical and technical components.

UMI-based assays do not (yet?) allow spike-ins. But one can still identify HVGs, that is genes with the highest biological component. Assuming that expression does not vary across cells for most genes, the total variance for these genes mainly reflects technical noise. The latter can thus be assessed by fitting a trend to the variance in expression. The fitted value will be the estimate of the technical component.

Let’s fit a trend to the variance, using trendVar().

var.fit <- trendVar(nz.sce, method="loess", use.spikes=FALSE, loess.args=list("span"=0.05)) 

Plot variance against mean of expression (log scale) and the mean-dependent trend fitted to the variance:

plot(var.fit$mean, var.fit$var)
curve(var.fit$trend(x), col="red", lwd=2, add=TRUE)

Decompose variance into technical and biological components:

var.out <- decomposeVar(nz.sce, var.fit)

1.6.2 Choosing some HVGs:

Identify the top 20 HVGs by sorting genes in decreasing order of biological component.

# order genes by decreasing order of biological component
o <- order(var.out$bio, decreasing=TRUE)
# check top and bottom of sorted table
head(var.out[o,])
## DataFrame with 6 rows and 6 columns
##                              mean            total              bio
##                         <numeric>        <numeric>        <numeric>
## ENSG00000271503  1.67689971033089  3.1497503485454 2.22475055863307
## ENSG00000115523 0.633935956222409  1.8708493185678   1.297856694312
## ENSG00000275302 0.637546052173982 1.62175725328888 1.04655441021237
## ENSG00000090104  1.53732408277148 1.95975309354815 1.04644201869744
## ENSG00000145649  1.15484252319733 1.87253535755362  1.0428740834553
## ENSG00000187608  1.26430250149616 1.87793365545449 1.02306562438384
##                              tech   p.value       FDR
##                         <numeric> <numeric> <numeric>
## ENSG00000271503 0.924999789912326         0         0
## ENSG00000115523 0.572992624255796         0         0
## ENSG00000275302 0.575202843076502         0         0
## ENSG00000090104 0.913311074850705         0         0
## ENSG00000145649  0.82966127409832         0         0
## ENSG00000187608 0.854868031070651         0         0
tail(var.out[o,])
## DataFrame with 6 rows and 6 columns
##                             mean             total                bio
##                        <numeric>         <numeric>          <numeric>
## ENSG00000115268 3.18236405238753 0.625426355254008 -0.245479041561776
## ENSG00000234745 3.26445705396995 0.608183010708989 -0.256381162638251
## ENSG00000205542 6.20989925459274 0.403834593541917 -0.265868672173556
## ENSG00000174748 3.03864593768813 0.609471076208079 -0.272760142922892
## ENSG00000204525 3.33497453507121 0.562064236848997 -0.297308268720016
## ENSG00000166710 6.32401776809998 0.250114230724534 -0.412797227985147
##                              tech   p.value       FDR
##                         <numeric> <numeric> <numeric>
## ENSG00000115268 0.870905396815784         1         1
## ENSG00000234745  0.86456417334724         1         1
## ENSG00000205542 0.669703265715473         1         1
## ENSG00000174748 0.882231219130972         1         1
## ENSG00000204525 0.859372505569013         1         1
## ENSG00000166710 0.662911458709681         1         1
# choose the top 20 genes with the highest biological component
chosen.genes.index <- o[1:20]

Show the top 20 HVGs on the plot displaying the variance against the mean expression:

plot(var.fit$mean, var.fit$var)
curve(var.fit$trend(x), col="red", lwd=2, add=TRUE)
points(var.fit$mean[chosen.genes.index], var.fit$var[chosen.genes.index], col="orange")

Rather than choosing a fixed number of top genes, one may define ‘HVGs’ as genes with a positive biological component, ie whose variance is higher than the fitted value for the corresponding mean expression.

Select and show these ‘HVGs’ on the plot displaying the variance against the mean expression:

hvgBool <- var.out$bio > 0
table(hvgBool)
## hvgBool
## FALSE  TRUE 
##  6031  6979
hvg.index <- which(hvgBool)
plot(var.fit$mean, var.fit$var)
curve(var.fit$trend(x), col="red", lwd=2, add=TRUE)
points(var.fit$mean[hvg.index], var.fit$var[hvg.index], col="orange")

HVGs may be driven by outlier cells. So let’s plot the distribution of expression values for the genes with the largest biological components.

First, get gene names to replace ensembl IDs on plot.

# the count matrix rows are named with ensembl gene IDs. Let's label gene with their name instead:
# row indices of genes in rowData(nz.sce)
tmpInd <- which(rowData(nz.sce)$ensembl_gene_id %in% rownames(var.out)[chosen.genes.index])
# check:
rowData(nz.sce)[tmpInd,c("ensembl_gene_id","external_gene_name")]
## DataFrame with 20 rows and 2 columns
##     ensembl_gene_id external_gene_name
##         <character>        <character>
## 1   ENSG00000187608              ISG15
## 2   ENSG00000186827            TNFRSF4
## 3   ENSG00000177606                JUN
## 4   ENSG00000196154             S100A4
## 5   ENSG00000090104               RGS1
## ...             ...                ...
## 16  ENSG00000271503               CCL5
## 17  ENSG00000277632               CCL3
## 18  ENSG00000275302               CCL4
## 19  ENSG00000171223               JUNB
## 20  ENSG00000105374               NKG7
# store names:
tmpName <- rowData(nz.sce)[tmpInd,"external_gene_name"]
# the gene name may not be known, so keep the ensembl gene ID in that case:
tmpName[tmpName==""] <- rowData(nz.sce)[tmpInd,"ensembl_gene_id"][tmpName==""]
tmpName[is.na(tmpName)] <- rowData(nz.sce)[tmpInd,"ensembl_gene_id"][is.na(tmpName)]
rm(tmpInd)

Now show a violin plot for each gene, using plotExpression() and label genes with their name:

g <- plotExpression(nz.sce, rownames(var.out)[chosen.genes.index], 
    alpha=0.05, jitter="jitter") + fontsize
g <- g + scale_x_discrete(breaks=rownames(var.out)[chosen.genes.index],
        labels=tmpName)
g

Challenge: Show violin plots for the 20 genes with the lowest biological component. How do they compare to the those for HVGs chosen above?

chosen.genes.index.tmp <- order(var.out$bio, decreasing=FALSE)[1:20]
tmpInd <- (which(rowData(nz.sce)$ensembl_gene_id %in% rownames(var.out)[chosen.genes.index.tmp]))
# check:
rowData(nz.sce)[tmpInd,c("ensembl_gene_id","external_gene_name")]
# store names:
tmpName <- rowData(nz.sce)[tmpInd,"external_gene_name"]
# the gene name may not be known, so keep the ensembl gene ID in that case:
tmpName[tmpName==""] <- rowData(nz.sce)[tmpInd,"ensembl_gene_id"][tmpName==""]
tmpName[is.na(tmpName)] <- rowData(nz.sce)[tmpInd,"ensembl_gene_id"][is.na(tmpName)]
rm(tmpInd)
g <- plotExpression(nz.sce, rownames(var.out)[chosen.genes.index.tmp], 
            alpha=0.05, jitter="jitter") + fontsize
g <- g + scale_x_discrete(breaks=rownames(var.out)[chosen.genes.index.tmp],
        labels=tmpName)
g
rm(chosen.genes.index.tmp)

1.7 Denoising expression values using PCA

Aim: use the trend fitted above to identify PCs linked to biology.

Assumption: biology drives most of the variance hence should be captured by the first PCs, while technical noise affects each gene independently, hence is captured by later PCs.

Logic: Compute the sum of the technical component across genes used in the PCA, use it as the amount of variance not related to biology and that we should therefore remove. Later PCs are excluded until the amount of variance they account for matches that corresponding to the technical component.

# remove uninteresting PCs:
nz.sce <- denoisePCA(nz.sce, technical=var.fit$trend, assay.type="logcounts", approximate=TRUE)
#rObjFile <- "Tcells_SCE_comb_denoisePCA.Rds"; readRDS(rObjFile)
# check assay names, should see 'PCA':
assayNames(nz.sce)
## [1] "counts"    "logcounts"
# check dimension of the PC table:
dim(reducedDim(nz.sce, "PCA")) 
## [1] 6312    9
nz.sce.pca <- reducedDim(nz.sce, "PCA") #??get copy of PCA matrix
tmpCol <- rep("grey", nbPcToComp) #??set colours to show selected PCs in green
tmpCol[1:dim(nz.sce.pca)[2]] <- "green"
barplot(attributes(nz.sce.pca)$percentVar[1:nbPcToComp],
        main=sprintf("Scree plot for the %s first PCs", nbPcToComp),
        names.arg=1:nbPcToComp,
        col=tmpCol,
        cex.names = 0.8)

# cumulative proportion of variance explained by selected PCs
cumsum(attributes(nz.sce.pca)$percentVar)[1:dim(nz.sce.pca)[2]]
## [1] 0.04586455 0.06872205 0.08737289 0.09811947 0.10663176 0.11460659
## [7] 0.12133433 0.12704685 0.13244350
#??plot on PC1 and PC2 plane:
plotPCA(nz.sce, colour_by = "Sample")

#require(knitr); knit_exit()
rm(tmpCol)

Show cells on plane for PC1 and PC2:

plotReducedDim(nz.sce, use_dimred = "PCA", ncomponents = 3, 
        colour_by = "Sample",
        size_by = "total_features") + fontsize

1.8 Visualise expression patterns of some HVGs

On PCA plot:

# make and store PCA plot for top HVG 1:
pca1 <- plotReducedDim(nz.sce, use_dimred="PCA", colour_by=rowData(nz.sce)[chosen.genes.index[1],"ensembl_gene_id"]) + fontsize  # + coord_fixed()
# make and store PCA plot for top HVG 2:
pca2 <- plotReducedDim(nz.sce, use_dimred="PCA", colour_by=rowData(nz.sce)[chosen.genes.index[2],"ensembl_gene_id"]) + fontsize # + coord_fixed()

pca1

pca2

# display plots next to each other:
# multiplot(pca1, pca2, cols=2)

pca1 + facet_grid(. ~ nz.sce$Sample2) + coord_fixed()

pca2 + facet_grid(. ~ nz.sce$Sample2) + coord_fixed()

# display plots next to each other, splitting each by sample:
#multiplot(pca1 + facet_grid(. ~ nz.sce$Sample2),
#          pca2 + facet_grid(. ~ nz.sce$Sample2),
#          cols=2)

On t-SNE plot:

# plot TSNE, accessing counts for the gene of interest with the ID used to name rows in the count matrix:
# make and store TSNE plot for top HVG 1:
tsne1 <- plotTSNE(nz.sce, colour_by=rowData(nz.sce)[chosen.genes.index[1],"ensembl_gene_id"]) + fontsize
# make and store TSNE plot for top HVG 2:
tsne2 <- plotTSNE(nz.sce, colour_by=rowData(nz.sce)[chosen.genes.index[2],"ensembl_gene_id"]) + fontsize

tsne1

tsne2

# display plots next to each other:
#multiplot(tsne1, tsne2, cols=2)

tsne1 + facet_grid(. ~ nz.sce$Sample2)

tsne2 + facet_grid(. ~ nz.sce$Sample2)

# display plots next to each other, splitting each by sample:
#multiplot(tsne1 + facet_grid(. ~ nz.sce$Sample2), tsne2 + facet_grid(. ~ nz.sce$Sample2), cols=2)

1.9 Clustering cells into putative subpopulations

1.9.1 Defining cell clusters from expression data

See clustering methods on the Hemberg lab material.

We will use the denoised log-expression values to cluster cells.

1.9.1.1 hierarchical clustering

Here we’ll use hierarchical clustering on the Euclidean distances between cells, using Ward D2 criterion to minimize the total variance within each cluster.

This yields a dendrogram that groups together cells with similar expression patterns across the chosen genes.

1.9.1.1.1 clustering

Compute tree:

# get PCs
pcs <- reducedDim(nz.sce, "PCA")
# compute distance:
my.dist <- dist(pcs)
# derive tree:
my.tree <- hclust(my.dist, method="ward.D2")

Show tree:

plot(my.tree, labels = FALSE)

Clusters are identified in the dendrogram using a dynamic tree cut [8].

# identify clustering by cutting branches, requesting a minimum cluster size of 20 cells.
my.clusters <- unname(cutreeDynamic(my.tree, distM=as.matrix(my.dist), minClusterSize=20, verbose=0))

Let’s count cells for each cluster and each sample.

table(my.clusters, nz.sce$Sample)
##            
## my.clusters Tils20_1 Tils20_2 Tils32
##          1       346      368     16
##          2       348      313      4
##          3       160      216    210
##          4       244      248     69
##          5       271      257      7
##          6       216      288     11
##          7       230      267      8
##          8         5       10    456
##          9       206      254      1
##          10      183      230      7
##          11        6       18    310
##          12      136      134      0
##          13       96      128     35

Clusters mostly include cells from one sample or the other. This suggests that the two samples differ, and/or the presence of batch effect.

Let’s show cluster assignments on the t-SNE.

# store cluster assignemnt in SCE object:
nz.sce$cluster <- factor(my.clusters)
# make, store and show TSNE plot:
g <- plotTSNE(nz.sce, colour_by = "cluster", size_by = "total_features")
g

# split by sample and show:
g <- g + facet_grid(. ~ nz.sce$Sample2)
g

Cells in the same area are not all assigned to the same cluster.

1.9.1.1.2 Separatedness

The congruence of clusters may be assessed by computing the sillhouette for each cell. The larger the value the closer the cell to cells in its cluster than to cells in other clusters. Cells closer to cells in other clusters have a negative value. Good cluster separation is indicated by clusters whose cells have large silhouette values.

Compute silhouette:

sil <- silhouette(my.clusters, dist = my.dist)

Plot silhouettes with one color per cluster and cells with a negative silhouette with the color of their closest cluster. Add the average silhouette for each cluster and all cells.

# prepare colours:
clust.col <- scater:::.get_palette("tableau10medium") # hidden scater colours
sil.cols <- clust.col[ifelse(sil[,3] > 0, sil[,1], sil[,2])]
sil.cols <- sil.cols[order(-sil[,1], sil[,3])]
# 
plot(sil, main = paste(length(unique(my.clusters)), "clusters"), 
    border=sil.cols, col=sil.cols, do.col.sort=FALSE) 

The plot shows many cells with negative silhoutette indicating too many clusters were defined. The method and parameters used defined clusters with properties that may not fit the data set, eg clusters with the same diameter.

1.9.1.2 k-means

This approach assumes a pre-determined number of round equally-sized clusters.

The dendogram built above suggests there may be 5 or 6 large populations.

Let’s define 6 clusters.

# define clusters:
kclust <- kmeans(pcs, centers=6)

# compute silhouette
require("cluster")
sil <- silhouette(kclust$cluster, dist(pcs))

# plot silhouette:
clust.col <- scater:::.get_palette("tableau10medium") # hidden scater colours
sil.cols <- clust.col[ifelse(sil[,3] > 0, sil[,1], sil[,2])]
sil.cols <- sil.cols[order(-sil[,1], sil[,3])]
plot(sil, main = paste(length(unique(kclust$cluster)), "clusters"), 
    border=sil.cols, col=sil.cols, do.col.sort=FALSE) 

tSneCoord <- as.data.frame(reducedDim(nz.sce, "TSNE"))
colnames(tSneCoord) <- c("x", "y")
p2 <- ggplot(tSneCoord, aes(x, y)) +
    geom_point(aes(color = as.factor(kclust$cluster)))
p2 + facet_wrap(~ nz.sce$Sample2)

To find the most appropriate number of clusters, one performs the analysis for a series of k values, computes a measure of fit of the clusters defined: the within cluster sum-of-square. This value decreases as k increases, by an amount that decreases with k. Choose k at the inflexion point of the curve.

library(broom)
require(tibble)
require(dplyr)
require(tidyr)
library(purrr)
points <- as.tibble(pcs)
augment(kclust, points)
## # A tibble: 6,312 x 10
##      PC1    PC2     PC3    PC4    PC5     PC6    PC7    PC8     PC9
##    <dbl>  <dbl>   <dbl>  <dbl>  <dbl>   <dbl>  <dbl>  <dbl>   <dbl>
##  1  7.80 -2.25  -5.64    3.34   1.49  -4.42    2.03  -0.554  0.0315
##  2  5.49  4.78   0.896   0.259 -1.78  -0.0332  1.03  -2.01   0.918 
##  3 -5.10 -1.60  -4.46   -4.85  -0.525  0.774  -1.91  -1.28   0.125 
##  4 -7.26  1.76  -4.25   -1.71   2.08   0.439  -3.98  -1.63  -1.24  
##  5 -8.07 -4.87  -2.47   -7.74  -3.69  -0.286  -0.908 -2.15  -3.59  
##  6  3.40 -2.50  -2.51   -0.504 -0.425  2.53    4.11  -1.73   2.41  
##  7  8.07  8.22   0.351  -6.23  -3.22  -1.38   -0.353 -3.94  -3.53  
##  8  8.87  7.60   1.98   -3.78  -1.12   1.52   -0.769 -3.31  -0.100 
##  9  8.17 -6.29   0.0399  1.95  -1.32  -0.414  -0.890  0.369 -2.54  
## 10 -1.16  0.255 -3.62    2.30  -2.79   0.670  -0.183 -0.719 -0.408 
## # ... with 6,302 more rows, and 1 more variable: .cluster <fct>
kclusts <- tibble(k = 1:9) %>%
  mutate(
    kclust = map(k, ~kmeans(points, .x)),
    tidied = map(kclust, tidy),
    glanced = map(kclust, glance),
    augmented = map(kclust, augment, points)
  )

kclusts
## # A tibble: 9 x 5
##       k kclust       tidied           glanced         augmented           
##   <int> <list>       <list>           <list>          <list>              
## 1     1 <S3: kmeans> <tibble [1 x 12~ <tibble [1 x 4~ <tibble [6,312 x 10~
## 2     2 <S3: kmeans> <tibble [2 x 12~ <tibble [1 x 4~ <tibble [6,312 x 10~
## 3     3 <S3: kmeans> <tibble [3 x 12~ <tibble [1 x 4~ <tibble [6,312 x 10~
## 4     4 <S3: kmeans> <tibble [4 x 12~ <tibble [1 x 4~ <tibble [6,312 x 10~
## 5     5 <S3: kmeans> <tibble [5 x 12~ <tibble [1 x 4~ <tibble [6,312 x 10~
## 6     6 <S3: kmeans> <tibble [6 x 12~ <tibble [1 x 4~ <tibble [6,312 x 10~
## 7     7 <S3: kmeans> <tibble [7 x 12~ <tibble [1 x 4~ <tibble [6,312 x 10~
## 8     8 <S3: kmeans> <tibble [8 x 12~ <tibble [1 x 4~ <tibble [6,312 x 10~
## 9     9 <S3: kmeans> <tibble [9 x 12~ <tibble [1 x 4~ <tibble [6,312 x 10~
clusters <- kclusts %>%
  unnest(tidied)
## Warning in bind_rows_(x, .id): Unequal factor levels: coercing to character
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector

## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector

## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector

## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector

## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector

## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector

## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector

## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector

## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
assignments <- kclusts %>% 
  unnest(augmented)
## Warning in bind_rows_(x, .id): Unequal factor levels: coercing to character

## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector

## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector

## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector

## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector

## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector

## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector

## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector

## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector

## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
clusterings <- kclusts %>%
  unnest(glanced, .drop = TRUE)

Plot the total within cluster sum-of-squares and decide on k.

ggplot(clusterings, aes(k, tot.withinss)) +
  geom_line()

Copy the cluster assignment to the SCE object.

df <- as.data.frame(assignments)
nz.sce$kmeans5 <- as.numeric(df[df$k == 5, ".cluster"])

Check silhouette for a k of 5.

library(cluster)
clust.col <- scater:::.get_palette("tableau10medium") # hidden scater colours
sil <- silhouette(nz.sce$kmeans5, dist = my.dist)
sil.cols <- clust.col[ifelse(sil[,3] > 0, sil[,1], sil[,2])]
sil.cols <- sil.cols[order(-sil[,1], sil[,3])]
plot(sil, main = paste(length(unique(nz.sce$kmeans5)), "clusters"), 
    border=sil.cols, col=sil.cols, do.col.sort=FALSE) 

1.9.1.3 graph-based clustering

Let’s build a shared nearest-neighbour graph using cells as nodes, then perform community-based clustering.

Build graph, define clusters, check membership across samples, show membership on t-SNE.

#compute graph
snn.gr <- buildSNNGraph(nz.sce, use.dimred="PCA")
# derive clusters
cluster.out <- igraph::cluster_walktrap(snn.gr)
# count cell in each cluster for each sample
my.clusters <- cluster.out$membership
table(my.clusters, nz.sce$Sample)
##            
## my.clusters Tils20_1 Tils20_2 Tils32
##          1       250      343     22
##          2         0        2    495
##          3       157      152      0
##          4        18       35    243
##          5       485      534     19
##          6       219      257      2
##          7       437      498      9
##          8       289      290     16
##          9       521      522     30
##          10        4        8    270
##          11       47       62      0
##          12        7        4     28
##          13       13       24      0
# store membership
nz.sce$cluster <- factor(my.clusters)
# shoe clusters on TSNE
plotTSNE(nz.sce, colour_by="cluster") + fontsize

Compute modularity to assess clusters quality. The closer to 1 the better.

igraph::modularity(cluster.out)
## [1] 0.7177792

Show similarity between clusters on a network.

cluster.gr <- igraph::graph_from_adjacency_matrix(ratio, 
    mode="undirected", weighted=TRUE, diag=FALSE)
plot(cluster.gr, edge.width=igraph::E(cluster.gr)$weight*10)  

1.9.2 Detecting genes differentially expressed between clusters

1.9.2.1 Differential expression analysis

Let’s identify genes for each cluster whose expression differ to that of other clusters, using findMarkers(). It fits a linear model to the log-expression values for each gene using limma [9] and allows testing for differential expression in each cluster compared to the others while accounting for known, uninteresting factors.

markers <- findMarkers(nz.sce, my.clusters)

Results are compiled in a single table per cluster that stores the outcome of comparisons against the other clusters. One can then select differentially expressed genes from each pairwise comparison between clusters.

Let’s define a set of genes for cluster 1 by selecting the top 10 genes of each comparison, and check test output, eg adjusted p-values and log-fold changes.

# get output table for clsuter 1:
marker.set <- markers[["1"]]
head(marker.set, 10)
## DataFrame with 10 rows and 14 columns
##                       Top                   FDR            logFC.2
##                 <integer>             <numeric>          <numeric>
## ENSG00000271503         1                     0  -1.48823037042183
## ENSG00000109475         1 3.11087666107904e-230 -0.250111652525046
## ENSG00000235576         1 2.51152043332458e-213  -2.08641507266965
## ENSG00000213741         1 1.96018109829915e-192 -0.867091105046371
## ENSG00000251562         1 3.86985319668547e-187 -0.759686207196149
## ENSG00000177954         1 5.18841055764991e-187 -0.752011008871412
## ENSG00000008517         1 5.70474617152053e-144 -0.932236822648199
## ENSG00000186827         1 6.77306461181548e-119  0.864317721587458
## ENSG00000211772         1  2.07383355504811e-88 -0.289914795337167
## ENSG00000101132         1  9.10178614957634e-63  0.162958491276682
##                             logFC.3              logFC.4
##                           <numeric>            <numeric>
## ENSG00000271503   -1.67682441077556    -1.15940677032237
## ENSG00000109475   -1.20046278678098    -1.43242288894484
## ENSG00000235576 -0.0890765986971215   -0.358169799493698
## ENSG00000213741   -1.08068139459579     -1.9383494196812
## ENSG00000251562   -1.46530208750593    -1.71377662197724
## ENSG00000177954    -1.3167097101971    -1.88327699511164
## ENSG00000008517  -0.181223861184998   -0.387885951817807
## ENSG00000186827    1.47293204219923     1.25338797340663
## ENSG00000211772  -0.117569053696624 0.000863187940300936
## ENSG00000101132   0.155740275488572    0.203608351808405
##                            logFC.5              logFC.6
##                          <numeric>            <numeric>
## ENSG00000271503  0.157126794898773    -1.06384952163656
## ENSG00000109475 0.0466164804270832    -1.28943807188176
## ENSG00000235576  0.107782233907268 -0.00775440904243648
## ENSG00000213741  0.272202434569872    -1.06769205395966
## ENSG00000251562  -1.11155723175033    -1.75156078825651
## ENSG00000177954  0.217011392557205    -1.15335700432701
## ENSG00000008517  -1.44603356715269  -0.0328979865225243
## ENSG00000186827 -0.248384145933944     1.33978796868972
## ENSG00000211772 -0.262707418077933    0.048973365464241
## ENSG00000101132 0.0634565970586403    0.091387253599955
##                             logFC.7             logFC.8            logFC.9
##                           <numeric>           <numeric>          <numeric>
## ENSG00000271503  -0.900637196502367   -3.81805908352674  -3.03040692575619
## ENSG00000109475   -1.53557041287179   0.172116017747668 -0.623791207037284
## ENSG00000235576 -0.0240420775062045  -0.596300933690309 -0.153554786550753
## ENSG00000213741   -1.22709826746302 -0.0107977573894051 -0.812648252722383
## ENSG00000251562   -1.85477889223549   -1.05374492750487  -1.74263732493059
## ENSG00000177954   -1.28934203720405  -0.206372019240512 -0.834454094275736
## ENSG00000008517  -0.228275223461285  -0.102745105908534 -0.360694692927102
## ENSG00000186827    1.38993763919276    1.56519416063097     1.592099933136
## ENSG00000211772 -0.0308212519343181  -0.108557909029816 -0.146304078995942
## ENSG00000101132   0.137961121009457   0.130877180199904   0.13243498722531
##                            logFC.10           logFC.11            logFC.12
##                           <numeric>          <numeric>           <numeric>
## ENSG00000271503   0.025927481402104  -2.76858314936854  0.0992116300101846
## ENSG00000109475  -0.269252007266126  0.856433381759021    1.10829581814226
## ENSG00000235576  -0.036505025289166 -0.158138944358747 -0.0676224932108048
## ENSG00000213741  -0.816417715608068  0.442667461695114   0.807294392473908
## ENSG00000251562   -1.32115016478086 0.0295529900883622    -2.5130272786891
## ENSG00000177954  -0.605872107915449  0.503886696091628   0.975233120657861
## ENSG00000008517   -1.97692671933442  0.400600270706476  -0.617733972694023
## ENSG00000186827 -0.0922372019395874   1.60552663606738  -0.344513211393818
## ENSG00000211772  -0.213827715963906  0.181497355041017   0.266359648815204
## ENSG00000101132  0.0585401592213391 0.0656013736473839   0.355429793386539
##                           logFC.13
##                          <numeric>
## ENSG00000271503 0.0554816146672676
## ENSG00000109475 -0.279954342988843
## ENSG00000235576  0.120446313740584
## ENSG00000213741  0.284399783498047
## ENSG00000251562 -0.524471743097364
## ENSG00000177954  0.227101803992638
## ENSG00000008517   2.11989173117137
## ENSG00000186827   1.46035216205357
## ENSG00000211772   1.41212578508558
## ENSG00000101132  0.147542782766905
# add gene annotation:
tmpDf <- marker.set
tmpDf$ensembl_gene_id <- rownames(tmpDf)
tmpDf2 <- base::merge(tmpDf, rowData(nz.sce), by="ensembl_gene_id", all.x=TRUE, all.y=F, sort=F)

Write Table to file:

rObjFile <- "Tcells_nz.sce_comb_clu1_deg.tsv"
#tmpFileName <- file.path(inpDir, dataSubDir, rObjFile)
tmpFileName <- file.path(rObjFile)
write.table(tmpDf2, file=tmpFileName, sep="\t", quote=FALSE, row.names=FALSE)

Gene set enrichment analyses learnt earlier today may be used to characterise clusters further.

1.9.2.2 Heatmap

As for bulk RNA, differences in expression profiles of the top genes can be visualised with a heatmap.

# select some top genes:
top.markers <- rownames(marker.set)[marker.set$Top <= 10]

# have matrix to annotate sample with cluster and sample:
tmpData <- logcounts(nz.sce)[top.markers,]
# concat sample and barcode names to make unique name across the whole data set
tmpCellNames <- paste(colData(nz.sce)$Sample, colData(nz.sce)$Barcode, sep="_")
# use these to namecolumn of matrix the show as heatmap:
colnames(tmpData) <- tmpCellNames # colData(nz.sce)$Barcode                    

# columns annotation with cell name:
mat_col <- data.frame(cluster = nz.sce$cluster, sample = nz.sce$Sample)
rownames(mat_col) <- colnames(tmpData)
rownames(mat_col) <- tmpCellNames # colData(nz.sce)$Barcode

# Prepare colours for clusters:
colourCount = length(unique(nz.sce$cluster))
getPalette = colorRampPalette(brewer.pal(9, "Set1"))

mat_colors <- list(group = getPalette(colourCount))
names(mat_colors$group) <- unique(nz.sce$cluster)

# plot heatmap:
pheatmap(tmpData,
           border_color      = NA,
  show_colnames     = FALSE,
  show_rownames     = FALSE,
  drop_levels       = TRUE,
         annotation_col    = mat_col,
         annotation_colors = mat_colors
         )

One can sort both the gene and sample dendrograms to improve the heatmap.

library(dendsort)

mat <- tmpData
mat_cluster_cols <- hclust(dist(t(mat)))

sort_hclust <- function(...) as.hclust(dendsort(as.dendrogram(...)))

mat_cluster_cols <- sort_hclust(mat_cluster_cols)
#plot(mat_cluster_cols, main = "Sorted Dendrogram", xlab = "", sub = "")

mat_cluster_rows <- sort_hclust(hclust(dist(mat)))

pheatmap(tmpData,
           border_color      = NA,
           show_colnames     = FALSE,
           show_rownames     = FALSE,
           drop_levels       = TRUE,
           annotation_col    = mat_col,
           annotation_colors = mat_colors,
           cluster_cols      = mat_cluster_cols,
           cluster_rows      = mat_cluster_rows
         )

1.9.2.3 Challenges

Challenge? Compare t-SNE obtained here to that shown in the article and show expression level of reported markers genes on t-SNE plots.

Challenge? Identify genes that are upregulated in each cluster compared to others.

Challenge? Identify genes differentially expressed between a cluster and all others.

Challenge? Identify genes whose distribution of expression, rather than their average expression, differs between clusters (Hint: overlapExprs() may help).

Save session to file.

rObjFile <- "Tcells_SCE_comb_session.RData"

# check file exists:
#tmpFileName <- file.path(inpDir, dataSubDir, rObjFile)
tmpFileName <- file.path(rObjFile)

# uncomment to save sesssion # 
save.image(file=tmpFileName)

1.10 Other types of analyses beyond this brief introduction

Several tools for single cell analyses, eg Seurat [10], were not covered in this brief introduction this afternoon. Please refer to links above for more information on these and more advanced analyses such as progress along a differentiation pathway, or pseudotime, with monocle [11] or TSCAN [12], and gene set enrichment analyses used for bulk data or designed single-cell methods like scde [13]. Please also refer to the article reporting the analysis of this data set, including batch correction [14].

References

1. Andrews TS, Hemberg M. Identifying cell populations with scRNASeq. Molecular Aspects of Medicine. 2018;59:114–22. doi:https://doi.org/10.1016/j.mam.2017.07.002.

2. Lun A, McCarthy D, Marioni J. A step-by-step workflow for low-level analysis of single-cell rna-seq data with bioconductor [version 2; referees: 3 approved, 2 approved with reservations]. F1000Research. 2016;5.

3. Lun A, McCarthy D, Marioni J. SimpleSingleCell bioconductor package. https://master.bioconductor.org/packages/release/workflows/html/simpleSingleCell.html.

4. Kiselev V, Andrews T, Westoby J, McCarthy D, M B, M H. Analysis of single cell rna-seq data. https://hemberg-lab.github.io/scRNA.seq.course/index.html.

5. Field A, Miles J, Field Z. Discovering statistics using r. SAGE Publications; 2012. https://www.discoveringstatistics.com/books/discovering-statistics-using-r/.

6. Patcher L. What is principal component analysis? https://liorpachter.wordpress.com/2014/05/26/what-is-principal-component-analysis/.

8. Langfelder P, Zhang B, Horvath S. Defining clusters from a hierarchical cluster tree: The dynamic tree cut package for r. Bioinformatics. 2008;24:719–20. doi:10.1093/bioinformatics/btm563.

9. Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, et al. Limma powers differential expression analyses for rna-sequencing and microarray studies. Nucleic Acids Research. 2015;43:e47. doi:10.1093/nar/gkv007.

10. Butler A, Hoffman P, Smibert P, Papalexi E, Satija R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nature Biotechnology. 2018. doi:10.1038/nbt.4096.

11. Trapnell C, Cacchiarelli D, Grimsby J, Pokharel P, Li S, Morse M, et al. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat Biotechnol. 2014;32:381–6.

12. Ji Z, Ji H. TSCAN: Pseudo-time reconstruction and evaluation in single-cell RNA-seq analysis. Nucleic Acids Res. 2016;44:e117.

13. Fan J, Salathia N, Liu R, Kaeser GE, Yung YC, Herman JL, et al. Characterizing transcriptional heterogeneity through pathway and gene set overdispersion analysis. Nat Methods. 2016;13:241–4.

14. Savas P, Virassamy B, Ye C, Salim A, Mintoff CP, Caramia F, et al. Single-cell profiling of breast cancer T cells reveals a tissue-resident memory subset associated with improved prognosis. Nat Med. 2018;24:986–93.