In many situations, one is studying a process where cells change continuously. This includes, for example, many differentiation processes taking place during development: following a stimulus, cells will change from one cell-type to another. Ideally, we would like to monitor the expression levels of an individual cell over time. Unfortunately, such monitoring is not possible with scRNA-seq since the cell is lysed (destroyed) when the RNA is extracted.
Instead, we must sample at multiple time-points and obtain snapshots of the gene expression profiles. Since some of the cells will proceed faster along the differentiation than others, each snapshot may contain cells at varying points along the developmental progression. We use statistical methods to order the cells along one or more trajectories which represent the underlying developmental trajectories, this ordering is referred to as “pseudotime”.
A recent benchmarking paper by Saelens et al provides a detailed summary of the various computational methods for trajectory inference from single-cell transcriptomics. They discuss 45 tools and evaluate them across various aspects including accuracy, scalability, and usability. They provide dynverse, an open set of packages to benchmark, construct and interpret single-cell trajectories (currently they have a uniform interface for 60 methods).
library(SingleCellExperiment)
library(scran)
library(scater)
library(batchelor)
library(cowplot)
library(pheatmap)
library(tidyverse)
library(SingleR)
library(destiny)
library(gam)
library(viridis)
library(msigdbr)
library(clusterProfiler)
library(cellAlign)
We load the SCE object we have generated previously. This object contains only the T-cells from 8 healthy donors. We will first prepare the data by identifying variable genes, integrating the data across donors and calculating principal components.
sce.tcell<-readRDS(file="~/Course_Materials/scRNAseq/Robjects/hca.tcell.RDS")
sce.tcell
## class: SingleCellExperiment
## dim: 20425 20522
## metadata(0):
## assays(2): counts logcounts
## rownames(20425): AL627309.1 AL669831.5 ... AC233755.1 AC240274.1
## rowData names(11): ensembl_gene_id external_gene_name ... detected
## gene_sparsity
## colnames: NULL
## colData names(19): Sample Barcode ... label cell_type
## reducedDimNames(4): MNN PCA UMAP TSNE
## mainExpName: NULL
## altExpNames(0):
Diffusion maps were introduced by Ronald Coifman and Stephane Lafon, and the underlying idea is to assume that the data are samples from a diffusion process. The method infers the low-dimensional manifold by estimating the eigenvalues and eigenvectors for the diffusion operator related to the data. Angerer et al have applied the diffusion maps concept to the analysis of single-cell RNA-seq data to create an R package called destiny.
There are 8 samples in this dataset:
table(colData(sce.tcell)$Sample.Name)
##
## MantonBM1 MantonBM2 MantonBM3 MantonBM4 MantonBM5 MantonBM6 MantonBM7 MantonBM8
## 2917 2669 2701 1688 1968 2654 2994 2931
For ease of computation, we will perform pseudotime analysis only on one sample, and we will downsample the object to 1000 cells. We will select the sample named MantonBM1
.
# pull the barcodes for MantonBM1 sample & and downsample the set to 1000 genes
vec.bc <- colData(sce.tcell) %>%
data.frame() %>%
filter(Sample.Name == "MantonBM1") %>%
group_by(Sample.Name) %>%
sample_n(1000) %>%
pull(Barcode)
Number of cells in the sample:
table(colData(sce.tcell)$Barcode %in% vec.bc)
##
## FALSE TRUE
## 19522 1000
Subset cells from the main SCE object:
tmpInd <- which(colData(sce.tcell)$Barcode %in% vec.bc)
sce.tcell.BM1 <- sce.tcell[,tmpInd]
sce.tcell.BM1
## class: SingleCellExperiment
## dim: 20425 1000
## metadata(0):
## assays(2): counts logcounts
## rownames(20425): AL627309.1 AL669831.5 ... AC233755.1 AC240274.1
## rowData names(11): ensembl_gene_id external_gene_name ... detected
## gene_sparsity
## colnames: NULL
## colData names(19): Sample Barcode ... label cell_type
## reducedDimNames(4): MNN PCA UMAP TSNE
## mainExpName: NULL
## altExpNames(0):
Identify top 500 highly variable genes
dec.tcell.BM1 <- modelGeneVar(sce.tcell.BM1)
top.tcell.BM1 <- getTopHVGs(dec.tcell.BM1, n=500)
We will extract normalized counts for HVG to use in pseudotime alignment
tcell_BM1_counts<-logcounts(sce.tcell.BM1)
tcell_BM1_counts<-t(as.matrix(tcell_BM1_counts[top.tcell.BM1,]))
cellLabels <- sce.tcell.BM1$Barcode
rownames(tcell_BM1_counts)<-cellLabels
tcell_BM1_counts[1:4,1:4]
## CCL5 S100A4 NKG7 KLRB1
## AAACCTGCAAAGCGGT-13 0.000000 0.000000 0.000000 0
## AAACCTGGTAATCACC-13 0.000000 3.576974 0.000000 0
## AAACCTGTCCCTAACC-13 4.375961 3.598941 1.160406 0
## AAACCTGTCGCCAAAT-13 0.000000 2.721323 0.000000 0
And finally, we can run pseudotime alignment with destiny
dm_tcell_BM1 <- DiffusionMap(tcell_BM1_counts,n_pcs = 50)
Plot diffusion component 1 vs diffusion component 2 (DC1 vs DC2).
tmp <- data.frame(DC1 = eigenvectors(dm_tcell_BM1)[, 1],
DC2 = eigenvectors(dm_tcell_BM1)[, 2])
ggplot(tmp, aes(x = DC1, y = DC2)) +
geom_point() +
xlab("Diffusion component 1") +
ylab("Diffusion component 2") +
theme_classic()
Stash diffusion components to SCE object
sce.tcell.BM1$pseudotime_destiny_1<-eigenvectors(dm_tcell_BM1)[, 1]
sce.tcell.BM1$pseudotime_destiny_2<-eigenvectors(dm_tcell_BM1)[, 2]
After running destiny, an interesting next step may be to find genes that change their expression over the course of time We demonstrate one possible method for this type of analysis on the 500 most variable genes. We will regress each gene on the pseudotime variable we have generated, using a general additive model (GAM). This allows us to detect non-linear patterns in gene expression. We are going to use HVG we identified in the previous step, but this analysis can also be done using the whole transcriptome.
# Only look at the 500 most variable genes when identifying temporally expressesd genes.
# Identify the variable genes by ranking all genes by their variance.
# We will use the first diffusion components as a measure of pseudotime
Y<-log2(counts(sce.tcell.BM1)+1)
colnames(Y)<-cellLabels
Y<-Y[top.tcell.BM1,]
# Fit GAM for each gene using pseudotime as independent variable.
t <- eigenvectors(dm_tcell_BM1)[, 1]
gam.pval <- apply(Y, 1, function(z){
d <- data.frame(z=z, t=t)
tmp <- gam(z ~ lo(t), data=d)
p <- summary(tmp)[4][[1]][1,5]
p
})
Select top 30 genes for visualization
# Identify genes with the most significant time-dependent model fit.
topgenes <- names(sort(gam.pval, decreasing = FALSE))[1:30]
Visualize these genes in a heatmap
heatmapdata <- Y[topgenes,]
heatmapdata <- heatmapdata[,order(t, na.last = NA)]
t_ann<-as.data.frame(t)
colnames(t_ann)<-"pseudotime"
pheatmap(heatmapdata, cluster_rows = T, cluster_cols = F, color = plasma(200), show_colnames = F, annotation_col = t_ann)
Visualize how some of the temporally expressed genes change in time
Following individual genes is very helpful for identifying genes that play an important role in the differentiation process. We illustrate the procedure using the GZMA gene. We have added the pseudotime values computed with destiny to the colData slot of the SCE object. Having done that, the full plotting capabilities of the scater package can be used to investigate relationships between gene expression, cell populations and pseudotime.
plotExpression(sce.tcell.BM1, "GZMA", x = "pseudotime_destiny_1",
show_violin = TRUE,
show_smooth = TRUE)
Obtain pseudotime for one of the Caron samples.
sce_caron<-readRDS("~/Course_Materials/scRNAseq/Robjects/DataIntegration_uncorrected.rds")
rownames(sce_caron) <- uniquifyFeatureNames(rowData(sce_caron)$ID, names = rowData(sce_caron)$Symbol)
# extract one sample from the sce_caron object
vec.bc <- colData(sce_caron) %>%
data.frame() %>%
filter(SampleName == "ETV6-RUNX1_1") %>%
group_by(SampleName) %>%
pull(Barcode)
tmpInd <- which(colData(sce_caron)$Barcode %in% vec.bc)
sce.caron.S1 <- sce_caron[,tmpInd]
# Identift HVG
dec.caron.S1 <- modelGeneVar(sce.caron.S1)
top.caron.S1 <- getTopHVGs(dec.caron.S1, n=500)
# extract normalized count data for HVG
caron.S1_counts<-logcounts(sce.caron.S1)
caron.S1_counts<-t(as.matrix(caron.S1_counts[top.caron.S1,]))
cellLabels <- sce.caron.S1$Barcode
rownames(caron.S1_counts)<-cellLabels
dm_caron.S1 <- DiffusionMap(caron.S1_counts,n_pcs = 50)
tmp <- data.frame(DC1 = eigenvectors(dm_caron.S1)[, 1],
DC2 = eigenvectors(dm_caron.S1)[, 2])
ggplot(tmp, aes(x = DC1, y = DC2)) +
geom_point() +
xlab("Diffusion component 1") +
ylab("Diffusion component 2") +
theme_classic()
Y<-log2(counts(sce.caron.S1)+1)
colnames(Y)<-cellLabels
Y<-Y[top.caron.S1,]
# Fit GAM for each gene using pseudotime as independent variable.
t <- eigenvectors(dm_caron.S1)[, 1]
gam.pval <- apply(Y, 1, function(z){
d <- data.frame(z=z, t=t)
tmp <- gam(z ~ lo(t), data=d)
p <- summary(tmp)[4][[1]][1,5]
p
})
Select top 30 genes for visualization
# Identify genes with the most significant time-dependent model fit.
topgenes <- names(sort(gam.pval, decreasing = FALSE))[1:30]
Visualize these genes in a heatmap
heatmapdata <- Y[topgenes,]
heatmapdata <- heatmapdata[,order(t, na.last = NA)]
t_ann<-as.data.frame(t)
colnames(t_ann)<-"pseudotime"
pheatmap(heatmapdata, cluster_rows = T, cluster_cols = F, color = plasma(200), show_colnames = F, annotation_col = t_ann)
What kind of a dynamic process might be taking place in this cancer cell?
We can quickly check in which pathways these top genes are enriched using MSigDB.
Molecular Signatures Database contains 8 major collections:
We are going to use hallmark gene sets (H
) and perform a hypergeometric test with our top 30 genes for all HALLMARK sets.
msigdb_hallmark<-msigdbr(species = "Homo sapiens", category = "H") %>% dplyr::select(gs_name, gene_symbol)
em<-enricher(topgenes, TERM2GENE=msigdb_hallmark)
head(em)[,"qvalue",drop=F]
This notebook uses material from SVI course, OSCA Book, Broad Institute Workshop and Hemberg Group Course.
sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=C
## [4] LC_COLLATE=C LC_MONETARY=C LC_MESSAGES=C
## [7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=C LC_IDENTIFICATION=C
##
## attached base packages:
## [1] splines parallel stats4 stats graphics grDevices utils
## [8] datasets methods base
##
## other attached packages:
## [1] cellAlign_0.1.0 clusterProfiler_4.0.2
## [3] msigdbr_7.4.1 viridis_0.6.1
## [5] viridisLite_0.4.0 gam_1.20
## [7] foreach_1.5.1 destiny_3.4.0
## [9] SingleR_1.6.1 forcats_0.5.1
## [11] stringr_1.4.0 dplyr_1.0.7
## [13] purrr_0.3.4 readr_1.4.0
## [15] tidyr_1.1.3 tibble_3.1.2
## [17] tidyverse_1.3.1 pheatmap_1.0.12
## [19] cowplot_1.1.1 batchelor_1.8.0
## [21] scater_1.20.1 ggplot2_3.3.5
## [23] scran_1.20.1 scuttle_1.2.0
## [25] SingleCellExperiment_1.14.1 SummarizedExperiment_1.22.0
## [27] Biobase_2.52.0 GenomicRanges_1.44.0
## [29] GenomeInfoDb_1.28.1 IRanges_2.26.0
## [31] S4Vectors_0.30.0 BiocGenerics_0.38.0
## [33] MatrixGenerics_1.4.0 matrixStats_0.59.0
## [35] knitr_1.33
##
## loaded via a namespace (and not attached):
## [1] ggthemes_4.2.4 bit64_4.0.5
## [3] irlba_2.3.3 DelayedArray_0.18.0
## [5] data.table_1.14.0 KEGGREST_1.32.0
## [7] RCurl_1.98-1.3 generics_0.1.0
## [9] ScaledMatrix_1.0.0 RSQLite_2.2.7
## [11] shadowtext_0.0.8 proxy_0.4-26
## [13] bit_4.0.4 enrichplot_1.12.2
## [15] xml2_1.3.2 lubridate_1.7.10
## [17] assertthat_0.2.1 xfun_0.24
## [19] hms_1.1.0 jquerylib_0.1.4
## [21] babelgene_21.4 evaluate_0.14
## [23] DEoptimR_1.0-9 fansi_0.5.0
## [25] dbplyr_2.1.1 readxl_1.3.1
## [27] igraph_1.2.6 DBI_1.1.1
## [29] ellipsis_0.3.2 RSpectra_0.16-0
## [31] backports_1.2.1 sparseMatrixStats_1.4.0
## [33] vctrs_0.3.8 TTR_0.24.2
## [35] abind_1.4-5 cachem_1.0.5
## [37] RcppEigen_0.3.3.9.1 withr_2.4.2
## [39] ggforce_0.3.3 robustbase_0.93-8
## [41] vcd_1.4-8 treeio_1.16.1
## [43] xts_0.12.1 cluster_2.1.2
## [45] DOSE_3.18.1 lazyeval_0.2.2
## [47] ape_5.5 laeken_0.5.1
## [49] crayon_1.4.1 labeling_0.4.2
## [51] edgeR_3.34.0 pkgconfig_2.0.3
## [53] tweenr_1.0.2 nlme_3.1-152
## [55] vipor_0.4.5 nnet_7.3-16
## [57] rlang_0.4.11 lifecycle_1.0.0
## [59] downloader_0.4 modelr_0.1.8
## [61] rsvd_1.0.5 cellranger_1.1.0
## [63] polyclip_1.10-0 RcppHNSW_0.3.0
## [65] lmtest_0.9-38 Matrix_1.3-4
## [67] aplot_0.0.6 carData_3.0-4
## [69] boot_1.3-28 zoo_1.8-9
## [71] reprex_2.0.0 beeswarm_0.4.0
## [73] png_0.1-7 knn.covertree_1.0
## [75] bitops_1.0-7 Biostrings_2.60.1
## [77] blob_1.2.1 DelayedMatrixStats_1.14.0
## [79] qvalue_2.24.0 beachmat_2.8.0
## [81] scales_1.1.1 memoise_2.0.0
## [83] magrittr_2.0.1 plyr_1.8.6
## [85] hexbin_1.28.2 zlibbioc_1.38.0
## [87] scatterpie_0.1.6 compiler_4.1.0
## [89] dqrng_0.3.0 RColorBrewer_1.1-2
## [91] pcaMethods_1.84.0 dtw_1.22-3
## [93] cli_3.0.1 XVector_0.32.0
## [95] patchwork_1.1.1 mgcv_1.8-36
## [97] ggplot.multistats_1.0.0 MASS_7.3-54
## [99] tidyselect_1.1.1 stringi_1.7.3
## [101] highr_0.9 yaml_2.2.1
## [103] GOSemSim_2.18.0 BiocSingular_1.8.1
## [105] locfit_1.5-9.4 ggrepel_0.9.1
## [107] grid_4.1.0 sass_0.4.0
## [109] fastmatch_1.1-0 tools_4.1.0
## [111] rio_0.5.27 rstudioapi_0.13
## [113] bluster_1.2.1 foreign_0.8-81
## [115] metapod_1.0.0 gridExtra_2.3
## [117] smoother_1.1 scatterplot3d_0.3-41
## [119] farver_2.1.0 ggraph_2.0.5
## [121] BiocManager_1.30.16 rvcheck_0.1.8
## [123] digest_0.6.27 Rcpp_1.0.7
## [125] car_3.0-11 broom_0.7.8
## [127] httr_1.4.2 AnnotationDbi_1.54.1
## [129] colorspace_2.0-2 rvest_1.0.0
## [131] fs_1.5.0 ranger_0.13.1
## [133] statmod_1.4.36 tidytree_0.3.4
## [135] graphlayouts_0.7.1 sp_1.4-5
## [137] jsonlite_1.7.2 ggtree_3.0.2
## [139] tidygraph_1.2.0 R6_2.5.0
## [141] pillar_1.6.1 htmltools_0.5.1.1
## [143] glue_1.4.2 fastmap_1.1.0
## [145] VIM_6.1.0 BiocParallel_1.26.1
## [147] BiocNeighbors_1.10.0 class_7.3-19
## [149] codetools_0.2-18 fgsea_1.18.0
## [151] utf8_1.2.1 lattice_0.20-44
## [153] bslib_0.2.5.1 ResidualMatrix_1.2.0
## [155] curl_4.3.2 ggbeeswarm_0.6.0
## [157] zip_2.2.0 GO.db_3.13.0
## [159] openxlsx_4.2.4 limma_3.48.1
## [161] rmarkdown_2.9 munsell_0.5.0
## [163] e1071_1.7-7 DO.db_2.9
## [165] GenomeInfoDbData_1.2.6 iterators_1.0.13
## [167] haven_2.4.1 reshape2_1.4.4
## [169] gtable_0.3.0