In many situations, one is studying a process where cells change continuously. This includes, for example, many differentiation processes taking place during development: following a stimulus, cells will change from one cell-type to another. Ideally, we would like to monitor the expression levels of an individual cell over time. Unfortunately, such monitoring is not possible with scRNA-seq since the cell is lysed (destroyed) when the RNA is extracted.

Instead, we must sample at multiple time-points and obtain snapshots of the gene expression profiles. Since some of the cells will proceed faster along the differentiation than others, each snapshot may contain cells at varying points along the developmental progression. We use statistical methods to order the cells along one or more trajectories which represent the underlying developmental trajectories, this ordering is referred to as “pseudotime”.

A recent benchmarking paper by Saelens et al provides a detailed summary of the various computational methods for trajectory inference from single-cell transcriptomics. They discuss 45 tools and evaluate them across various aspects including accuracy, scalability, and usability. They provide dynverse, an open set of packages to benchmark, construct and interpret single-cell trajectories (currently they have a uniform interface for 60 methods).

1 Setting up the data

library(SingleCellExperiment)
library(scran)
library(scater)
library(batchelor)
library(cowplot)
library(pheatmap)
library(tidyverse)
library(SingleR)
library(destiny)
library(gam)
library(viridis)
library(msigdbr)
library(clusterProfiler)
library(cellAlign)

We load the SCE object we have generated previously. This object contains only the T-cells from 8 healthy donors. We will first prepare the data by identifying variable genes, integrating the data across donors and calculating principal components.

sce.tcell<-readRDS(file="~/Course_Materials/scRNAseq/Robjects/hca.tcell.RDS")

sce.tcell

## class: SingleCellExperiment 
## dim: 20425 20522 
## metadata(0):
## assays(2): counts logcounts
## rownames(20425): AL627309.1 AL669831.5 ... AC233755.1 AC240274.1
## rowData names(11): ensembl_gene_id external_gene_name ... detected
##   gene_sparsity
## colnames: NULL
## colData names(19): Sample Barcode ... label cell_type
## reducedDimNames(4): MNN PCA UMAP TSNE
## mainExpName: NULL
## altExpNames(0):

1.1 Trajectory inference with destiny

Diffusion maps were introduced by Ronald Coifman and Stephane Lafon, and the underlying idea is to assume that the data are samples from a diffusion process. The method infers the low-dimensional manifold by estimating the eigenvalues and eigenvectors for the diffusion operator related to the data. Angerer et al have applied the diffusion maps concept to the analysis of single-cell RNA-seq data to create an R package called destiny.

There are 8 samples in this dataset:

table(colData(sce.tcell)$Sample.Name)

## 
## MantonBM1 MantonBM2 MantonBM3 MantonBM4 MantonBM5 MantonBM6 MantonBM7 MantonBM8 
##      2917      2669      2701      1688      1968      2654      2994      2931

For ease of computation, we will perform pseudotime analysis only on one sample, and we will downsample the object to 1000 cells. We will select the sample named MantonBM1.

# pull the barcodes for MantonBM1 sample & and downsample the set to 1000 genes 
vec.bc <- colData(sce.tcell) %>%
    data.frame() %>%
    filter(Sample.Name == "MantonBM1") %>%
    group_by(Sample.Name) %>%
    sample_n(1000) %>%
    pull(Barcode)

Number of cells in the sample:

table(colData(sce.tcell)$Barcode %in% vec.bc)

## 
## FALSE  TRUE 
## 19522  1000

Subset cells from the main SCE object:

tmpInd <- which(colData(sce.tcell)$Barcode %in% vec.bc)
sce.tcell.BM1 <- sce.tcell[,tmpInd]

sce.tcell.BM1

## class: SingleCellExperiment 
## dim: 20425 1000 
## metadata(0):
## assays(2): counts logcounts
## rownames(20425): AL627309.1 AL669831.5 ... AC233755.1 AC240274.1
## rowData names(11): ensembl_gene_id external_gene_name ... detected
##   gene_sparsity
## colnames: NULL
## colData names(19): Sample Barcode ... label cell_type
## reducedDimNames(4): MNN PCA UMAP TSNE
## mainExpName: NULL
## altExpNames(0):

Identify top 500 highly variable genes

dec.tcell.BM1 <- modelGeneVar(sce.tcell.BM1)
top.tcell.BM1 <- getTopHVGs(dec.tcell.BM1, n=500)

We will extract normalized counts for HVG to use in pseudotime alignment

tcell_BM1_counts<-logcounts(sce.tcell.BM1)
tcell_BM1_counts<-t(as.matrix(tcell_BM1_counts[top.tcell.BM1,]))
cellLabels <- sce.tcell.BM1$Barcode
rownames(tcell_BM1_counts)<-cellLabels

tcell_BM1_counts[1:4,1:4]

##                         CCL5   S100A4     NKG7 KLRB1
## AAACCTGCAAAGCGGT-13 0.000000 0.000000 0.000000     0
## AAACCTGGTAATCACC-13 0.000000 3.576974 0.000000     0
## AAACCTGTCCCTAACC-13 4.375961 3.598941 1.160406     0
## AAACCTGTCGCCAAAT-13 0.000000 2.721323 0.000000     0

And finally, we can run pseudotime alignment with destiny

dm_tcell_BM1 <- DiffusionMap(tcell_BM1_counts,n_pcs = 50)

Plot diffusion component 1 vs diffusion component 2 (DC1 vs DC2).

tmp <- data.frame(DC1 = eigenvectors(dm_tcell_BM1)[, 1],
                  DC2 = eigenvectors(dm_tcell_BM1)[, 2])

ggplot(tmp, aes(x = DC1, y = DC2)) +
    geom_point() + 
    xlab("Diffusion component 1") + 
    ylab("Diffusion component 2") +
    theme_classic()

Stash diffusion components to SCE object

sce.tcell.BM1$pseudotime_destiny_1<-eigenvectors(dm_tcell_BM1)[, 1]
sce.tcell.BM1$pseudotime_destiny_2<-eigenvectors(dm_tcell_BM1)[, 2]

1.2 Find temporally expressed genes

After running destiny, an interesting next step may be to find genes that change their expression over the course of time We demonstrate one possible method for this type of analysis on the 500 most variable genes. We will regress each gene on the pseudotime variable we have generated, using a general additive model (GAM). This allows us to detect non-linear patterns in gene expression. We are going to use HVG we identified in the previous step, but this analysis can also be done using the whole transcriptome.

# Only look at the 500 most variable genes when identifying temporally expressesd genes.
# Identify the variable genes by ranking all genes by their variance.
# We will use the first diffusion components as a measure of pseudotime 
Y<-log2(counts(sce.tcell.BM1)+1)
colnames(Y)<-cellLabels
Y<-Y[top.tcell.BM1,]
# Fit GAM for each gene using pseudotime as independent variable.
t <- eigenvectors(dm_tcell_BM1)[, 1]
gam.pval <- apply(Y, 1, function(z){
  d <- data.frame(z=z, t=t)
  tmp <- gam(z ~ lo(t), data=d)
  p <- summary(tmp)[4][[1]][1,5]
  p
})

Select top 30 genes for visualization

# Identify genes with the most significant time-dependent model fit.
topgenes <- names(sort(gam.pval, decreasing = FALSE))[1:30]

Visualize these genes in a heatmap

heatmapdata <- Y[topgenes,]
heatmapdata <- heatmapdata[,order(t, na.last = NA)]
t_ann<-as.data.frame(t)
colnames(t_ann)<-"pseudotime"
pheatmap(heatmapdata, cluster_rows = T, cluster_cols = F, color = plasma(200), show_colnames = F, annotation_col = t_ann)

Visualize how some of the temporally expressed genes change in time

Following individual genes is very helpful for identifying genes that play an important role in the differentiation process. We illustrate the procedure using the GZMA gene. We have added the pseudotime values computed with destiny to the colData slot of the SCE object. Having done that, the full plotting capabilities of the scater package can be used to investigate relationships between gene expression, cell populations and pseudotime.

plotExpression(sce.tcell.BM1, "GZMA", x = "pseudotime_destiny_1", 
               show_violin = TRUE,
               show_smooth = TRUE)

1.3 Extra Materials

Obtain pseudotime for one of the Caron samples.

sce_caron<-readRDS("~/Course_Materials/scRNAseq/Robjects/DataIntegration_uncorrected.rds")

rownames(sce_caron) <- uniquifyFeatureNames(rowData(sce_caron)$ID, names = rowData(sce_caron)$Symbol)

# extract one sample from the sce_caron object
vec.bc <- colData(sce_caron) %>%
    data.frame() %>%
    filter(SampleName == "ETV6-RUNX1_1") %>%
    group_by(SampleName) %>%
    pull(Barcode)

tmpInd <- which(colData(sce_caron)$Barcode %in% vec.bc)
sce.caron.S1 <- sce_caron[,tmpInd]

# Identift HVG
dec.caron.S1 <- modelGeneVar(sce.caron.S1)
top.caron.S1 <- getTopHVGs(dec.caron.S1, n=500)

# extract normalized count data for HVG 
caron.S1_counts<-logcounts(sce.caron.S1)
caron.S1_counts<-t(as.matrix(caron.S1_counts[top.caron.S1,]))
cellLabels <- sce.caron.S1$Barcode
rownames(caron.S1_counts)<-cellLabels

dm_caron.S1 <- DiffusionMap(caron.S1_counts,n_pcs = 50)

tmp <- data.frame(DC1 = eigenvectors(dm_caron.S1)[, 1],
                  DC2 = eigenvectors(dm_caron.S1)[, 2])

ggplot(tmp, aes(x = DC1, y = DC2)) +
    geom_point() + 
    xlab("Diffusion component 1") + 
    ylab("Diffusion component 2") +
    theme_classic()

Y<-log2(counts(sce.caron.S1)+1)
colnames(Y)<-cellLabels
Y<-Y[top.caron.S1,]
# Fit GAM for each gene using pseudotime as independent variable.
t <- eigenvectors(dm_caron.S1)[, 1]
gam.pval <- apply(Y, 1, function(z){
  d <- data.frame(z=z, t=t)
  tmp <- gam(z ~ lo(t), data=d)
  p <- summary(tmp)[4][[1]][1,5]
  p
})

Select top 30 genes for visualization

# Identify genes with the most significant time-dependent model fit.
topgenes <- names(sort(gam.pval, decreasing = FALSE))[1:30]

Visualize these genes in a heatmap

heatmapdata <- Y[topgenes,]
heatmapdata <- heatmapdata[,order(t, na.last = NA)]
t_ann<-as.data.frame(t)
colnames(t_ann)<-"pseudotime"
pheatmap(heatmapdata, cluster_rows = T, cluster_cols = F, color = plasma(200), show_colnames = F, annotation_col = t_ann)

What kind of a dynamic process might be taking place in this cancer cell?

We can quickly check in which pathways these top genes are enriched using MSigDB.

Molecular Signatures Database contains 8 major collections:

H: hallmark gene sets
C1: positional gene sets
C2: curated gene sets
C3: motif gene sets
C4: computational gene sets
C5: GO gene sets
C6: oncogenic signatures
C7: immunologic signatures

We are going to use hallmark gene sets (H) and perform a hypergeometric test with our top 30 genes for all HALLMARK sets.

msigdb_hallmark<-msigdbr(species = "Homo sapiens", category = "H")  %>% dplyr::select(gs_name, gene_symbol)
em<-enricher(topgenes, TERM2GENE=msigdb_hallmark)
head(em)[,"qvalue",drop=F]

2 Ackowledgements

This notebook uses material from SVI course, OSCA Book, Broad Institute Workshop and Hemberg Group Course.

sessionInfo()

## R version 4.1.0 (2021-05-18)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C         LC_TIME=C           
##  [4] LC_COLLATE=C         LC_MONETARY=C        LC_MESSAGES=C       
##  [7] LC_PAPER=C           LC_NAME=C            LC_ADDRESS=C        
## [10] LC_TELEPHONE=C       LC_MEASUREMENT=C     LC_IDENTIFICATION=C 
## 
## attached base packages:
##  [1] splines   parallel  stats4    stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
##  [1] cellAlign_0.1.0             clusterProfiler_4.0.2      
##  [3] msigdbr_7.4.1               viridis_0.6.1              
##  [5] viridisLite_0.4.0           gam_1.20                   
##  [7] foreach_1.5.1               destiny_3.4.0              
##  [9] SingleR_1.6.1               forcats_0.5.1              
## [11] stringr_1.4.0               dplyr_1.0.7                
## [13] purrr_0.3.4                 readr_1.4.0                
## [15] tidyr_1.1.3                 tibble_3.1.2               
## [17] tidyverse_1.3.1             pheatmap_1.0.12            
## [19] cowplot_1.1.1               batchelor_1.8.0            
## [21] scater_1.20.1               ggplot2_3.3.5              
## [23] scran_1.20.1                scuttle_1.2.0              
## [25] SingleCellExperiment_1.14.1 SummarizedExperiment_1.22.0
## [27] Biobase_2.52.0              GenomicRanges_1.44.0       
## [29] GenomeInfoDb_1.28.1         IRanges_2.26.0             
## [31] S4Vectors_0.30.0            BiocGenerics_0.38.0        
## [33] MatrixGenerics_1.4.0        matrixStats_0.59.0         
## [35] knitr_1.33                 
## 
## loaded via a namespace (and not attached):
##   [1] ggthemes_4.2.4            bit64_4.0.5              
##   [3] irlba_2.3.3               DelayedArray_0.18.0      
##   [5] data.table_1.14.0         KEGGREST_1.32.0          
##   [7] RCurl_1.98-1.3            generics_0.1.0           
##   [9] ScaledMatrix_1.0.0        RSQLite_2.2.7            
##  [11] shadowtext_0.0.8          proxy_0.4-26             
##  [13] bit_4.0.4                 enrichplot_1.12.2        
##  [15] xml2_1.3.2                lubridate_1.7.10         
##  [17] assertthat_0.2.1          xfun_0.24                
##  [19] hms_1.1.0                 jquerylib_0.1.4          
##  [21] babelgene_21.4            evaluate_0.14            
##  [23] DEoptimR_1.0-9            fansi_0.5.0              
##  [25] dbplyr_2.1.1              readxl_1.3.1             
##  [27] igraph_1.2.6              DBI_1.1.1                
##  [29] ellipsis_0.3.2            RSpectra_0.16-0          
##  [31] backports_1.2.1           sparseMatrixStats_1.4.0  
##  [33] vctrs_0.3.8               TTR_0.24.2               
##  [35] abind_1.4-5               cachem_1.0.5             
##  [37] RcppEigen_0.3.3.9.1       withr_2.4.2              
##  [39] ggforce_0.3.3             robustbase_0.93-8        
##  [41] vcd_1.4-8                 treeio_1.16.1            
##  [43] xts_0.12.1                cluster_2.1.2            
##  [45] DOSE_3.18.1               lazyeval_0.2.2           
##  [47] ape_5.5                   laeken_0.5.1             
##  [49] crayon_1.4.1              labeling_0.4.2           
##  [51] edgeR_3.34.0              pkgconfig_2.0.3          
##  [53] tweenr_1.0.2              nlme_3.1-152             
##  [55] vipor_0.4.5               nnet_7.3-16              
##  [57] rlang_0.4.11              lifecycle_1.0.0          
##  [59] downloader_0.4            modelr_0.1.8             
##  [61] rsvd_1.0.5                cellranger_1.1.0         
##  [63] polyclip_1.10-0           RcppHNSW_0.3.0           
##  [65] lmtest_0.9-38             Matrix_1.3-4             
##  [67] aplot_0.0.6               carData_3.0-4            
##  [69] boot_1.3-28               zoo_1.8-9                
##  [71] reprex_2.0.0              beeswarm_0.4.0           
##  [73] png_0.1-7                 knn.covertree_1.0        
##  [75] bitops_1.0-7              Biostrings_2.60.1        
##  [77] blob_1.2.1                DelayedMatrixStats_1.14.0
##  [79] qvalue_2.24.0             beachmat_2.8.0           
##  [81] scales_1.1.1              memoise_2.0.0            
##  [83] magrittr_2.0.1            plyr_1.8.6               
##  [85] hexbin_1.28.2             zlibbioc_1.38.0          
##  [87] scatterpie_0.1.6          compiler_4.1.0           
##  [89] dqrng_0.3.0               RColorBrewer_1.1-2       
##  [91] pcaMethods_1.84.0         dtw_1.22-3               
##  [93] cli_3.0.1                 XVector_0.32.0           
##  [95] patchwork_1.1.1           mgcv_1.8-36              
##  [97] ggplot.multistats_1.0.0   MASS_7.3-54              
##  [99] tidyselect_1.1.1          stringi_1.7.3            
## [101] highr_0.9                 yaml_2.2.1               
## [103] GOSemSim_2.18.0           BiocSingular_1.8.1       
## [105] locfit_1.5-9.4            ggrepel_0.9.1            
## [107] grid_4.1.0                sass_0.4.0               
## [109] fastmatch_1.1-0           tools_4.1.0              
## [111] rio_0.5.27                rstudioapi_0.13          
## [113] bluster_1.2.1             foreign_0.8-81           
## [115] metapod_1.0.0             gridExtra_2.3            
## [117] smoother_1.1              scatterplot3d_0.3-41     
## [119] farver_2.1.0              ggraph_2.0.5             
## [121] BiocManager_1.30.16       rvcheck_0.1.8            
## [123] digest_0.6.27             Rcpp_1.0.7               
## [125] car_3.0-11                broom_0.7.8              
## [127] httr_1.4.2                AnnotationDbi_1.54.1     
## [129] colorspace_2.0-2          rvest_1.0.0              
## [131] fs_1.5.0                  ranger_0.13.1            
## [133] statmod_1.4.36            tidytree_0.3.4           
## [135] graphlayouts_0.7.1        sp_1.4-5                 
## [137] jsonlite_1.7.2            ggtree_3.0.2             
## [139] tidygraph_1.2.0           R6_2.5.0                 
## [141] pillar_1.6.1              htmltools_0.5.1.1        
## [143] glue_1.4.2                fastmap_1.1.0            
## [145] VIM_6.1.0                 BiocParallel_1.26.1      
## [147] BiocNeighbors_1.10.0      class_7.3-19             
## [149] codetools_0.2-18          fgsea_1.18.0             
## [151] utf8_1.2.1                lattice_0.20-44          
## [153] bslib_0.2.5.1             ResidualMatrix_1.2.0     
## [155] curl_4.3.2                ggbeeswarm_0.6.0         
## [157] zip_2.2.0                 GO.db_3.13.0             
## [159] openxlsx_4.2.4            limma_3.48.1             
## [161] rmarkdown_2.9             munsell_0.5.0            
## [163] e1071_1.7-7               DO.db_2.9                
## [165] GenomeInfoDbData_1.2.6    iterators_1.0.13         
## [167] haven_2.4.1               reshape2_1.4.4           
## [169] gtable_0.3.0

CRUK CI Summer School 2021

Pseudotime Analysis

Zeynep Kalender-Atak, Stephane Ballereau

1 Setting up the data

1.1 Trajectory inference with destiny

1.2 Find temporally expressed genes

1.3 Extra Materials

2 Ackowledgements