In this workbook, we are starting our analysis with normalized HCA data and perform integration, clustering and dimensionality reduction. Our aim is to extract T-cells from this dataset and proceed with pseudotime analysis in the next workbook.
These are the libraries we will need in this workbook
library(SingleCellExperiment)
library(scran)
library(scater)
library(batchelor)
library(cowplot)
library(pheatmap)
library(tidyverse)
library(SingleR)
library(destiny)
library(gam)
library(viridis)
library(msigdbr)
library(clusterProfiler)
We are going to work with HCA data. This data set has been pre-processed and normalized before.
sce<-readRDS(file="~/Course_Materials/scRNAseq/pseudotime/hca_sce.bone.RDS")
We use symbols in place of ENSEMBL IDs for easier interpretation later.
#rowData(sce)$Chr <- mapIds(EnsDb.Hsapiens.v86, keys=rownames(sce), column="SEQNAME", keytype="GENEID")
#rownames(sce) <- uniquifyFeatureNames(rowData(sce)$ensembl_gene_id, names = rowData(sce)$Symbol)
We block on the donor of origin to mitigate batch effects during highly variable gene (HVG) selection. We select a larger number of HVGs to capture any batch-specific variation that might be present.
#dec.hca <- modelGeneVar(sce, block=sce$Sample.Name)
#top.hca <- getTopHVGs(dec.hca, n=5000)
The batchelor
package provides an implementation of the MNN approach via the fastMNN() function. We apply it to our HCA data to remove the donor specific effects across the highly variable genes in top.hca
. To reduce computational work and technical noise, all cells in all samples are projected into the low-dimensional space defined by the top d principal components. Identification of MNNs and calculation of correction vectors are then performed in this low-dimensional space. The corrected matrix in the reducedDims() contains the low-dimensional corrected coordinates for all cells, which we will use in place of the PCs in our downstream analyses. We store it in ‘MNN’ slot in the main sce object.
#set.seed(1010001)
#merged.hca <- fastMNN(sce, batch = sce$Sample.Name, subset.row = top.hca)
#reducedDim(sce, 'MNN') <- reducedDim(merged.hca, 'corrected')
We cluster on the low-dimensional corrected coordinates to obtain a partitioning of the cells that serves as a proxy for the population structure. If the batch effect is successfully corrected, clusters corresponding to shared cell types or states should contain cells from multiple samples. We see that all clusters contain contributions from each sample after correction.
#set.seed(01010100)
#sce <- runUMAP(sce, dimred="MNN")
#sce <- runTSNE(sce, dimred="MNN")
plotPCA(sce, colour_by="Sample.Name") + ggtitle("PCA")
plotUMAP(sce, colour_by="Sample.Name") + ggtitle("UMAP")
plotTSNE(sce, colour_by="Sample.Name") + ggtitle("tSNE")
Graph-based clustering generates an excessively large intermediate graph so we will instead use a two-step approach with k-means. We generate 1000 small clusters that are subsequently aggregated into more interpretable groups with a graph-based method.
#set.seed(1000)
#clust.hca <- clusterSNNGraph(sce, use.dimred="MNN",
# use.kmeans=TRUE, kmeans.centers=1000)
#colLabels(sce) <- factor(clust.hca)
table(colLabels(sce))
1 2 3 4 5 6 7 8 9 10 11 12
4790 6056 2036 2422 2600 2435 11586 707 4684 936 1191 557
plotPCA(sce, colour_by="label") + ggtitle("PCA")
plotUMAP(sce, colour_by="label") + ggtitle("UMAP")
plotTSNE(sce, colour_by="label") + ggtitle("tSNE")
We perform automated cell type classification using a reference dataset to annotate each cluster based on its pseudo-bulk profile. This is for a quick assignment of cluster identity. We are going to use Human Primary Cell Atlas (HPCA) data for that. HumanPrimaryCellAtlasData
function provides normalized expression values for 713 microarray samples from HPCA (Mabbott et al., 2013). These 713 samples were processed and normalized as described in Aran, Looney and Liu et al. (2019). Each sample has been assigned to one of 37 main cell types and 157 subtypes.
se.aggregated <- sumCountsAcrossCells(sce, id=colLabels(sce))
hpc <- HumanPrimaryCellAtlasData()
anno.hca <- SingleR(se.aggregated, ref = hpc, labels = hpc$label.main, assay.type.test="sum")
anno.hca
DataFrame with 12 rows and 5 columns
scores first.labels tuning.scores labels pruned.labels
<matrix> <character> <DataFrame> <character> <character>
1 0.281120:0.580521:0.490323:... NK_cell 0.595425:0.426528 T_cells T_cells
2 0.269576:0.720264:0.502810:... Pre-B_cell_CD34- 0.390449:0.180455 Monocyte Monocyte
3 0.269360:0.605466:0.465618:... B_cell 0.648661:0.257872 B_cell B_cell
4 0.383672:0.647291:0.789261:... MEP 0.360240:0.346833 MEP MEP
5 0.284526:0.591911:0.468152:... B_cell 0.762421:0.658735 B_cell B_cell
... ... ... ... ... ...
8 0.256000:0.645332:0.441369:... Monocyte 0.582277: 0.257172 Monocyte Monocyte
9 0.290154:0.579078:0.499469:... T_cells 0.719841:-0.114850 T_cells T_cells
10 0.381632:0.670836:0.597980:... Pro-B_cell_CD34+ 0.834774: 0.745602 Pro-B_cell_CD34+ Pro-B_cell_CD34+
11 0.291531:0.592682:0.506481:... NK_cell 0.810637: 0.729778 NK_cell NK_cell
12 0.331348:0.596984:0.525463:... B_cell 0.468819: 0.284846 B_cell B_cell
tab <- table(anno.hca$labels, colnames(se.aggregated))
# Adding a pseudo-count of 10 to avoid strong color jumps with just 1 cell.
pheatmap(log10(tab+10))
sce$cell_type<-recode(sce$label, "1" = "T_cells",
"2" = "Monocyte",
"3"="B_cell",
"4"="MEP",
"5"="B_cell",
"6"="CMP",
"7"="T_cells",
"8"="Monocyte",
"9"="T_cells",
"10"="Pro-B_cell_CD34+",
"11"="NK_cell",
"12"="B_cell")
We can now use the predicted cell types to color PCA, UMAP and tSNE.
plotPCA(sce, colour_by="cell_type", text_by="cell_type") + ggtitle("PCA")
plotUMAP(sce, colour_by="cell_type", text_by="cell_type") + ggtitle("UMAP")
plotTSNE(sce, colour_by="cell_type", text_by="cell_type") + ggtitle("tSNE")
We can also check expression of some marker genes.
CD3D and TRAC are used as marker genes for T-cells Szabo et al. 2019.
plotExpression(sce, features=c("CD3D"), x="label", colour_by="cell_type")
plotExpression(sce, features=c("TRAC"), x="label", colour_by="cell_type")
We will now extract T-cells and store in a new SCE object to use in pseudotime analysis.
Pull barcodes for T-cells
tcell.bc <- colData(sce) %>%
data.frame() %>%
group_by(cell_type) %>%
dplyr::filter(cell_type == "T_cells") %>%
pull(Barcode)
table(colData(sce)$Barcode %in% tcell.bc)
FALSE TRUE
18940 21060
Create a new SingleCellExperiment object for T-cells
tmpInd <- which(colData(sce)$Barcode %in% tcell.bc)
sce.tcell <- sce[,tmpInd]
saveRDS(sce.tcell,"~/Course_Materials/scRNAseq/pseudotime/sce.tcell.RDS")
This notebook uses material from SVI course, OSCA Book, Broad Institute Workshop and Hemberg Group Course.
sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.4 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=C LC_COLLATE=C LC_MONETARY=C LC_MESSAGES=C
[7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=C LC_IDENTIFICATION=C
attached base packages:
[1] splines parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] scRNAseq_2.2.0 knitr_1.29 cellAlign_0.1.0 clusterProfiler_3.16.0
[5] msigdbr_7.1.1 viridis_0.5.1 viridisLite_0.3.0 gam_1.20
[9] foreach_1.5.0 destiny_3.2.0 SingleR_1.2.4 forcats_0.5.0
[13] stringr_1.4.0 dplyr_1.0.0 purrr_0.3.4 readr_1.3.1
[17] tidyr_1.1.0 tibble_3.0.3 tidyverse_1.3.0 pheatmap_1.0.12
[21] cowplot_1.0.0 batchelor_1.4.0 scater_1.16.2 ggplot2_3.3.2
[25] scran_1.16.0 SingleCellExperiment_1.10.1 SummarizedExperiment_1.18.2 DelayedArray_0.14.1
[29] matrixStats_0.56.0 Biobase_2.48.0 GenomicRanges_1.40.0 GenomeInfoDb_1.24.2
[33] IRanges_2.22.2 S4Vectors_0.26.1 BiocGenerics_0.34.0
loaded via a namespace (and not attached):
[1] rappdirs_0.3.1 ggthemes_4.2.0 bit64_0.9-7.1 irlba_2.3.3
[5] data.table_1.12.8 RCurl_1.98-1.2 generics_0.0.2 RSQLite_2.2.0
[9] europepmc_0.4 proxy_0.4-24 bit_1.1-15.2 enrichplot_1.8.1
[13] xml2_1.3.2 lubridate_1.7.9 httpuv_1.5.4 assertthat_0.2.1
[17] xfun_0.15 hms_0.5.3 evaluate_0.14 promises_1.1.1
[21] DEoptimR_1.0-8 fansi_0.4.1 progress_1.2.2 dbplyr_1.4.4
[25] readxl_1.3.1 igraph_1.2.5 DBI_1.1.0 ellipsis_0.3.1
[29] RSpectra_0.16-0 backports_1.1.8 bookdown_0.20 vctrs_0.3.2
[33] TTR_0.23-6 abind_1.4-5 RcppEigen_0.3.3.7.0 withr_2.2.0
[37] ggforce_0.3.2 packrat_0.5.0 triebeard_0.3.0 robustbase_0.93-6
[41] vcd_1.4-7 xts_0.12-0 prettyunits_1.1.1 DOSE_3.14.0
[45] ExperimentHub_1.14.0 laeken_0.5.1 crayon_1.3.4 labeling_0.3
[49] edgeR_3.30.3 pkgconfig_2.0.3 tweenr_1.0.1 vipor_0.4.5
[53] nnet_7.3-14 rlang_0.4.7 lifecycle_0.2.0 downloader_0.4
[57] BiocFileCache_1.12.0 modelr_0.1.8 rsvd_1.0.3 AnnotationHub_2.20.0
[61] cellranger_1.1.0 polyclip_1.10-0 RcppHNSW_0.2.0 lmtest_0.9-37
[65] Matrix_1.2-18 urltools_1.7.3 carData_3.0-4 boot_1.3-25
[69] zoo_1.8-8 reprex_0.3.0 base64enc_0.1-3 beeswarm_0.2.3
[73] ggridges_0.5.2 bitops_1.0-6 blob_1.2.1 DelayedMatrixStats_1.10.1
[77] qvalue_2.20.0 gridGraphics_0.5-0 scales_1.1.1 memoise_1.1.0
[81] magrittr_1.5 plyr_1.8.6 hexbin_1.28.1 zlibbioc_1.34.0
[85] compiler_4.0.2 scatterpie_0.1.4 dqrng_0.2.1 RColorBrewer_1.1-2
[89] pcaMethods_1.80.0 cli_2.0.2 dtw_1.21-3 XVector_0.28.0
[93] ggplot.multistats_1.0.0 MASS_7.3-51.6 tidyselect_1.1.0 stringi_1.4.6
[97] yaml_2.2.1 GOSemSim_2.14.0 BiocSingular_1.4.0 locfit_1.5-9.4
[101] ggrepel_0.8.2 grid_4.0.2 fastmatch_1.1-0 tools_4.0.2
[105] rio_0.5.16 rstudioapi_0.11 foreign_0.8-80 gridExtra_2.3
[109] smoother_1.1 scatterplot3d_0.3-41 farver_2.0.3 ggraph_2.0.3
[113] digest_0.6.25 rvcheck_0.1.8 BiocManager_1.30.10 shiny_1.5.0
[117] Rcpp_1.0.5 car_3.0-8 broom_0.7.0 BiocVersion_3.11.1
[121] later_1.1.0.1 httr_1.4.1 AnnotationDbi_1.50.1 colorspace_1.4-1
[125] rvest_0.3.5 fs_1.4.2 ranger_0.12.1 statmod_1.4.34
[129] graphlayouts_0.7.0 sp_1.4-2 ggplotify_0.0.5 xtable_1.8-4
[133] jsonlite_1.7.0 tidygraph_1.2.0 R6_2.4.1 pillar_1.4.6
[137] htmltools_0.5.0 mime_0.9 glue_1.4.1 fastmap_1.0.1
[141] VIM_6.0.0 BiocParallel_1.22.0 BiocNeighbors_1.6.0 class_7.3-17
[145] interactiveDisplayBase_1.26.3 codetools_0.2-16 fgsea_1.14.0 lattice_0.20-41
[149] curl_4.3 ggbeeswarm_0.6.0 zip_2.0.4 GO.db_3.11.4
[153] openxlsx_4.1.5 limma_3.44.3 rmarkdown_2.3 munsell_0.5.0
[157] e1071_1.7-3 DO.db_2.9 GenomeInfoDbData_1.2.3 iterators_1.0.12
[161] haven_2.3.1 reshape2_1.4.4 gtable_0.3.0