1 Identifying cell types

1.1 Introduction

In part 1 we showed how to pre-process some example scRNA-seq datasets using Seurat. In part 2 we will use a different subset of the data from the Caron et al. paper to show how to go about exploring the data and answering biological questions.

The datasets here are the four EVT6_RUNX1 cALL samples and the three healthy donor PBMMC samples. We will use these datasets to identify cell types, and then identify genes which vary in expression between patients in particular cell types. But first, because we are analysing a different subset of the data, we need to do the QC and batch correction for these datasets.

# load packages
library(Seurat)
library(sctransform)
library(tidyr)
library(dplyr)

# set seed for randomisation (e.g. UMAP dimension reduction)
set.seed = 123

1.2 Read in datasets

Read in a list of samples and CellRanger results locations. This is a convenient way to manage reading in a list of samples without having to list them explicitly in R. You can change the list and not have to change your code.

message(getwd())
## /mnt/c/Users/hugot/Documents/BTF/UnivCambridge_ScRnaSeqIntro_Base/course_files
samples_df <- read.csv("Data/samples_to_downsample.csv", header=TRUE)
samples <- setNames(samples_df[,2], samples_df[,1])

Below, we have written a function in R. This function takes care of several important steps. It allows us to avoid explicitly calling each processing step for each sample in the code. For each row in the data frame of samples the function will:

  1. Read in the CellRanger results
  2. Make a Seurat object
  3. Rename the cells to make them unique across samples
  4. Filter the cells using some preset cutoffs
  5. Subsample the data (to 800 cells per sample) so that our processing steps run more quickly later on. Note: you should not do this last step in your own analysis (we are only doing it to make things run faster for demonstration purposes).

First we define the function, and then we call it for each row in the sample data frame using apply(). We end up with a list of Seurat objects.

# Define function
get_cell_sample <- function(n, cells=800){
  names(n) <- NULL
  print(n[1])
  # Read in 10X results
  data <- Read10X(data.dir = n[2])
  # Create a Seurat object
  so <- CreateSeuratObject(counts = data, project=n[1])
  # Rename the cells
  so <- RenameCells(so, add.cell.id = n[1])
  # Identify percentage of mitochondrial reads in each cell
  so[["percent.mt"]] <- PercentageFeatureSet(so, pattern = "^MT-")
  # filter the cells 
  so <- subset(so, 
               subset = nFeature_RNA > 500 & nFeature_RNA < 5000 & percent.mt < 10)
  # Subsample the filtered cells to reduce dataset size 
  # Do not subsample YOUR data like this - for demonstration purposes only!
  so800 <- so[, sample(colnames(so), size = cells, replace=F)]
}

# Call sampling function
subsample_list <- apply(samples_df, 1, get_cell_sample)
## [1] "ETV6_RUNX1_1"
## [1] "ETV6_RUNX1_2"
## [1] "ETV6_RUNX1_3"
## [1] "ETV6_RUNX1_4"
## [1] "PBMMC_1"
## [1] "PBMMC_2"
## [1] "PBMMC_3"
# Rename list of objects
names(subsample_list) <- samples_df$name

We then merge the datasets without batch correction and normalise the counts using scTransform:

all_merge <- merge(subsample_list[[1]],
                   y = subsample_list[2:7],
                   add.cell.ids = c("E_1", "E_2", "E_3", 
                                    "E_4", "PBMMC_1", "PBMMC_2", 
                                    "PBMMC_3"), 
                   project = "PBMMC")

# Normalise the data, do linear dimension reduction
all_merge <- SCTransform(all_merge, vars.to.regress = "percent.mt", variable.features.n = 3000)
all_merge <- RunPCA(all_merge, npcs = 30, verbose = FALSE)

We check the PC variance plot:

ElbowPlot(all_merge)

18 PCs looks like a reasonable cut off (we could go even higher, to make sure we retain as much biological variance as possible).

all_merge <- RunUMAP(all_merge, reduction = "pca", dims = 1:18)

DimPlot(all_merge, reduction = "umap", group.by="orig.ident")

The PBMMC datasets cluster reasonably well, but appear to have moderate batch effects. However each ETV6_RUNX1 dataset is largely disconnected from each other dataset. If this is really how things are, then we have no way to differentiate batch effects from real biological differences.

However, it is reasonable to assume that ETV6_RUNX1 should share cell types/states with the other datasets. Therefore, we will correct the data for batch effects, assuming that there are a decent number of cells in each dataset which are similar and can be used to anchor the datasets to each other.

1.3 Batch correction

When Seurat integrates the data it does so in a pairwise fashion. It looks for two similar datasets, and integrates these, then combines the result of that integration with another dataset. Because the uncorrected datasets are quite different from each other, we will specify the order in which this pairwise integration should happen. Specifically we will merge the “ETV6_RUNX1” datasets with each other, merge the PBMMC datasets with each other and then merge the two sample types at the end. This is specified with the sample.tree argument to the IntegrateData() function. If you want to understand what the apprently random string of numbers means, look at the function’s documentation.

N.b. this takes a while to run!

# Normalise each sample
call_list <- lapply(X = subsample_list, FUN = function(x) {
  x <- SCTransform(x,  verbose = TRUE, variable.features.n = 1000)
})

# Select features for integration
call_features <- SelectIntegrationFeatures(object.list = call_list, 
                                           nfeatures=1000)

# Perform integration - this step takes a while!!!
call_anchors <- FindIntegrationAnchors(object.list = call_list, 
                                       anchor.features = call_features)

# this is for all PBMMC and ETV6 samples
call_int <- IntegrateData(anchorset = call_anchors, 
                          normalization.method = "SCT", 
                          sample.tree = matrix(c(-1, 1, 2, -5, 4, 3, -2, -3, -4, -6, -7, 5), ncol = 2))

call_int

We now normalise the integrated data and run our dimensionality reduction on the integrated values. Note that the default assay has been changed to the “integrated” (batch-corrected) matrix. We want to make sure to be using this assay for this analysis.

# check default assay
DefaultAssay(call_int)
## [1] "integrated"
# scale the integrated datasets to normalise them
call_int <- ScaleData(call_int, verbose = FALSE)

# run PCA
call_int <- RunPCA(call_int, npcs = 20, verbose = FALSE)

# check the elbow plot again
ElbowPlot(call_int)

Finally, we run our UMAP projection again:

# UMAP
call_int <- RunUMAP(call_int, reduction = "pca", dims = 1:10, n.neighbors=10)
## 16:05:39 UMAP embedding parameters a = 0.9922 b = 1.112
## 16:05:39 Read 5600 rows and found 10 numeric columns
## 16:05:39 Using Annoy for neighbor search, n_neighbors = 10
## 16:05:39 Building Annoy index with metric = cosine, n_trees = 50
## 0%   10   20   30   40   50   60   70   80   90   100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 16:05:39 Writing NN index file to temp file /tmp/Rtmpn2a5I8/file52b514875b42
## 16:05:39 Searching Annoy index using 1 thread, search_k = 1000
## 16:05:40 Annoy recall = 100%
## 16:05:40 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 10
## 16:05:41 Initializing from normalized Laplacian + noise (using irlba)
## 16:05:41 Commencing optimization for 500 epochs, with 74440 positive edges
## 16:05:43 Optimization finished
DimPlot(call_int, reduction = "umap", group.by="orig.ident")

The overlap between datasets is now very good. We have a large cluster containing all datasets and the individual “EVT6_RUNX1” datasets are well integrated with the PBMMC cells.

1.4 Identifying cell types

A key part of any single-cell RNA-seq analysis is determining what cell types each cluster represents. Here we will identify clusters, then use markers for cell types we are expecting to determine what cell types are in each cluster. Finally we will use markers derived from the data to identify clusters we are not sure about.

Let’s call clusters in the data:

# make sure we are still using the integrated assay
DefaultAssay(call_int) <- 'integrated'

# build the nearest-neighbour graph and do clustering on it
call_int <- FindNeighbors(call_int, dims = 1:10)
## Computing nearest neighbor graph
## Computing SNN
call_int <- FindClusters(call_int, resolution = 0.3, algorithm=2)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 5600
## Number of edges: 195740
## 
## Running Louvain algorithm with multilevel refinement...
## Maximum modularity in 10 random starts: 0.9194
## Number of communities: 9
## Elapsed time: 0 seconds
# visualise it
DimPlot(call_int, reduction = "umap", label = TRUE)

We have our clusters, now let’s look at where our markers of interest are expressed. Based on published data, we know that:

  • B cells express CD79A
  • CD20+ B cells express MS4A1
  • Monocytes express CST3
  • Immature hematopoietic cells express SPN
  • Red blood cells express HBA1
  • T cells express CD3D

The batch-corrected read counts are not suitable for looking at gene expression. Therefore we will use the SCT-normalised data when examining marker expression.

# change default assay
DefaultAssay(call_int) <- "SCT"

# Make a vector of gene names for the markers we know about
# Let's pretend we don't know about NKT cell markers here
features <- c("CD79A", "MS4A1", "CST3", "SPN", "HBA1", "CD3D")

# Plot the expression values of each markers on the UMAP
FeaturePlot(call_int, features = features, pt.size = 0.1, label = TRUE)

# Draw violin plots of the distribution of expression values
# for each marker in each cluster 
VlnPlot(call_int, features = features)

We can see that clusters 0, 1, 3, 5 and 7 all express CD79A and are therefore likely B cells. Cluster 3 represents the CD20+ B cells because these cells also express MS4A1. Cluster 1 and 7 also express SPN and so are likely to be immature hematopoietic cells, presumably immature B cells. Cluster 2 and 8 express CD3D and so we can label these as T cells. Cluster 6 expresses CST3 and so is probably monocytes.

1.5 Cluster marker genes

Cluster 8 looks to comprise T cells, but what makes them distinct from the other cluster of T cells, cluster 2? Let’s determine markers of cluster 8, by looking for genes which are more highly expressed in this cluster than other clusters.

# When working with merged datasets, first we need to run this
call_int <- PrepSCTFindMarkers(call_int)
## Found 7 SCT models. Recorrecting SCT counts using minimum median counts: 2013
# now find markers for cluster 8
c8_markers <- FindMarkers(call_int, ident.1 = 8)

head(c8_markers)
##       p_val avg_log2FC pct.1 pct.2 p_val_adj
## GNLY      0   2.648152 0.476 0.009         0
## CTSW      0   1.112248 0.601 0.022         0
## KLRD1     0   1.084170 0.519 0.008         0
## CCL5      0   2.858279 0.841 0.020         0
## NKG7      0   2.532095 0.841 0.042         0
## CST7      0   1.298939 0.692 0.008         0

Now we will plot expression for first few markers on the UMAP plot to check that they really are good markers for cluster 8.

# look at the top 6 of these markers
FeaturePlot(call_int, 
            reduction = "umap",
            features = rownames(head(c8_markers)), 
            pt.size = 0.1)

They look good and very specific to this cluster.

Go to the Human Protein Atlas to find out what sort of cells express these genes. Put the name of one of the genes in the search bar, click “Search”, then select the “single cell” data column next to the gene of interest.

When there are a lot of clusters that you are not sure about, it is helpful to show a couple of markers for each cluster, like this:

# find specific markers for all clusters
call_markers_all <- FindAllMarkers(call_int, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)

# get the top 2 hits for each cluster
call_markers_all %>%
  group_by(cluster) %>%
  slice_max(n = 2, order_by = avg_log2FC)

You may have noticed that the monocyte marker CST3 was expressed in some cells from the B cell clusters. Which samples was this in?

# UMAP for ETV6 samples only
FeaturePlot(subset(call_int, orig.ident %in% c("ETV6_RUNX1_1", "ETV6_RUNX1_2", "ETV6_RUNX1_3", "ETV6_RUNX1_4")), 
            features=c('CST3', 'CD79A', 'HBA1', 'SPN'), reduction="umap")

# UMAP for PBMMC samples only
FeaturePlot(subset(call_int, orig.ident %in% c("PBMMC_1", "PBMMC_2", "PBMMC_3")), 
            features=c('CST3', 'CD79A', 'HBA1', 'SPN'), reduction="umap")

The ETV6_RUNX1 cALL samples do not have cells in the B cell cluster which express much CST3. PBMMCs seem to have a separate immature (SPN+) population within the larger B cell cluster, which does not express the canonical B cell marker CD79A and which does express the monocyte marker CST3. This warrants further investigation, which we unfortunately don’t have time for here. You could try doing a web search for “B cells expressing CST3” and see if this sort of thing has been observed before.

There seems to be some HBA1 expressed across several cell clusters in some samples. HBA1 is a common component of the “soup” or ambient RNA in scRNA-seq experiments involving blood cells. Hemoglobin chains, such as HBA1, are very highly expressed in red blood cells and burst red blood cells will contribute their RNA to the medium from which the cells are loaded into the 10X Chromium machine. Some of this medium, or “soup”, containing ambient RNA is captured with each cell. There are methods available to correct for soup in your analysis such as SoupX.

Now we will add more meaningful cluster labels to make our analysis more pleasingly biological

# Define labels
new_cluster_ids <- c("B1", "B2", "T cells", "CD20 B", "Erythrocytes",
                     "B3", "Monocytes", "B4", "NKT")
names(new_cluster_ids) <- levels(call_int)

# Add labels to the Seurat object
call_int <- RenameIdents(call_int, new_cluster_ids)

# UMAP plot with new labels
DimPlot(call_int, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()

Now let’s look at how each cell type is distributed across the different samples.

table(Cluster = call_int@active.ident, Batch = call_int$orig.ident)
##               Batch
## Cluster        ETV6_RUNX1_1 ETV6_RUNX1_2 ETV6_RUNX1_3 ETV6_RUNX1_4 PBMMC_1
##   B1                    340          382          186          279     249
##   B2                    300          266          123          176      89
##   T cells                 3           17          293           42     116
##   CD20 B                  7           23           85           30      65
##   Erythrocytes            1            0           32          103      26
##   B3                     87           65           39          103      72
##   Monocytes               1            3            7            5     102
##   B4                     57           37            9           50      48
##   NKT                     4            7           26           12      33
##               Batch
## Cluster        PBMMC_2 PBMMC_3
##   B1                47     155
##   B2                53      62
##   T cells          235     215
##   CD20 B            91     133
##   Erythrocytes     236      33
##   B3                12      43
##   Monocytes         43      82
##   B4                 5      29
##   NKT               78      48

We can see that some cell types are barely present in some sample types. Monocytes are largely from the healthy PBMMC samples. Other cell, such as B cells of each type are well represented in all samples. Others still are variable within sample types. Erythrocytes are much more common in some PBMMC samples than others, suggesting variation in preparation.

1.6 Differentially expressed genes between samples

Now that we have our annotated cell clusters, we can ask questions about differences between samples across the cell types we identified.

To run differential expression, we make use of “corrected counts” that are stored in the data slot of the the SCT assay. Corrected counts are obtained by setting the sequencing depth for all the cells to a fixed value and reversing the learned regularized negative-binomial regression model.

First we make a cell_labels meta data column of the annotation, which will be useful to use for the differential expression:

# add cell labels to the metadata column
call_int[['cell_labels']] <- call_int@active.ident

Now we can run the FindMarkers() function, but this time using the “SCT” assay:

ERTV6_1_vs_2.markers <- FindMarkers(call_int, 
                                    assay = "SCT", 
                                    ident.1 = "ETV6_RUNX1_1", 
                                    ident.2 = "ETV6_RUNX1_2", 
                                    group.by="orig.ident", 
                                    min.pct = 0.5)

head(ERTV6_1_vs_2.markers)
##                  p_val avg_log2FC pct.1 pct.2     p_val_adj
## RPS24    5.985947e-192   1.376144 0.986 0.905 1.136133e-187
## FOS      7.936944e-166  -2.072794 0.364 0.889 1.506432e-161
## RPS15A   2.158157e-163   1.129525 0.993 0.991 4.096181e-159
## RPS12    1.816179e-145   1.179158 0.988 0.936 3.447107e-141
## HLA-DRB1 3.761299e-145  -1.292972 0.616 0.931 7.138945e-141
## RPL13    2.017376e-141   1.018534 0.993 0.981 3.828980e-137

Amongst these genes, which are variable between two individuals we see ribosomal genes (with names starting RPS or RPL) and FOS as mentioned in the original Caron et al. paper.

Now set the default idents back to sample name, rather than cell type labels so that we can use them to plot features for particular samples within the data:

call_int <- SetIdent(call_int, value = 'orig.ident')
f1 <- FeaturePlot(subset(call_int, idents= c("ETV6_RUNX1_1")), features = c("FOS"))
f2 <- FeaturePlot(subset(call_int, idents= c("ETV6_RUNX1_2")), features = c("FOS"))
f1 + f2

We see higher expression of FOS in ETV6_RUNX1_2 across many cells.