In this section we will begin the process of analysing the RNAseq data in R. In the next section we will use DESeq2 for differential analysis. A detailed analysis workflow, recommended by the authors of DESeq2 can be found on the Bionconductor website.
We first need to:
First, let’s load all the packages we will need to analyse the data.
library(DESeq2)
library(tidyverse)
The data for this tutorial comes from a Nature Cell Biology paper, EGF-mediated induction of Mcl-1 at the switch to lactation is essential for alveolar cell survival (Fu et al. 2015). The raw data (sequence reads) can be downloaded from the ‘Sequence Read Archive’ (SRA) under SRP045534, and processed data (counts) can be downloaded from Gene Expression Omnibus database (GEO) under accession number GSE60450. Please see extended material for instructions on downloading raw files from SRA and aligning fastq using HISAT2.
This study examines the expression profiles of basal stem-cell enriched cells (B) and committed luminal cells (L) in the mammary gland of virgin, pregnant and lactating mice. Six groups are present, with one for each combination of cell type and mouse status. Each group contains two biological replicates.
The SampleInfo.txt
file contains basic information about the samples that we will need for the analysis today: name, cell type, status.
# Read the sample information into a data frame
sampleinfo <- read_tsv("data/SampleInfo.txt")
sampleinfo
## # A tibble: 12 x 4
## FileName Sample CellType Status
## <chr> <chr> <chr> <chr>
## 1 MCL1.DG.bam MCL1.DG luminal virgin
## 2 MCL1.DH.bam MCL1.DH basal virgin
## 3 MCL1.DI.bam MCL1.DI basal pregnant
## 4 MCL1.DJ.bam MCL1.DJ basal pregnant
## 5 MCL1.DK.bam MCL1.DK basal lactate
## 6 MCL1.DL.bam MCL1.DL basal lactate
## 7 MCL1.LA.bam MCL1.LA basal virgin
## 8 MCL1.LB.bam MCL1.LB luminal virgin
## 9 MCL1.LC.bam MCL1.LC luminal pregnant
## 10 MCL1.LD.bam MCL1.LD luminal pregnant
## 11 MCL1.LE.bam MCL1.LE luminal lactate
## 12 MCL1.LF.bam MCL1.LF luminal lactate
The raw reads were aligned using HISAT2 (Kim, Langmead, and Salzberg 2015) to the GRCm38 mouse reference genome from Ensembl. featureCounts (Liao, Smyth, and Shi 2014) was used to count reads against the Ensembl gene annotation and generate a counts matrix (as described in the previous session).
First we need to read the data into R from the file kept in the data directory.
# Read the data into R
seqdata <- read_tsv("data/GSE60450_Lactation.featureCounts", comment="#")
seqdata
## # A tibble: 55,573 x 18
## Geneid Chr Start End Strand Length MCL1.DG.bam MCL1.DH.bam MCL1.DI.bam
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 ENSMU… 1 3073… 3074… + 1070 0 0 0
## 2 ENSMU… 1 3102… 3102… + 110 0 0 0
## 3 ENSMU… 1;1;… 3205… 3207… -;-;-… 6094 665 448 86
## 4 ENSMU… 1 3252… 3253… + 480 0 0 0
## 5 ENSMU… 1 3365… 3368… - 2819 0 0 0
## 6 ENSMU… 1 3375… 3377… - 2233 0 0 0
## 7 ENSMU… 1 3464… 3467… - 2309 1 0 1
## 8 ENSMU… 1;1 3466… 3466… +;+ 250 0 0 0
## 9 ENSMU… 1 3512… 3514… - 2057 0 2 0
## 10 ENSMU… 1 3531… 3532… + 926 0 0 0
## # … with 55,563 more rows, and 9 more variables: MCL1.DJ.bam <dbl>,
## # MCL1.DK.bam <dbl>, MCL1.DL.bam <dbl>, MCL1.LA.bam <dbl>, MCL1.LB.bam <dbl>,
## # MCL1.LC.bam <dbl>, MCL1.LD.bam <dbl>, MCL1.LE.bam <dbl>, MCL1.LF.bam <dbl>
In the seqdata
object each row represents a gene and columns are:
1 Geneid - Gene Ensembl ID
2-5 Chr, Start, End, Strand - Genomic locations of exons of the gene
6 Length - Transcript length of the gene
7-18 One column for each sample with the number of reads assigned to the gene by featureCounts.
dplyr
One of the most complex aspects of learning to work with data in R
is getting to grips with subsetting and manipulating data tables. The package dplyr
(Wickham et al. 2018) was developed to make this process more intuitive than it is using standard base R
processes. It also makes use of a new symbol %>%
, called the “pipe”, which makes the code a bit tidier.
In particular we will use the commands:
select
to select columns from a tablefilter
to filter rows based on the contents of a column in the tablerename
to rename columnsWe will encounter a few more dplyr
commands during the course, we will explain their use as we come to them.
If you are familiar with R but not dplyr
or tidyverse
then we have a very brief introduction here. A more detailed introduction can be found in our online R course
We will be manipulating and reformating the counts matrix into a suitable format for DESeq2.
The seqdata
object is a dataframe
in which the first six columns contain gene annotation information and the remaining columns the count data.
DESeq2 requires a simple object containing only the count data, we’ll keep the gene ID by setting them as the row names.
We will create a new counts data object, countdata
, that contains only the counts for the 12 samples.
Our sampleinfo
object contains a column with the sample names. We should adjust the column names of our count matrix to match them - we just need to remove the .bam
suffix.
It is also critical to ensure that the samples in the columns are in the same order as the rows of sampleinfo
. When we load these objects into DESeq2 for the analysis it will not guess which row of the sampleinfo
belongs to which column of the counts matrix, it will assume the same order.
We’ll use the new commands:
column_to_rownames
to set the rownames using a named columnrename_all
which allows to rename all the columns using a string functioncountdata <- seqdata %>%
column_to_rownames("Geneid") %>% # turn the geneid column into rownames
rename_all(str_remove, ".bam") %>% # remove the ".bam" from the column names
select(sampleinfo$Sample) %>% # keep sample columns using sampleinfo$Sample
as.matrix()
# check manual for help
# ?tibble::column_to_rownames
# ?dplyr::select
head(countdata)
## MCL1.DG MCL1.DH MCL1.DI MCL1.DJ MCL1.DK MCL1.DL MCL1.LA
## ENSMUSG00000102693 0 0 0 0 0 0 0
## ENSMUSG00000064842 0 0 0 0 0 0 0
## ENSMUSG00000051951 665 448 86 361 559 447 0
## ENSMUSG00000102851 0 0 0 0 0 0 0
## ENSMUSG00000103377 0 0 0 0 1 0 0
## ENSMUSG00000104017 0 0 0 0 0 0 0
## MCL1.LB MCL1.LC MCL1.LD MCL1.LE MCL1.LF
## ENSMUSG00000102693 0 0 0 0 0
## ENSMUSG00000064842 0 0 0 0 0
## ENSMUSG00000051951 0 0 0 2 0
## ENSMUSG00000102851 0 0 0 0 0
## ENSMUSG00000103377 0 0 0 0 0
## ENSMUSG00000104017 0 0 0 0 0
Here, we used str_remove
to remove the unwanted suffix from the column names. The stringr
package has a lots of useful functions for manipulating strings (text), e.g. str_replace
or str_extract
.
For many analysis methods it is advisable to filter out as many genes as possible before the analysis to decrease the impact of multiple testing correction on false discovery rates. This is normally done by filtering out genes with low numbers of reads and thus likely to be uninformative.
With DESeq2
this is however not necessary as it applies independent filtering
during the analysis. On the other hand, some filtering for genes that are very lowly expressed does reduce the size of the data matrix, meaning that less memory is required and processing steps are carried out faster.
We will keep all genes where the total number of reads across all samples is greater than 5.
# check dimension of count matrix
dim(countdata)
## [1] 55573 12
# for each gene, compute total count and compare to threshold
# keeping outcome in vector of 'logicals' (ie TRUE or FALSE, or NA)
keep <- rowSums(countdata) > 5
# summary of test outcome: number of genes in each class:
table(keep, useNA="always")
## keep
## FALSE TRUE <NA>
## 30204 25369 0
# subset genes where test was TRUE
countdata <- countdata[keep,]
# check dimension of new count matrix
dim(countdata)
## [1] 25369 12
Before moving on to doing the actual differential expression analysis it is important to assess the quality of our data.
Differential expression calculations with DESeq2 uses raw read counts as input, but for visualization purposes we use transformed counts.
Why not raw counts? Two issues:
summary(countdata)
## MCL1.DG MCL1.DH MCL1.DI MCL1.DJ
## Min. : 0.0 Min. : 0.0 Min. : 0 Min. : 0.0
## 1st Qu.: 4.0 1st Qu.: 4.0 1st Qu.: 4 1st Qu.: 3.0
## Median : 61.0 Median : 58.0 Median : 56 Median : 51.0
## Mean : 999.9 Mean : 941.5 Mean : 1029 Mean : 959.8
## 3rd Qu.: 644.0 3rd Qu.: 583.0 3rd Qu.: 598 3rd Qu.: 532.0
## Max. :221046.0 Max. :211985.0 Max. :388439 Max. :416741.0
## MCL1.DK MCL1.DL MCL1.LA MCL1.LB
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 3.0 1st Qu.: 3.0 1st Qu.: 3.0 1st Qu.: 3.0
## Median : 43.0 Median : 43.0 Median : 49.0 Median : 54.0
## Mean : 899.3 Mean : 837.7 Mean : 900.3 Mean : 957.4
## 3rd Qu.: 486.0 3rd Qu.: 480.0 3rd Qu.: 625.0 3rd Qu.: 662.0
## Max. :425510.0 Max. :331980.0 Max. :231886.0 Max. :239397.0
## MCL1.LC MCL1.LD MCL1.LE MCL1.LF
## Min. : 0.0 Min. : 0.0 Min. : 0 Min. : 0
## 1st Qu.: 2.0 1st Qu.: 2.0 1st Qu.: 1 1st Qu.: 1
## Median : 29.0 Median : 28.0 Median : 19 Median : 19
## Mean : 975.2 Mean : 963.3 Mean : 1060 Mean : 1059
## 3rd Qu.: 495.0 3rd Qu.: 443.0 3rd Qu.: 290 3rd Qu.: 291
## Max. :1583913.0 Max. :1748384.0 Max. :3995999 Max. :3943245
# few outliers affect distribution visualization
boxplot(countdata, main='Raw counts', las=2)
# Raw counts mean expression Vs standard Deviation (SD)
plot(rowMeans(countdata), rowSds(countdata),
main='Raw counts: sd vs mean',
xlim=c(0,10000),
ylim=c(0,5000))
To avoid problems posed by raw counts, they can be transformed. Several transformation methods exist to limit the dependence of variance on mean gene expression:
Because some genes are not expressed (detected) in some samples, their count are 0
. As log2(0) returns -Inf in R which triggers errors by some functions, we add 1 to every count value to create ‘pseudocounts’. The lowest value then is 1, or 0 on the log2 scale (log2(1) = 0).
# Get log2 counts
logcounts <- log2(countdata + 1)
# summary(logcounts[,1]) # summary for first column
# summary(logcounts) # summary for each column
We will check the distribution of read counts using a boxplot and add some colour to see if there is any difference between sample groups.
# make a colour vector
statusCol <- match(sampleinfo$Status, c("virgin", "pregnant", "lactate")) + 1
# '+1' to avoid color '1' i.e. black
# Check distributions of samples using boxplots
boxplot(logcounts,
xlab="",
ylab="Log2(Counts)",
las=2,
col=statusCol,
main="Log2(Counts)")
# Let's add a blue horizontal line that corresponds to the median
abline(h=median(logcounts), col="blue")
From the boxplots we see that overall the density distributions of raw log-counts are not identical but still not very different. If a sample is really far above or below the blue horizontal line (overall median) we may need to investigate that sample further.
# Log2 counts standard deviation (sd) vs mean expression
plot(rowMeans(logcounts), rowSds(logcounts),
main='Log2 Counts: sd vs mean')
In contrast to raw counts, with log2 transformed counts lowly expressed genes show higher variation.
Variance stabilizing transformation (VST) aims at generating a matrix of values for which variance is constant across the range of mean values, especially for low mean.
The vst
function computes the fitted dispersion-mean relation, derives the transformation to apply and accounts for library size.
vst_counts <- vst(countdata)
## converting counts to integer mode
# make a colour vector
statusCol <- match(sampleinfo$Status, c("virgin", "pregnant", "lactate")) + 1
# Check distributions of samples using boxplots
boxplot(vst_counts,
xlab="",
ylab="VST counts",
las=2,
col=statusCol)
# Let's add a blue horizontal line that corresponds to the median
abline(h=median(vst_counts), col="blue")
# VST counts standard deviation (sd) vs mean expression
plot(rowMeans(vst_counts), rowSds(vst_counts),
main='VST counts: sd vs mean')
Challenge 1
- Use the
DESeq2
functionrlog
to transform the count data. This function also normalises for library size.- Plot the count distribution boxplots with this data
How has this affected the count distributions?
A principal component analysis (PCA) is an example of an unsupervised analysis, where we don’t specify the grouping of the samples. If the experiment is well controlled and has worked well, we should find that replicate samples cluster closely, whilst the greatest sources of variation in the data should be between treatments/sample groups. It is also an incredibly useful tool for checking for outliers and batch effects.
To run the PCA we should first normalise our data for library size and transform to a log scale. DESeq2 provides two separate commands to do this (vst
and rlog
). Here we will use the command rlog
. rlog
performs a log2 scale transformation in a way that compensates for differences between samples for genes with low read count and also normalizes between samples for library size.
You can read more about rlog
, its alternative vst
and the comparison between the two here.
To plot the PCA results we will use the autoplot
function from the ggfortify
package (Tang, Horikoshi, and Li 2016). ggfortify
is built on top of ggplot2
and is able to recognise common statistical objects such as PCA results or linear model results and automatically generate summary plot of the results in an appropriate manner.
library(ggfortify)
rlogcounts <- rlog(countdata)
# run PCA
pcDat <- prcomp(t(rlogcounts))
# plot PCA
autoplot(pcDat)
We can use colour and shape to identify the Cell Type and the Status of each sample.
autoplot(pcDat,
data = sampleinfo,
colour="CellType",
shape="Status",
size=5)
Discussion
What does the PCA plot tell us?
Let’s identify these samples. The package ggrepel
allows us to add text to the plot, but ensures that points that are close together don’t have their labels overlapping (they repel each other).
library(ggrepel)
# setting shape to FALSE causes the plot to default to using the labels instead of points
autoplot(pcDat,
data = sampleinfo,
colour="CellType",
shape="Status",
size=5) +
geom_text_repel(aes(x=PC1, y=PC2, label=Sample), box.padding = 0.8)
The mislabelled samples are MCL1.DG, which is labelled as luminal but should be basal, and MCL1.LA, which is labelled as basal but should be luminal. Let’s fix the sample sheet.
We’re going to use another dplyr
command: mutate
. mutate
creates new columns in the data frame.
sampleinfo <- sampleinfo %>%
mutate(CellType=ifelse(Sample=="MCL1.DG", "basal", CellType)) %>%
mutate(CellType=ifelse(Sample=="MCL1.LA", "luminal", CellType))
…and export it so that we have the correct version for later use.
write_tsv(sampleinfo, "results/SampleInfo_Corrected.txt")
Let’s look at the PCA now.
autoplot(pcDat,
data = sampleinfo,
colour="CellType",
shape="Status",
size=5)
Replicate samples from the same group cluster together in the plot, while samples from different groups form separate clusters. This indicates that the differences between groups are larger than those within groups. The biological signal of interest is stronger than the noise (biological and technical) and can be detected. Also, differences between virgin, pregnant and lactating are greater for luminal than basal cells.
Clustering in the PCA plot can be used to motivate changes to the design matrix in light of potential batch effects. For example, imagine that the first replicate of each group was prepared at a separate time from the second replicate. If the PCA plot showed separation of samples by time, it might be worthwhile including time in the downstream analysis to account for the time-based effect.
We can save a few data objects to use later so we don’t have to rerun everything
save(countdata, sampleinfo, file="results/preprocessing.RData")
Fu, Nai Yang, Anne C Rios, Bhupinder Pal, Rina Soetanto, Aaron T L Lun, Kevin Liu, Tamara Beck, et al. 2015. “EGF-mediated induction of Mcl-1 at the switch to lactation is essential for alveolar cell survival.” Nature Cell Biology 17 (4): 365–75. https://doi.org/10.1038/ncb3117.
Kim, Daehwan, Ben Langmead, and Steven L. Salzberg. 2015. “HISAT: A Fast Spliced Aligner with Low Memory Requirements.” Nature Methods 12 (March): 357 EP. http://dx.doi.org/10.1038/nmeth.3317.
Liao, Yang, Gordon K Smyth, and Wei Shi. 2014. “featureCounts: an efficient general purpose program for assigning sequence reads to genomic features.” Bioinformatics (Oxford, England) 30 (7): 923–30. https://doi.org/10.1093/bioinformatics/btt656.
Tang, Yuan, Masaaki Horikoshi, and Wenxuan Li. 2016. “Ggfortify: Unified Interface to Visualize Statistical Result of Popular R Packages.” The R Journal 8 (2). https://journal.r-project.org/.
Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2018. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.