In this section we will begin the process of analysing the RNAseq in R. In the next section we will use DESeq2 for differential analysis. Before we do that we need to:

We will also look at the effects of normalisation for composition bias.

A detailed analysis workflow, recommended by the authors of DESeq2 can be found on the Bionconductor website.

Data import

First, let’s load all the packages we will need to analyse the data.


Mouse mammary gland dataset

The data for this tutorial comes from a Nature Cell Biology paper, EGF-mediated induction of Mcl-1 at the switch to lactation is essential for alveolar cell survival (Fu et al. 2015). The raw data (sequence reads) can be downloaded from SRA under SRP045534, and processed data (counts) can be downloaded from Gene Expression Omnibus database (GEO) under accession number GSE60450. Please see Supplementary material for instructions on downloading raw files from SRA and aligning fastq using HISAT2.

This study examines the expression profiles of basal stem-cell enriched cells (B) and committed luminal cells (L) in the mammary gland of virgin, pregnant and lactating mice. Six groups are present, with one for each combination of cell type and mouse status. Each group contains two biological replicates.

Reading in the sample metadata

The sampleinfo file contains basic information about the samples that we will need for the analysis today.

# Read the sample information into a data frame
sampleinfo <- read_tsv("data/SampleInfo.txt")
## # A tibble: 12 x 5
##    FileName    Sample  CellType Status   Group           
##    <chr>       <chr>   <chr>    <chr>    <chr>           
##  1 MCL1.DG.bam MCL1.DG luminal  virgin  
##  2 MCL1.DH.bam MCL1.DH basal    virgin    
##  3 MCL1.DI.bam MCL1.DI basal    pregnant basal.pregnant  
##  4 MCL1.DJ.bam MCL1.DJ basal    pregnant basal.pregnant  
##  5 MCL1.DK.bam MCL1.DK basal    lactate  basal.lactate   
##  6 MCL1.DL.bam MCL1.DL basal    lactate  basal.lactate   
##  7 MCL1.LA.bam MCL1.LA basal    virgin    
##  8 MCL1.LB.bam MCL1.LB luminal  virgin  
##  9 MCL1.LC.bam MCL1.LC luminal  pregnant luminal.pregnant
## 10 MCL1.LD.bam MCL1.LD luminal  pregnant luminal.pregnant
## 11 MCL1.LE.bam MCL1.LE luminal  lactate  luminal.lactate 
## 12 MCL1.LF.bam MCL1.LF luminal  lactate  luminal.lactate

Reading in the count data

The raw reads were aligned using HISAT2 (Kim, Langmead, and Salzberg 2015) to the GRCm38 mouse reference genome from Ensembl. featureCounts (Liao, Smyth, and Shi 2014) was used to count reads against the Ensembl gene annotation and generate a counts matrix (as described in Section 1).

First we need to read the data into R from the file in the data directory.

# Read the data into R
seqdata <- read_tsv("data/GSE60450_Lactation.featureCounts", comment="#")
## # A tibble: 55,573 x 18
##    Geneid Chr   Start End   Strand Length MCL1.DG.bam MCL1.DH.bam
##    <chr>  <chr> <chr> <chr> <chr>   <dbl>       <dbl>       <dbl>
##  1 ENSMU… 1     3073… 3074… +        1070           0           0
##  2 ENSMU… 1     3102… 3102… +         110           0           0
##  3 ENSMU… 1;1;… 3205… 3207… -;-;-…   6094         665         448
##  4 ENSMU… 1     3252… 3253… +         480           0           0
##  5 ENSMU… 1     3365… 3368… -        2819           0           0
##  6 ENSMU… 1     3375… 3377… -        2233           0           0
##  7 ENSMU… 1     3464… 3467… -        2309           1           0
##  8 ENSMU… 1;1   3466… 3466… +;+       250           0           0
##  9 ENSMU… 1     3512… 3514… -        2057           0           2
## 10 ENSMU… 1     3531… 3532… +         926           0           0
## # … with 55,563 more rows, and 10 more variables: MCL1.DI.bam <dbl>,
## #   MCL1.DJ.bam <dbl>, MCL1.DK.bam <dbl>, MCL1.DL.bam <dbl>,
## #   MCL1.LA.bam <dbl>, MCL1.LB.bam <dbl>, MCL1.LC.bam <dbl>,
## #   MCL1.LD.bam <dbl>, MCL1.LE.bam <dbl>, MCL1.LF.bam <dbl>

In the seqdata object each row represents a gene. The columns contains:

1 Geneid - Ensembl ID
2-5 Chr, Start, End, Strand - Genomic locations of exons of the gene
6 Length - Transcript length of the gene
7-18 One column for each sample with the count of how many reads were assigned
to each gene by featureCounts.

Format the data

We will be manipulating and reformating the counts matrix into a suitable format for DESeq2.

The seqdata is a dataframe in which the first six columns contain annotation information and the remaining columns contain the count data.

DESeq2 requires a simple object containing only the count data, we’ll keep the gene ID by setting them as the row names.

Let’s create new counts data object, countdata, that contains only the counts for the 12 samples.

Our sampleinfo object contains a column with the sample names. We should adjust the column names of our matrix to match them - we just need to remove the .bam suffix.

It is also critical to ensure that the samples in the columns are in the same order as the rows of sampleinfo. When we load these objects into DESeq2 for the analysis it will not guess which row of the sampleinfo belongs to which column of the counts matrix, it will assume the same order.

We’ll use to new dplyr commands:

  • columns_to_rownames to set the rownames using a named column
  • rename_all which allows to rename all the columns using a string function
countdata <- seqdata %>%
    column_to_rownames("Geneid") %>% # turn the geneid column into rownames
    rename_all(str_remove, ".bam") %>% # remove the ".bam" from the column names
    select(sampleinfo$Sample) %>% # keep sample columns using sampleinfo$Sample

##                    MCL1.DG MCL1.DH MCL1.DI MCL1.DJ MCL1.DK MCL1.DL MCL1.LA
## ENSMUSG00000102693       0       0       0       0       0       0       0
## ENSMUSG00000064842       0       0       0       0       0       0       0
## ENSMUSG00000051951     665     448      86     361     559     447       0
## ENSMUSG00000102851       0       0       0       0       0       0       0
## ENSMUSG00000103377       0       0       0       0       1       0       0
## ENSMUSG00000104017       0       0       0       0       0       0       0
##                    MCL1.LB MCL1.LC MCL1.LD MCL1.LE MCL1.LF
## ENSMUSG00000102693       0       0       0       0       0
## ENSMUSG00000064842       0       0       0       0       0
## ENSMUSG00000051951       0       0       0       2       0
## ENSMUSG00000102851       0       0       0       0       0
## ENSMUSG00000103377       0       0       0       0       0
## ENSMUSG00000104017       0       0       0       0       0

Here, we used str_remove to remove the unwanted suffix from the column names. The stringr package has a lots of useful functions for manipulating strings (text), e.g. str_replace or str_extract.

Filtering the genes

For many analysis methods it is advisable to filter out as many genes as possible prior to starting the analysis in order to decrease the impact on fasle discovery rates when applying multiple testing correction. This is normally done by filtering out genes with low numbers of reads, which are likely to be uninformative.

With DESeq this is not necessary as it applies a process it calls independent filtering during the analysis process. On the other hand, some filtering for genes that are very lowly expressed does reduce the size of the data matrix, meaning that less memory is required and processing steps are carried out faster.

We will keep all genes where the total number of reads across all samples is greater than 5.

## [1] 55573    12
keep <- rowSums(countdata) > 5
countdata <- countdata[keep,]
## [1] 25369    12

Quality assessment

Before moving on to doing the actually differential expression analysis it important do assess the quality of our data.

Library sizes bar plot

First, we can plot how many reads we have for each sample. Whilst normalisation can account for imbalance in coverage across the samples, extreme differences may be indicative of underlying problems in the samples.

librarySizes <- colSums(countdata)
        main="Barplot of library sizes")
abline(h=20e6, lty=2)

Count distribution boxplots

Count data is not normally distributed, so if we want to examine the distributions of the raw counts it is helpful to transform the data on to a log scale. Typically we use a log2 transformation, however, because the data is count data and will contain many 0s we need to add a count of 1 to every value in order to prevent attempting log2(0) from creating errors.

# Get log2 counts
logcounts <- log2(countdata + 1)

We’ll check the distribution of read counts using a boxplot and well add some colour to see if there is any difference between sample groups.

# make a colour vector
statusCol <- match(sampleinfo$Status, c("virgin", "pregnant", "lactate")) + 1

# Check distributions of samples using boxplots
# Let's add a blue horizontal line that corresponds to the median
abline(h=median(as.matrix(logcounts)), col="blue")

From the boxplots we see that overall the density distributions of raw log-intensities are not identical but still not very different. If a sample is really far above or below the blue horizontal line we may need to investigate that sample further.

Challenge 1

  1. Use the DESeq2 function rlog to transform the count data. This function also normalises for library size.
  2. Plot the count distribution boxplots with this data
    How has this effected the count distributions?

Principle Component Analysis

A principle components analysis (PCA) is an example of an unsupervised analysis, where we don’t specify the grouping of the samples. If your experiment is well controlled and has worked well, we should that replicate samples cluster closely, whilst the greatest sources of variation in the data should be between treatments/sample groups. It is also an incredibly useful tool for checking for outliers and batch effects.

To run the PCA we should first normalise our data for library size and transform to a log scale.DESeq2 provides two commands that can be used to do this, here we will use the command rlog. rlog performs a log2 scale transformation in a way that compensates for differences between samples for genes with low read count and also normalizes between samples for library size.

You can read more about rlog, it’s alternative vst and the comparison between the two here.

To plot the PCA results we will use the autoplot function from the ggfortify package (Tang, Horikoshi, and Li 2016). ggfortify is built on top of ggplot2 and is able to recognise common statistical objects such as PCA results or linear model results and automatically generate summary plot of the results in an appropriate manner.


rlogcounts <- rlog(countdata)

# run PCA
pcDat <- prcomp(t(rlogcounts))
# plot PCA

# We can use colour and shape to identify the Cell Type and the Status of each
# sample
         data = sampleinfo, 


Look at the last PCA plot. What is the greatest source of variation? Is there something strange going on with the samples? Let’s identify these samples:

# setting shape to FALSE causes the plot to default to using the labels
         data = sampleinfo,