Introduction to Differential Gene Expression Analysis with Bulk RNAseq

November 2024

Introduction

In this session we will:

import our counts into R
look at the effects of variance and how to mitigate this with data transformation
do some initial exploration of the raw count data using principle component analysis

Sample metadata

First we load a table of sample metadata

library(tidyverse)
sampleinfo <- read_tsv("data/samplesheet.tsv", col_types = c("cccc"))
arrange(sampleinfo, Status, TimePoint, Replicate)

## # A tibble: 12 × 4
##    SampleName Replicate Status     TimePoint
##    <chr>      <chr>     <chr>      <chr>    
##  1 SRR7657878 1         Infected   d11      
##  2 SRR7657881 2         Infected   d11      
##  3 SRR7657880 3         Infected   d11      
##  4 SRR7657874 1         Infected   d33      
##  5 SRR7657882 2         Infected   d33      
##  6 SRR7657872 3         Infected   d33      
##  7 SRR7657877 1         Uninfected d11      
##  8 SRR7657876 2         Uninfected d11      
##  9 SRR7657879 3         Uninfected d11      
## 10 SRR7657883 1         Uninfected d33      
## 11 SRR7657873 2         Uninfected d33      
## 12 SRR7657875 3         Uninfected d33

Reading in the count data

Salmon was used to quantify gene expression from raw reads.
We need to read the data into R from the quant.sf files
We use the tximport package to do this
Salmon quantified expression at transcript level, we need to summarise this to gene level …
… so we also need a table relating Transcript IDs to Gene IDs

tx2gene <- read_tsv("references/tx2gene.tsv")
tx2gene

## # A tibble: 119,414 × 2
##   TxID                 GeneID            
##   <chr>                <chr>             
## 1 ENSMUST00000177564.1 ENSMUSG00000096176
## 2 ENSMUST00000196221.1 ENSMUSG00000096749
## 3 ENSMUST00000179664.1 ENSMUSG00000096749
## 4 ENSMUST00000178537.1 ENSMUSG00000095668
## 5 ENSMUST00000178862.1 ENSMUSG00000094569
## 6 ENSMUST00000179520.1 ENSMUSG00000094028
## # ℹ 119,408 more rows

Reading in the count data

library(tximport)
txi <- tximport(salmon_files, type = "salmon", tx2gene = tx2gene)
str(txi)

## List of 4
##  $ abundance          : num [1:35896, 1:12] 20.39 0 1.97 1.06 0.95 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:35896] "ENSMUSG00000000001" "ENSMUSG00000000003" "ENSMUSG00000000028" "ENSMUSG00000000037" ...
##   .. ..$ : chr [1:12] "SRR7657878" "SRR7657881" "SRR7657880" "SRR7657874" ...
##  $ counts             : num [1:35896, 1:12] 1039 0 65 39 8 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:35896] "ENSMUSG00000000001" "ENSMUSG00000000003" "ENSMUSG00000000028" "ENSMUSG00000000037" ...
##   .. ..$ : chr [1:12] "SRR7657878" "SRR7657881" "SRR7657880" "SRR7657874" ...
##  $ length             : num [1:35896, 1:12] 2903 541 1883 2098 480 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:35896] "ENSMUSG00000000001" "ENSMUSG00000000003" "ENSMUSG00000000028" "ENSMUSG00000000037" ...
##   .. ..$ : chr [1:12] "SRR7657878" "SRR7657881" "SRR7657880" "SRR7657874" ...
##  $ countsFromAbundance: chr "no"

The counts matrix

class(txi$counts)

## [1] "matrix" "array"

typeof(txi$counts)

## [1] "double"

txi$counts[1:8, 1:5]

##                    SRR7657878 SRR7657881 SRR7657880 SRR7657874 SRR7657882
## ENSMUSG00000000001   1039.000   1005.889    892.000    917.360   1136.690
## ENSMUSG00000000003      0.000      0.000      0.000      0.000      0.000
## ENSMUSG00000000028     65.000     73.999     72.000     44.000     46.000
## ENSMUSG00000000037     39.000     47.000     29.000     53.999     67.000
## ENSMUSG00000000049      8.000      9.000      4.000      4.000      4.000
## ENSMUSG00000000056   2163.469   2067.819   2006.925   1351.675   2367.801
## ENSMUSG00000000058    555.000    470.000    619.000    434.326    725.263
## ENSMUSG00000000078   1324.000   1456.202   1752.178   1125.101   1288.000

Total counts per sample

Ideally we look for > 20 million reads per sample (more if we are interested in very lowly expressed genes)
Samples with < 10 million reads may be problematic

Distribution of counts per gene

There are a large number of genes with very low counts, we first removed these before carrying on with data exploration.

rawCounts <- round(txi$counts, 0) 
keep <- rowSums(rawCounts) > 5
filtCounts <- rawCounts[keep,]

Distribution of counts per gene

There are a large number of genes with very low counts, we first removed these before carrying on with data exploration.
We also need to tranform the data because:
- There is a very large range in the raw counts:

summary(filtCounts[,1:4])

##    SRR7657878       SRR7657881       SRR7657880       SRR7657874    
##  Min.   :     0   Min.   :     0   Min.   :     0   Min.   :     0  
##  1st Qu.:    14   1st Qu.:    17   1st Qu.:    15   1st Qu.:    22  
##  Median :   327   Median :   351   Median :   333   Median :   346  
##  Mean   :  1387   Mean   :  1346   Mean   :  1330   Mean   :  1200  
##  3rd Qu.:  1305   3rd Qu.:  1297   3rd Qu.:  1268   3rd Qu.:  1193  
##  Max.   :652318   Max.   :590723   Max.   :435516   Max.   :444448

Distribution of counts per gene

There are a large number of genes with very low counts, we first removed these before carrying on with data exploration.
We also need to tranform the data because:
- There is a very large range in the raw counts:

Distribution of counts per gene

There are a large number of genes with very low counts, we first removed these before carrying on with data exploration.
We also need to tranform the data because:
- There is a very large range in the raw counts
- There is a relationship between expression and variance:

Data transformation

To avoid problems posed by raw counts, they can be transformed. DESeq2 provides 2 transformation methods:

VST : variance stabilizing transformation
rlog : regularized log transformation

Both log2 transform the data and attempt to stabilise the variance with regard to the mean expression of the gene.

rlog is slower but is more suitable when there is a very large range of counts across samples (10x difference or more)
VST is faster and less sensitive to large count outliers
Either is suitable but generally use rlog on small dataset and VST on larger ones

Data transformation

rlogCounts <- rlog(filtCounts)

Distibution of transformed counts

Principal Component Analysis

A principal component analysis (PCA) is an example of an unsupervised analysis
If the experiment has worked well we should find that:
- replicate samples cluster closely
- the greatest sources of variation should be between treatments/sample groups
It is also an incredibly useful tool for checking for outliers and batch effects.
We use the top most variable genes for the PCA (in this case I have used the top 500).

Principal Component Analysis

Principal Component Analysis - Sample Swap

Principal Component Analysis - Fixed

sampleinfo <- mutate(sampleinfo, Status=case_when(
                                          SampleName=="SRR7657882" ~ "Uninfected",
                                          SampleName=="SRR7657873" ~ "Infected", 
                                          TRUE ~ Status))

More exploratory plots

See the Extended Materials …