Introduction to Differential Gene Expression Analysis with Bulk RNAseq

November 2024

RNAseq Workflow

Library Preparation

Sequencing

Bioinformatics Analysis

Image adapted from: Wang, Z., et al. (2009), Nature Reviews Genetics, 10, 57–63.

Library preparation

RNA → Reverse Transcription → cDNA …
Fragmentation - short fragments ~200-300 nt …
Adapter and Index binding …
PCR Amplification.

Sequencing

Bioinformatics Analysis - Preprocessing

Fastq file format
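A FASTQ record occupies four lines: an identifier starting with `@`, the base calls, a `+` separator, and one quality character per base. The sketch below decodes a single record in base R, assuming the common Phred+33 encoding; the read name and bases are invented for illustration.

```r
# A made-up FASTQ record (four lines); the read name and bases are invented.
record <- c("@read_1",
            "GATTACA",
            "+",
            "IIIIH#5")

read_id <- sub("^@", "", record[1])
bases   <- record[2]

# Phred+33 encoding: quality score = ASCII code of the character minus 33
phred <- utf8ToInt(record[4]) - 33L

# Error probability implied by each score: p = 10^(-Q/10)
error_prob <- 10^(-phred / 10)

data.frame(base = strsplit(bases, "")[[1]],
           phred = phred,
           error_prob = error_prob)
```

A score of 40 corresponds to an error probability of 1 in 10,000, while the `#` character (score 2) marks a base call that should not be trusted.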

QC with FastQC

https://www.bioinformatics.babraham.ac.uk/projects/fastqc/

Alignment based quantification

Quantification with Quasi-mapping (Salmon)

QC of aligned reads

Alignment Rate
Duplication Rate
Insert Size
Transcript coverage

Picard Tools:

https://broadinstitute.github.io/picard/

QC of aligned reads - Transcript coverage

Bioinformatics Analysis - Data Exploration

Reading in the count data

library(tximport)
txi <- tximport(salmon_files, type = "salmon", tx2gene = tx2gene)
str(txi)

## List of 4
##  $ abundance          : num [1:35896, 1:12] 20.381 0 1.966 1.059 0.949 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:35896] "ENSMUSG00000000001" "ENSMUSG00000000003" "ENSMUSG00000000028" "ENSMUSG00000000037" ...
##   .. ..$ : chr [1:12] "SRR7657878" "SRR7657881" "SRR7657880" "SRR7657874" ...
##  $ counts             : num [1:35896, 1:12] 1039 0 65 39 8 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:35896] "ENSMUSG00000000001" "ENSMUSG00000000003" "ENSMUSG00000000028" "ENSMUSG00000000037" ...
##   .. ..$ : chr [1:12] "SRR7657878" "SRR7657881" "SRR7657880" "SRR7657874" ...
##  $ length             : num [1:35896, 1:12] 2905 541 1884 2100 480 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:35896] "ENSMUSG00000000001" "ENSMUSG00000000003" "ENSMUSG00000000028" "ENSMUSG00000000037" ...
##   .. ..$ : chr [1:12] "SRR7657878" "SRR7657881" "SRR7657880" "SRR7657874" ...
##  $ countsFromAbundance: chr "no"
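The `tximport` call above relies on two inputs that are not shown: `salmon_files` and `tx2gene`. A minimal sketch of how they might be built is given here. The `salmon` folder layout and the transcript IDs are assumptions made for illustration; only the sample names come from the output above.

```r
# Sample names taken from the str() output; the directory layout is assumed.
samples <- c("SRR7657878", "SRR7657881", "SRR7657880", "SRR7657874")

# Salmon writes one quant.sf per sample; "salmon" is a hypothetical output folder.
salmon_files <- file.path("salmon", samples, "quant.sf")
names(salmon_files) <- samples   # names become the column names of the matrices

# tx2gene: two columns, transcript ID first, then gene ID.
# The transcript IDs below are invented placeholders.
tx2gene <- data.frame(
  TxID   = c("ENSMUST_A", "ENSMUST_B", "ENSMUST_C"),
  GeneID = c("ENSMUSG00000000001", "ENSMUSG00000000001", "ENSMUSG00000000003")
)

salmon_files
```

Naming the file vector matters: it is what produces sample IDs rather than generic column labels in the `abundance`, `counts` and `length` matrices.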

Total counts per sample
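Total counts per sample (library size) is simply the column sum of the count matrix. The sketch below uses a small invented matrix; with the object loaded earlier, the equivalent would presumably be `colSums(txi$counts)`.

```r
# Toy count matrix: 4 genes x 3 samples (values invented for illustration)
counts <- matrix(c(1039, 0, 65, 39,
                    980, 2, 70, 41,
                    510, 0, 30, 22),
                 nrow = 4,
                 dimnames = list(paste0("gene", 1:4), paste0("sample", 1:3)))

# Library size = sum of counts in each column
lib_sizes <- colSums(counts)
lib_sizes

# Library sizes relative to the median, to spot unusually shallow samples
rel_depth <- lib_sizes / median(lib_sizes)
```

Here `sample3` has roughly half the depth of the other two, which is the kind of difference that normalization has to account for later.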

Distribution of counts per gene

VST : variance stabilizing transformation
rlog : regularized log transformation

rlogCounts <- rlog(filtCounts)
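To see why a transformation is needed before exploratory plots, the sketch below simulates a lowly and a highly expressed gene and compares their spread before and after taking logs. It uses `log2(count + 1)` as a simple stand-in; this is not the VST or rlog algorithm, and all simulation parameters are invented.

```r
set.seed(1)
# Six replicates each for a low and a high expressed gene
# (negative binomial, parameters invented for illustration)
low  <- rnbinom(6, mu = 20,   size = 10)
high <- rnbinom(6, mu = 2000, size = 10)

# Raw counts: the spread grows with the mean
sd_raw <- c(low = sd(low), high = sd(high))

# log2(count + 1): a simple stand-in for VST/rlog, not the same algorithm
sd_log <- c(low = sd(log2(low + 1)), high = sd(log2(high + 1)))

rbind(sd_raw, sd_log)
```

On the raw scale the highly expressed gene dominates the variance; on the log scale the two genes are comparable. VST and rlog go further by also damping the noise of low-count genes.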

Principal Component Analysis
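A PCA on transformed counts can be sketched with base R's `prcomp`. The data below are simulated (two groups of three samples, with the first ten genes shifted in one group), so the group labels and effect size are assumptions, not values from the course data set.

```r
set.seed(42)
# Toy log-scale expression: 50 genes x 6 samples, two groups of three (simulated)
group <- factor(rep(c("Uninfected", "Infected"), each = 3))
expr  <- matrix(rnorm(50 * 6, mean = 8), nrow = 50,
                dimnames = list(paste0("gene", 1:50), paste0("s", 1:6)))

# Shift the first 10 genes in the Infected group so there is a signal to find
expr[1:10, group == "Infected"] <- expr[1:10, group == "Infected"] + 5

# prcomp expects samples in rows, so transpose the genes x samples matrix
pca <- prcomp(t(expr), center = TRUE, scale. = FALSE)

# Proportion of variance explained by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# PC1 scores per group: the two groups should separate along this axis
split(pca$x[, "PC1"], group)
```

With a DESeq2 object, the same idea would typically be applied to the matrix of rlog or VST values rather than to simulated data.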

Bioinformatics Analysis - Differential Gene Expression Analysis

DESeq2 analysis workflow

Normalization
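DESeq2 normalizes for sequencing depth with the median-of-ratios method. The sketch below works through the calculation by hand in base R on an invented matrix in which sample B is sequenced twice as deeply as sample A and one gene is genuinely up-regulated. It assumes no zero counts; DESeq2 itself leaves out genes with a zero in any sample when computing the reference.

```r
# Toy counts: sample B has twice the depth of A; geneX is truly higher in B
counts <- cbind(A = c(gene1 = 100, gene2 = 50,  gene3 = 10, geneX = 20),
                B = c(gene1 = 200, gene2 = 100, gene3 = 20, geneX = 400))

# 1. Per-gene geometric mean across samples (a pseudo-reference sample), on the log scale
log_geo_mean <- rowMeans(log(counts))

# 2. Ratio of each sample to the reference; 3. median ratio per sample = size factor
size_factors <- apply(counts, 2, function(col) {
  exp(median(log(col) - log_geo_mean))
})

# 4. Divide each column by its size factor
norm_counts <- sweep(counts, 2, size_factors, "/")

size_factors
norm_counts
```

Because the median is used, the single strongly changed gene does not distort the size factors: after normalization gene1 to gene3 are equal between samples, while geneX keeps its ten-fold difference.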

Differential Expression - Modelling population distributions

Differential Expression - estimating dispersion
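Under the negative binomial model the variance of a gene's counts is mu + alpha * mu^2, where alpha is the dispersion. The sketch below simulates counts with a known dispersion and recovers it with a simple method-of-moments estimate. This is only an illustration of what dispersion means: DESeq2 estimates it by likelihood and shrinks gene-wise values towards a fitted trend, which is not reproduced here. The mean and dispersion values are invented.

```r
set.seed(7)
true_alpha <- 0.2   # dispersion used for the simulation (invented)
mu         <- 500   # mean count (invented)

# In rnbinom's mu/size parameterisation, size = 1 / dispersion
y <- rnbinom(2000, mu = mu, size = 1 / true_alpha)

# Negative binomial variance: mu + alpha * mu^2, far above the Poisson variance (mu)
expected_var <- mu + true_alpha * mu^2

# Method-of-moments estimate of the dispersion from the sample mean and variance
alpha_hat <- (var(y) - mean(y)) / mean(y)^2

c(poisson_var = mu, nb_var = expected_var, observed_var = var(y), alpha_hat = alpha_hat)
```

The estimate is stable here only because 2000 observations are simulated. With three or four replicates per group it would be very noisy, which is the motivation for sharing information across genes.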

GLM for Differential Expression Analysis

One factor - three levels
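For a single factor with three levels, the design matrix has an intercept plus one column per non-reference level. A minimal sketch with hypothetical group labels A, B and C:

```r
# Hypothetical one-factor design: three groups, two replicates each
group <- factor(rep(c("A", "B", "C"), each = 2))

design <- model.matrix(~ group)
colnames(design)
design
```

The intercept estimates the reference level (A, the first factor level), and `groupB` and `groupC` estimate the differences B vs A and C vs A. Comparing B with C therefore needs a contrast between the two coefficients rather than a single column.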

Two factors - two levels each - Additive Model
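The additive model can be illustrated with `model.matrix`. The factor names follow the case study (`TimePoint`, `Status`); the level `d11` and the two-replicate layout are assumptions made for the example.

```r
# Two factors with two levels each; two replicates per combination (hypothetical layout)
samples <- expand.grid(rep = 1:2,
                       Status = c("Uninfected", "Infected"),
                       TimePoint = c("d11", "d33"))
samples$Status    <- factor(samples$Status, levels = c("Uninfected", "Infected"))
samples$TimePoint <- factor(samples$TimePoint, levels = c("d11", "d33"))

# Additive model: one effect per factor, no interaction
design_add <- model.matrix(~ TimePoint + Status, data = samples)
colnames(design_add)
```

With only three coefficients for four groups, the additive model assumes the infection effect is the same at both time points.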

Two factors - two levels each - Interaction Model
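Adding an interaction term gives the model one coefficient per group, so the infection effect is allowed to differ between time points. The sketch below again uses hypothetical samples with the case study's factor names; the level `d11` is an assumption.

```r
# Two factors with two levels each; two replicates per combination (hypothetical layout)
samples <- expand.grid(rep = 1:2,
                       Status = c("Uninfected", "Infected"),
                       TimePoint = c("d11", "d33"))
samples$Status    <- factor(samples$Status, levels = c("Uninfected", "Infected"))
samples$TimePoint <- factor(samples$TimePoint, levels = c("d11", "d33"))

# Interaction model: main effects plus TimePoint:Status
design_int <- model.matrix(~ TimePoint + Status + TimePoint:Status, data = samples)
colnames(design_int)

# The interaction column is 1 only for samples that are both d33 and Infected
interaction_samples <- samples[design_int[, "TimePointd33:StatusInfected"] == 1, ]

# Syntactically valid version of the interaction name (":" becomes ".")
make.names(colnames(design_int)[4])
```

The last line shows where a coefficient name such as `TimePointd33.StatusInfected` comes from once the colon is replaced to make the name syntactically valid.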

Multiple testing correction

Testing thousands of genes at once increases the chance of false positive results.
We therefore adjust the p-values using the Benjamini-Hochberg procedure, which controls the false discovery rate (FDR).
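The Benjamini-Hochberg adjustment is available in base R as `p.adjust(method = "BH")`. The sketch below applies it to five invented p-values and repeats the calculation by hand to show what it does.

```r
# Five invented raw p-values, one per gene
p <- c(0.001, 0.008, 0.039, 0.041, 0.6)

# Built-in Benjamini-Hochberg adjustment
padj <- p.adjust(p, method = "BH")

# The same by hand: p * n / rank, then enforce monotonicity from the largest p down
n      <- length(p)
ord    <- order(p, decreasing = TRUE)
manual <- numeric(n)
manual[ord] <- pmin(1, cummin(p[ord] * n / (n:1)))

data.frame(p = p, padj = padj, manual = manual)
```

Four of the raw p-values are below 0.05, but only two adjusted values are, which illustrates how the correction trims borderline results.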

Case Study

Applying using DESeq2

Load Data

library(DESeq2)
library(tidyverse)  # read_tsv, %>%, mutate, fct_relevel
txiObj <- readRDS("RObjects/txi.rds")
sampleinfo <- read_tsv("data/samplesheet_corrected.tsv", col_types="cccc") %>%
  mutate(Status = fct_relevel(Status, "Uninfected"))

Define model

model <- as.formula(~ TimePoint + Status + TimePoint:Status)

Create DESeqDataSet object

ddsObj <- DESeqDataSetFromTximport(txi = txiObj,
                                   colData = sampleinfo,
                                   design = model)

Filter out uninformative genes

keep <- rowSums(counts(ddsObj)) > 5
ddsObj <- ddsObj[keep,]
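The filter keeps genes whose total count across all samples is greater than 5. A small sketch on an invented matrix shows the effect; a plain matrix stands in for `counts(ddsObj)` here.

```r
# Toy count matrix: 4 genes x 3 samples (invented values)
counts <- matrix(c( 0,  0,  0,
                    2,  1,  1,
                    3,  2,  1,
                   50, 61, 48),
                 ncol = 3, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:4), paste0("s", 1:3)))

# Keep genes whose total count across all samples exceeds 5
keep     <- rowSums(counts) > 5
filtered <- counts[keep, , drop = FALSE]

table(keep)
filtered
```

Genes with almost no reads cannot be tested meaningfully, so removing them reduces the multiple testing burden without losing information.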

Run DESeq workflow: estimate size factors, estimate dispersion, run GLM

ddsObj <- DESeq(ddsObj)

Extract results

results.day11 <- results(ddsObj,
                         name="Status_Infected_vs_Uninfected",
                         alpha=0.05)

results.day33 <- results(ddsObj,
                         contrast = list(c("Status_Infected_vs_Uninfected", "TimePointd33.StatusInfected")),
                         alpha=0.05)
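The two `results` calls differ because, in an interaction model, the `Status` coefficient is the infection effect at the reference time point only. The effect at d33 is the sum of the `Status` coefficient and the interaction coefficient. The sketch below shows this arithmetic with an ordinary linear model on invented log-expression values for one gene; `lm` is used as a stand-in for DESeq2's negative binomial GLM, and the level `d11` is an assumption.

```r
# Invented log-expression values for one gene; group means are 5, 6, 7 and 10
d <- data.frame(
  TimePoint = factor(rep(c("d11", "d33"), each = 4), levels = c("d11", "d33")),
  Status    = factor(rep(rep(c("Uninfected", "Infected"), each = 2), 2),
                     levels = c("Uninfected", "Infected")),
  y         = c(4.9, 5.1, 5.9, 6.1, 6.9, 7.1, 9.9, 10.1)
)

fit <- lm(y ~ TimePoint + Status + TimePoint:Status, data = d)
b   <- coef(fit)

# Infected vs Uninfected at the reference time point (d11): the Status coefficient
effect_d11 <- b[["StatusInfected"]]

# Infected vs Uninfected at d33: Status coefficient plus the interaction term
effect_d33 <- b[["StatusInfected"]] + b[["TimePointd33:StatusInfected"]]

c(d11 = effect_d11, d33 = effect_d33)
```

The d33 effect matches the difference between the two d33 group means (10 - 7 = 3), which is what summing the two coefficients in the `contrast` list achieves.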

DESeq2 Results Table