crukci-cluster-transition

CRUKCI Cluster Transition - Hands-on training

View the Project on GitHub

Session 5: Advanced analysis and downstream analysis in R

Learning Objectives

Analysis of RNA-seq data

Counting

Once our reads have been aligned against the genome, we need to summarise the information across genes or exons. In the BAM file, there is a chromosomal location for every read that mapped uniquely. We can determine if the region each read is aligned to corresponds to a particular gene or exon and then summarise across the entire BAM file to get total read counts for each gene or exon.

We will use featureCounts programme from the subRead package to do the counting. In addition to the BAM files, we also need to provide featureCounts with an annotation file. Usually this will be a GTF/GFF file corresponding to the genome assembly used (a description of the GTF format can be found at UCSC website). featureCounts can also use a simpler annotation format called SAF, this is particularly useful for defining custom/novel features that you wish to count against.

For RNAseq we most commonly wish to count reads aligning to exons, and then to summarise at the gene level. Lets have a quick look at the top of a GTF file so we can see what data it contains and what feature type and attribute type mean:

head /scratchb/bioinformatics/reference_data/reference_genomes/homo_sapiens/GRCh38/annotation/Homo_sapiens.GRCh38.84.gtf

Running featureCounts generates two output file SLX-14572.FourSamples.featureCounts SLX-14572.FourSamples.featureCounts.summary.

Reading the count data in R

Data manipulations

Convert Gene names into row names, remove columns, and rename columns

rownames(countdata) <- countdata$Geneid
countdata <- countdata[,-(1:6)]
samplesheet$CountTableNames <- gsub("-", ".", samplesheet$BamFile)
colnames(countdata) <- samplesheet$SampleName[match(colnames(countdata), samplesheet$CountTableNames)]

Filtering out low expressed genes

Genes with very low counts across all libraries provide little evidence for differential expression and they interfere with some of the statistical approximations that are used later in the pipeline.

Converting counts to DGEList object

Create the design matrix

# Create the groups
groups <- samplesheet$Group
# Specify a design matrix
design <- model.matrix(~groups)

Differential expression with edgeR

Output table

Volcano plot

deg <- which(results$Padj<=0.05)
plot(results$logFC, -log10(results$Padj), pch=21, col="black", bg="black", xlab="log2(FoldChange)", ylab="-log10(adjusted p-value)")
points(results$logFC[deg], -log10(results$Padj[deg]), pch=21, col="black", bg="red")

Add annotations

There are a number of ways to add annotation, but we will demonstrate how to do this using the org.Mm.eg.db package. This package is one of several organism-level packages which are re-built every 6 months. These packages are listed on the annotation section of the Bioconductor, and are installed in the same way as regular Bioconductor packages.

Writing results file to disk

write.csv(results, "data/SLX-12345.DEGresultsTable.csv", quote=FALSE, row.names=FALSE)

Analysis of ChIP-seq data

Peak Calling using MACS2

Once our reads have been aligned against the genome, we need to identify regions of enrichment (peaks). There are a variety of tools available for calling peaks: SICER, MACS2, EPIC, Enriched Domain Detector (EDD), BayesPeak etc. Here we will use MACS2.

Different types of ChIP data have differently shaped peaks. Generally, TF peaks are narrow, whilst epignomic data, such as histone marks, can be narrow, broad, or a mixture of both. It is important to use a peak caller that is appropriate for the peak type being sought. MACS2 has both narrow and broad modes and so is widely applicable.

When calling peaks for a sample it is also necessary to provide an appropriate input (control) sample. This is a negative control that will allow the peak caller to estimate the background signal. Some peak callers will work without an input sample but this is not recommended.

Peak Annotation

Differential Binding

Reference materials