CRUKCI Cluster Transition - Hands-on training
Once our reads have been aligned against the genome, we need to summarise the information across genes or exons. In the BAM file, there is a chromosomal location for every read that mapped uniquely. We can determine if the region each read is aligned to corresponds to a particular gene or exon and then summarise across the entire BAM file to get total read counts for each gene or exon.
We will use featureCounts
programme from the subRead package to do the counting. In addition to the BAM files, we also need to provide featureCounts
with an annotation file. Usually this will be a GTF/GFF file corresponding to the genome assembly used (a description of the GTF format can be found at UCSC website). featureCounts
can also use a simpler annotation format called SAF, this is particularly useful for defining custom/novel features that you wish to count against.
For RNAseq we most commonly wish to count reads aligning to exons, and then to summarise at the gene level. Lets have a quick look at the top of a GTF file so we can see what data it contains and what feature type and attribute type mean:
head /scratchb/bioinformatics/reference_data/reference_genomes/homo_sapiens/GRCh38/annotation/Homo_sapiens.GRCh38.84.gtf
featureCounts
from subRead package documentation./home/bioinformatics/software/subread/subread-1.5.3/bin/featureCounts
ln -s /home/bioinformatics/software/subread/subread-1.5.3/bin/featureCounts ~/bin/.
featureCounts \
--primary \
-C \
-t exon \
-g gene_id \
-a Homo_sapiens.GRCh38.84.gtf \
-o SLX-14572.FourSamples.featureCounts \
SLX-14572.i706_i517.HHMJ3BGX3.s_1.r_1.small.fq.bam SLX-14572.i706_i502.HHMJ3BGX3.s_1.r_1.small.fq.bam SLX-14572.i706_i503.HHMJ3BGX3.s_1.r_1.small.fq.bam SLX-14572.i703_i517.HHMJ3BGX3.s_1.r_1.small.fq.bam
--primary
only count primary alignment-C
do not count reads where the pairs are mapped to different chromosomes-t exon
the feature type to count reads against, in this case exons-g gene_id
the attribute type to summarise counts by, in this case the gene IDRunning featureCounts generates two output file SLX-14572.FourSamples.featureCounts
SLX-14572.FourSamples.featureCounts.summary
.
Unzip crukci-cluster-transition-master.zip
crukci-cluster-transition-master
, click OpenLearn R with Half-day introduction to the R language crash course
source("https://bioconductor.org/biocLite.R")
biocLite("edgeR")
biocLite("org.Mm.eg.db")
samplesheet <- read.csv("data/samplesheet.SLX-12345.csv")
View(samplesheet)
countdata <- read.delim("data/SLX-12345.AllSamples.featureCounts", skip=1)
View(countdata)
Convert Gene names into row names, remove columns, and rename columns
rownames(countdata) <- countdata$Geneid
countdata <- countdata[,-(1:6)]
samplesheet$CountTableNames <- gsub("-", ".", samplesheet$BamFile)
colnames(countdata) <- samplesheet$SampleName[match(colnames(countdata), samplesheet$CountTableNames)]
Genes with very low counts across all libraries provide little evidence for differential expression and they interfere with some of the statistical approximations that are used later in the pipeline.
cpm
function from edgeR packagesource("https://bioconductor.org/biocLite.R")
biocLite("edgeR")
library(edgeR)
cpmdata <- cpm(countdata)
head(cpmdata)
thresh <- cpmdata > 0.5
keep <- rowSums(thresh) >= 2
counts.keep <- countdata[keep,]
dgeObj <- DGEList(counts.keep)
dgeObj <- calcNormFactors(dgeObj)
# Create the groups
groups <- samplesheet$Group
# Specify a design matrix
design <- model.matrix(~groups)
dgeObj <- estimateCommonDisp(dgeObj)
dgeObj <- estimateGLMTrendedDisp(dgeObj)
dgeObj <- estimateTagwiseDisp(dgeObj)
plotBCV(dgeObj)
fit <- glmFit(dgeObj, design)
names(fit)
head(coef(fit))
qlf <- glmQLFTest(fit, coef=2)
de <- decideTestsDGE(qlf)
summary(de)
detags <- rownames(dgeObj)[as.logical(de)]
plotSmear(qlf, de.tags=detags)
qlf
object
results <- qlf$table
results$ENSEMBL <- rownames(results)
View(results)
results$Padj <- p.adjust(results$PValue, method="BH")
results <- results[order(results$Padj),]
View(results)
deg <- which(results$Padj<=0.05)
plot(results$logFC, -log10(results$Padj), pch=21, col="black", bg="black", xlab="log2(FoldChange)", ylab="-log10(adjusted p-value)")
points(results$logFC[deg], -log10(results$Padj[deg]), pch=21, col="black", bg="red")
There are a number of ways to add annotation, but we will demonstrate how to do this using the org.Mm.eg.db
package. This package is one of several organism-level packages which are re-built every 6 months. These packages are listed on the annotation section of the Bioconductor, and are installed in the same way as regular Bioconductor packages.
source("https://bioconductor.org/biocLite.R")
biocLite("org.Mm.eg.db")
library(org.Mm.eg.db)
columns(org.Mm.eg.db)
ann <- select(org.Hs.eg.db,keytype="ENSEMBL", keys=rownames(results), columns=c("ENSEMBL","SYMBOL","GENENAME"))
results <- merge(x=results, y=ann, by= "ENSEMBL", all.x=TRUE)
View(results)
write.csv(results, "data/SLX-12345.DEGresultsTable.csv", quote=FALSE, row.names=FALSE)
Once our reads have been aligned against the genome, we need to identify regions of enrichment (peaks). There are a variety of tools available for calling peaks: SICER, MACS2, EPIC, Enriched Domain Detector (EDD), BayesPeak etc. Here we will use MACS2.
Different types of ChIP data have differently shaped peaks. Generally, TF peaks are narrow, whilst epignomic data, such as histone marks, can be narrow, broad, or a mixture of both. It is important to use a peak caller that is appropriate for the peak type being sought. MACS2 has both narrow and broad modes and so is widely applicable.
When calling peaks for a sample it is also necessary to provide an appropriate input (control) sample. This is a negative control that will allow the peak caller to estimate the background signal. Some peak callers will work without an input sample but this is not recommended.
ln -s /home/bioinformatics/software/... ~/bin/.
mkdir macs
macs2 callpeak \
--treatment **ChIPsample.bam** \
--control **InputSample.bam** \
--gsize "2671858539" \
--outdir "macs" \
--name "ChIPSampleName" `
--gsize
argument - this is the “effective genome size” - this parameter is dependent on the read length and the actual genome size,
please see the MACS documentation for further details.