1 GenomicRanges

1.1 Turn MACS peaks into GRanges objects

bash


pwd

cd "/home/participant/Course_Materials/ChIPSeq/Materials"

#mkdir Utils
cd Utils

# show current directory
getwd()

# set new working directory in R 
setwd("/home/participant/Course_Materials/ChIPSeq/Materials/Utils")

# shows directory contents
dir()

Install and load packages needed for the tutorial. Uncomment install commands if your computer doesn’t have the packages.

#load libraries

library("GenomicRanges")
library("TxDb.Hsapiens.UCSC.hg38.knownGene")
library("EnsDb.Hsapiens.v86")
library("org.Hs.eg.db")
library("ChIPseeker")
library("ChIPpeakAnno")
library("chipenrich")
library ("rtracklayer")

Read the peak files into GRanges objects.

A complete collection of peak files are in the “../Course_Materials/ChIPSeq/Preprocessed”" folder. We will read the (Excel) xls file for one of the TP73 replicate into a R dataframe.

library("GenomicRanges")


peakfile1 <- "~/Course_Materials/ChIPSeq/Preprocessed/Peaks/TAp73alpha_r1.fastq_trimmed.fastq_sorted_peaks.xls"
peakfile2 <- "~/Course_Materials/ChIPSeq/Preprocessed/Peaks/TAp73alpha_r2.fastq_trimmed.fastq_sorted_peaks.xls"


# Process the two peakfiles or look at peak calling practical from day3
# or readPeakFile from ChIPseeker

peaks_DF1 <- read.delim2(peakfile1, comment.char="#")
peaks_DF2 <- read.delim2(peakfile2, comment.char="#")

# look at the first 3 rows

peaks_DF1[1:3,]
peaks_DF2[1:3,]


library(GenomicRanges)
peaks_GR1 <- GRanges(
  seqnames=peaks_DF1[,"chr"],
  IRanges(peaks_DF1[,"start"],
  peaks_DF1[,"end"]
  )
)

# GRanges object
peaks_GR1 

# dataframe
df1 <- data.frame(seqnames=seqnames(peaks_GR1),
  starts=start(peaks_GR1)-1,
  ends=end(peaks_GR1),
  names=c(rep(".", length(peaks_GR1))),
  scores=c(rep(".", length(peaks_GR1))),
  strands=strand(peaks_GR1))



# second file

library(GenomicRanges)
peaks_GR2 <- GRanges(
  seqnames=peaks_DF2[,"chr"],
  IRanges(peaks_DF2[,"start"],
  peaks_DF2[,"end"]
  )
)


df2 <- data.frame(seqnames=seqnames(peaks_GR2),
  starts=start(peaks_GR2)-1,
  ends=end(peaks_GR2),
  names=c(rep(".", length(peaks_GR2))),
  scores=c(rep(".", length(peaks_GR2))),
  strands=strand(peaks_GR2))

# change the name according to sample and create bed files

write.table(df1, file="TAp73alpha_r1.fastq_trimmed.fastq_sorted_peaks.bed", quote=F, sep="\t", row.names=F, col.names=F)
write.table(df2, file="TAp73alpha_r2.fastq_trimmed.fastq_sorted_peaks.bed", quote=F, sep="\t", row.names=F, col.names=F)

2 Bedtools

These commands need to be run in a bash shell. Create any bed files needed using the previous code chunk.

cd into working directory in bash shell

bash


pwd
cd /home/participant/Course_Materials/ChIPSeq/Materials/Utils

Bedtools is a command-line tool. To bring up the help, just type

bash


bedtools

Use bedtools with the appropriate subcommand.

Examples:

bash


bedtools intersect -h
bedtools merge -h
bedtools subtract -h

Version:

bash


bedtools --version

Main functionality

2.1 Intersections

Default behaviour of intersect is to reports the intervals that represent overlaps between two files. To demonstrate, let’s identify all of the chip peaks that overlap with CpG islands.

UCSC Table Browser

bash


# Download the CpG island track from UCSC table browser and name it hg38_CpG_Islands.bed

bedtools intersect -a TAp73alpha_r1.fastq_trimmed.fastq_sorted_peaks.bed -b /home/participant/Course_Materials/ChIPSeq/Preprocessed/Peaks/hg38_CpG_Islands.bed | \
head -5

#write to file

bedtools intersect -a TAp73alpha_r1.fastq_trimmed.fastq_sorted_peaks.bed -b /home/participant/Course_Materials/ChIPSeq/Preprocessed/Peaks/hg38_CpG_Islands.bed > \
TAp73alpha_r1.fastq_trimmed.fastq_sorted_peaks_CPG.bed

head -n 5 TAp73alpha_r1.fastq_trimmed.fastq_sorted_peaks_CPG.bed

2.2 Reporting the original feature in each file

The -wa (write A) and -wb (write B) options allow one to see the original records from the A and B files that overlapped. As such, instead of not only showing you where the intersections occurred, it shows you what intersected.

bedtools intersect -a TAp73alpha_r1.fastq_trimmed.fastq_sorted_peaks.bed -b /home/participant/Course_Materials/ChIPSeq/Preprocessed/Peaks/hg38_CpG_Islands.bed -wa -wb | \
head -5


bedtools intersect -a TAp73alpha_r1.fastq_trimmed.fastq_sorted_peaks.bed -b /home/participant/Course_Materials/ChIPSeq/Preprocessed/Peaks/hg38_CpG_Islands.bed -wa -wb > \
TAp73alpha_r1_CPG_all.bed

2.3 Count of overlaped nucleotides.

bedtools intersect -wo -a TAp73alpha_r1.fastq_trimmed.fastq_sorted_peaks.bed -b  /home/participant/Course_Materials/ChIPSeq/Preprocessed/Peaks/hg38_CpG_Islands.bed | head -10

bedtools intersect -wo -a TAp73alpha_r1.fastq_trimmed.fastq_sorted_peaks.bed -b  /home/participant/Course_Materials/ChIPSeq/Preprocessed/Peaks/hg38_CpG_Islands.bed > TAp73alpha_r1_CPG_overlap_nt.bed

2.4 Count of overlapping features.

bedtools intersect -a TAp73alpha_r1.fastq_trimmed.fastq_sorted_peaks.bed -b /home/participant/Course_Materials/ChIPSeq/Preprocessed/Peaks/hg38_CpG_Islands.bed -c | head

bedtools intersect -a TAp73alpha_r1.fastq_trimmed.fastq_sorted_peaks.bed -b /home/participant/Course_Materials/ChIPSeq/Preprocessed/Peaks/hg38_CpG_Islands.bed -c > \
TAp73alpha_r1_CPG_overlap_ft.bed

2.5 Find features that DO NOT overlap

bash


bedtools intersect -a TAp73alpha_r1.fastq_trimmed.fastq_sorted_peaks.bed -b /home/participant/Course_Materials/ChIPSeq/Preprocessed/Peaks/hg38_CpG_Islands.bed -v | head -10


bedtools intersect -a TAp73alpha_r1.fastq_trimmed.fastq_sorted_peaks.bed -b /home/participant/Course_Materials/ChIPSeq/Preprocessed/Peaks/hg38_CpG_Islands.bed -v > \
TAp73alpha_r1_CPG_overlap_notoverlap.bed

2.6 Find the complement

bash

pwd

# various sort options
bedtools sort -i /home/participant/Course_Materials/ChIPSeq/Preprocessed/Peaks/hg38_CpG_Islands.bed -g /home/participant/Course_Materials/ChIPSeq/Preprocessed/Peaks/hg38.chrom.sizes > \ /home/participant/Course_Materials/ChIPSeq/Materials/Utils/hg38_CpG_Islands_sorted3.bed

# unix sort by chromosome and then by start position
# sort -k 1,1 -k2,2n /home/participant/Course_Materials/ChIPSeq/Preprocessed/Peaks/hg38_CpG_Islands.bed >  /home/participant/Course_Materials/ChIPSeq/Materials/Utils/hg38_CpG_Islands_sorted2.bed

# complement
bedtools complement -i /home/participant/Course_Materials/ChIPSeq/Materials/Utils/hg38_CpG_Islands_sorted3.bed \
-g /home/participant/Course_Materials/ChIPSeq/Preprocessed/Peaks/hg38.chrom.sizes > non_cpg_regions.bed

3 ChIPseeker

3.0.1 Generate tag matrix around putative promoter region

library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(ChIPseeker)
library(clusterProfiler)


txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
promoter <- getPromoters(TxDb=txdb, upstream=3000, downstream=3000)

# use GRanges object generated in the previous section

tagMatrix <- getTagMatrix(peaks_GR1, windows=promoter)

3.0.2 Heatmap of ChIP binding to TSS regions

tagHeatmap(tagMatrix, xlim=c(-3000, 3000), color="red")

3.0.3 plotAnnoBar(peakAnno)

Average Profile of ChIP peaks binding to TSS region

plotAvgProf(tagMatrix, xlim=c(-3000, 3000), xlab="Genomic Region (5'->3')", ylab = "Read Count Frequency")

# or use this single function (without generating a tag matrix)

plotAvgProf2(peaks_GR1, TxDb=txdb, upstream=3000, downstream=3000, 
             xlab="Genomic Region (5'->3')", ylab = "Read Count Frequency")

# confidance intervals by resampling
plotAvgProf(tagMatrix, xlim=c(-3000, 3000), conf = 0.95, resample = 1000)

3.0.4 Peak annotation by genomic features

peakAnno <- annotatePeak(peaks_GR1, tssRegion=c(-3000, 3000), TxDb=txdb, annoDb="org.Hs.eg.db")

3.0.5 Plot genomic feature profiles of peaks

#pie chart
plotAnnoPie(peakAnno)

#bar plot
plotAnnoBar(peakAnno)

3.0.6 Venn plot

upsetplot(peakAnno, vennpie=TRUE)

3.0.7 Visualize distribution of TF-binding loci relative to TSS

plotDistToTSS(peakAnno, title="Distribution of transcription factor-binding loci\nrelative to TSS")

4 Annotating genes using ChIPpeakAnno

library("EnsDb.Hsapiens.v86")
library("ChIPpeakAnno")

# Prepare annotation data with toGRanges

annoData <- toGRanges(EnsDb.Hsapiens.v86)
annoData[1:2]

# Annotate the peaks with annotatePeakInBatch

## keep the seqnames in the same style
seqlevelsStyle(peaks_GR1) <- seqlevelsStyle(annoData)

## do annotation by nearest TSS
anno <- annotatePeakInBatch(peaks_GR1, AnnotationData=annoData)
anno[1:2]

4.0.1 A pie chart can be used to demonstrate the overlap features of the peaks.

pie1(table(anno$insideFeature))

4.0.2 Additional annotation

library(org.Hs.eg.db)
anno <- addGeneIDs(anno, orgAnn="org.Hs.eg.db", 
                   feature_id_type="ensembl_gene_id",
                   IDs2Add=c("symbol"))
head(anno)

df <- data.frame(seqnames=seqnames(anno),
  starts=start(anno)-1,
  ends=end(anno),
  names=c(rep(".", length(anno))),
  scores=c(rep(".", length(anno))),
  strands=strand(anno))

write.table(df, file="TP73_anno.bed", quote=F, sep="\t", row.names=F, col.names=F)
str("TP73_anno.bed")

5 chipenrich

Gene Set Enrichment analysis of genmic regions

library("chipenrich")

# This practical gives a brief overview of the chipenrich package
# A locus definition is a way of defining a gene regulatory region, and enables us to associate peaks with genes.
# Nearest TSS, Nearest gene, exon, intron, proximal promoter etc

head(supported_locusdefs())

# Can define custome locus definitions (in the format of: chr, start, end, entrez_id)

# Gene sets

head(supported_genesets())

# Can define custom gene sets (gene_set_id, entrez_gene_id)

# Enrichment testing

## broadenrich() -use  with broad reagions that intersect multiple features

data(peaks_E2F4, package = 'chipenrich.data')
data(peaks_H3K4me3_GM12878, package = 'chipenrich.data')

gs_path = system.file('extdata','vignette_genesets.txt', package='chipenrich')
results = broadenrich(peaks = peaks_H3K4me3_GM12878, genome = 'hg19', genesets = gs_path,
    locusdef = "nearest_tss", qc_plots = FALSE, out_name = NULL, n_cores=1)
results.be = results$results
print(results.be[1:5,1:5])

## chipenrich()- use with 1-10k narrow regions (ex: TF ChIP-seq)

# Without mappability
gs_path = system.file('extdata','vignette_genesets.txt', package='chipenrich')
results = chipenrich(peaks = peaks_E2F4, genome = 'hg19', genesets = gs_path,
    locusdef = "nearest_tss", qc_plots = FALSE, out_name = NULL, n_cores = 1)
results.ce = results$results
print(results.ce[1:5,1:5])

# With mappability
gs_path = system.file('extdata','vignette_genesets.txt', package='chipenrich')
results = chipenrich(peaks = peaks_E2F4, genome = 'hg19', genesets = gs_path,
    locusdef = "nearest_tss", mappability=24, qc_plots = FALSE,
    out_name = NULL,n_cores=1)
results.cem = results$results
print(results.cem[1:5,1:5])

## polyenrich() -narrow regions mappiing to > 10k features

gs_path = system.file('extdata','vignette_genesets.txt', package='chipenrich')
results = polyenrich(peaks = peaks_E2F4, genome = 'hg19', genesets = gs_path,  method = 'polyenrich',
    locusdef = "nearest_tss", qc_plots = FALSE, out_name = NULL, n_cores = 1)
results.pe = results$results
print(results.pe[1:5,1:5])

# If many gene loci have multiple peaks assigned to them, polyenrich is likely an appropriate method. If there are a low number of peaks per gene, then chipenrich() may be the appropriate method

plot_polyenrich_spline(peaks = peaks_E2F4, locusdef = 'nearest_tss', genome = 'hg19')

## hybridenrich() - intermediate between chipenrich and polyenrich

gs_path = system.file('extdata','vignette_genesets.txt', package='chipenrich')
results = hybridenrich(peaks = peaks_E2F4, genome = 'hg19', genesets = gs_path,
    locusdef = "nearest_tss", qc_plots = F, out_name = NULL, n_cores = 1)
results.hybrid = results$results
print(results.hybrid[1:5,1:5])

6 deepTools2

deepTools2 support a wide range of functions, such as various quality controls, different normalization schemes and genome-wide visualizations. A brief example is given below:

bash


heatmapFolder="/home/participant/Course_Materials/ChIPSeq/Preprocessed/deeptools_heatmap/"
bigWigFolder="/home/participant/Course_Materials/ChIPSeq/Preprocessed/differential_binding/"
preprocessedPeakFolder="/home/participant/Course_Materials/ChIPSeq/Preprocessed/Peaks/"

cat ${preprocessedPeakFolder}*_summits.bed > ${heatmapFolder}allPeakSummits.bed

computeMatrix reference-point -S ${bigWigFolder}TAp73beta_versus_tp53_-s1-rep0.bw \
            ${bigWigFolder}TAp73beta_versus_tp53_-s1-rep1.bw \
            ${bigWigFolder}TAp73beta_versus_tp53_-s2-rep0.bw \
            ${bigWigFolder}TAp73beta_versus_tp53_-s2-rep1.bw \
        -R ${heatmapFolder}allPeakSummits.bed \
        --referencePoint "center" \
        -a 3000 -b 3000 \
        -p max/2 \
        -out ${heatmapFolder}TAp73beta_versus_tp53.mat.gz

plotHeatmap -m ${heatmapFolder}TAp73beta_versus_tp53.mat.gz \
            -out ${heatmapFolder}TAp73beta_versus_tp53.png \
            --plotFileFormat "png" \
            --colorMap coolwarm \
            --kmeans 1 \
            --whatToShow 'plot, heatmap and colorbar' \
            --startLabel "peak start" --endLabel "peak end" \
            --regionsLabel 1 \
            --refPointLabel "Peak center" \
            --outFileSortedRegions ${heatmapFolder}TAp73beta_versus_tp53.bed \
            --outFileNameMatrix ${heatmapFolder}TAp73beta_versus_tp53.tab \
            --samplesLabel "TAp73beta r1" "TAp73beta r2" "tp53 r1" "tp53 r2"

sessionInfo()

## R version 3.5.0 (2018-04-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Fedora 23 (Workstation Edition)
## 
## Matrix products: default
## BLAS: /home/SS861/src/R_3.5.0/lib64/R/lib/libRblas.so
## LAPACK: /home/SS861/src/R_3.5.0/lib64/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
##  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
##  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] BiocStyle_2.8.2
## 
## loaded via a namespace (and not attached):
##  [1] compiler_3.5.0  backports_1.1.2 bookdown_0.7    magrittr_1.5   
##  [5] rprojroot_1.3-2 tools_3.5.0     htmltools_0.3.6 yaml_2.1.19    
##  [9] Rcpp_0.12.17    stringi_1.2.4   rmarkdown_1.10  knitr_1.20     
## [13] xfun_0.3        stringr_1.3.1   digest_0.6.15   evaluate_0.11

Practical 9: Useful software utilities for computational genomics

July 2018

Contents

1 GenomicRanges

1.1 Turn MACS peaks into GRanges objects

2 Bedtools

2.1 Intersections

2.2 Reporting the original feature in each file

2.3 Count of overlaped nucleotides.

2.4 Count of overlapping features.

2.5 Find features that DO NOT overlap

2.6 Find the complement

3 ChIPseeker

3.0.1 Generate tag matrix around putative promoter region

3.0.2 Heatmap of ChIP binding to TSS regions

3.0.3 plotAnnoBar(peakAnno)

3.0.4 Peak annotation by genomic features

3.0.5 Plot genomic feature profiles of peaks

3.0.6 Venn plot

3.0.7 Visualize distribution of TF-binding loci relative to TSS

4 Annotating genes using ChIPpeakAnno

4.0.1 A pie chart can be used to demonstrate the overlap features of the peaks.

4.0.2 Additional annotation

5 chipenrich

6 deepTools2