The spp R package can be used for strand cross-correlation (not run in this practical). Two cross-correlation peaks are usually observed in a ChIP experiment, one corresponding to the read length (“phantom” peak) and one to the average fragment length of the library.
see Landt et al., 2012, Genome Res. “ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia” for more details.
The absolute and relative height of the two peaks are useful determinants of the success of a ChIP-seq experiment. A high-quality immunoprecipitation is characterized by a ChIP peak that is much higher than the “phantom” peak, while often very small or no such peak is seen in failed experiments.
Some parameters to consider: * Adjust the sequence tags to better represent the original DNA fragment (by ‘shifting tags in the 3 prime direction’ or by ‘extending tags’ to the estimated length of the original fragment length) * Background model used * Use of strand dependent bimodality * fragment length, read length, replicates, duplication, total read count
bash
# Filter duplicates
# Write down the count of reads for each sample after duplicate removal
macs2 filterdup -i "/home/participant/Course_Materials/ChIPSeq/Preprocessed/Alignment_BWA/tp53_r2.fastq_trimmed.fastq_sorted.bam" -f BAM -g hs --keep-dup=1 --verbose=3 -o "/home/participant/Course_Materials/ChIPSeq/Materials/Peaks/tp53_r2.fastq_trimmed.fastq_sorted_filterdup.bed"
macs2 filterdup -i "/home/participant/Course_Materials/ChIPSeq/Preprocessed/Alignment_BWA/TAp73beta_r2.fastq_trimmed.fastq_sorted.bam" -f BAM -g hs --keep-dup=1 --verbose=3 -o "/home/participant/Course_Materials/ChIPSeq/Materials/Peaks/TAp73beta_r2.fastq_trimmed.fastq_sorted_filterdup.bed"
macs2 filterdup -i "/home/participant/Course_Materials/ChIPSeq/Preprocessed/Alignment_BWA/input.fastq_trimmed.fastq_sorted.bam" -f BAM -g hs --keep-dup=1 --verbose=3 -o "/home/participant/Course_Materials/ChIPSeq/Materials/Peaks/input.fastq_trimmed.fastq_sorted_filterdup.bed"
bash
# predict fragment length
# write down the fragment length for each sample
macs2 predictd -i /home/participant/Course_Materials/ChIPSeq/Materials/Peaks/tp53_r2.fastq_trimmed.fastq_sorted_filterdup.bed -g hs -m 5 20
macs2 predictd -i /home/participant/Course_Materials/ChIPSeq/Materials/Peaks/TAp73beta_r2.fastq_trimmed.fastq_sorted_filterdup.bed -g hs -m 5 20
macs2 predictd -i /home/participant/Course_Materials/ChIPSeq/Materials/Peaks/input.fastq_trimmed.fastq_sorted_filterdup.bed -g hs -m 5 20
bash
# MACS2 callpeak options
macs2 callpeak -h
# -t sample -c control -g effective genome size = needs to be empirically computed using a hg38.fa genome file) for hg38 but for this practical use 'hs' which is = 2.6e9 which is the value for hg19
# -f filetype --bdg generate bedgraph
bash
macs2 callpeak -t "/home/participant/Course_Materials/ChIPSeq/Preprocessed/Alignment_BWA/tp53_r2.fastq_trimmed.fastq_sorted.bam" -c "/home/participant/Course_Materials/ChIPSeq/Preprocessed/Alignment_BWA/input.fastq_trimmed.fastq_sorted.bam" -g hs -n tp53_r2.fastq_trimmed.fastq_sorted_standard -f BAM --keep-dup auto --bdg --outdir "/home/participant/Course_Materials/ChIPSeq/Materials/Peaks/"
macs2 callpeak -t "/home/participant/Course_Materials/ChIPSeq/Preprocessed/Alignment_BWA/TAp73beta_r2.fastq_trimmed.fastq_sorted.bam" -c "/home/participant/Course_Materials/ChIPSeq/Preprocessed/Alignment_BWA/input.fastq_trimmed.fastq_sorted.bam" -g hs -n TAp73beta_r2.fastq_trimmed.fastq_sorted_standard -f BAM --keep-dup auto --bdg --outdir "/home/participant/Course_Materials/ChIPSeq/Materials/Peaks/"
–keep-dup How should MACS handle duplicate tags at the same location? The default is to keep just one. The auto option (recommended for high-coverage samples) has MACS calculate the expected maximum tags at the same location (based on the binomal distribution).
The ChIPQC or SPP Bioconductor packages (or the MACS2 predictd) can be used to determine the fragment size used as input for (–extsize)
bash
# use d calculated above for --extsize
# mkdir "/home/participant/Course_Materials/ChIPSeq/Materials/Peaks/nomodel/"
macs2 callpeak -t "/home/participant/Course_Materials/ChIPSeq/Preprocessed/Alignment_BWA/tp53_r2.fastq_trimmed.fastq_sorted.bam" -c "/home/participant/Course_Materials/ChIPSeq/Preprocessed/Alignment_BWA/input.fastq_trimmed.fastq_sorted.bam" -g hs -n tp53_r2.fastq_trimmed.fastq_sorted_extended -f BAM --keep-dup auto --bdg --nomodel --extsize 113 --outdir "/home/participant/Course_Materials/ChIPSeq/Materials/Peaks/nomodel/"
macs2 callpeak -t "/home/participant/Course_Materials/ChIPSeq/Preprocessed/Alignment_BWA/TAp73beta_r2.fastq_trimmed.fastq_sorted.bam" -c "/home/participant/Course_Materials/ChIPSeq/Preprocessed/Alignment_BWA/input.fastq_trimmed.fastq_sorted.bam" -g hs -n TAp73beta_r2.fastq_trimmed.fastq_sorted_extended -f BAM --keep-dup auto --bdg --nomodel --extsize 270 --outdir "/home/participant/Course_Materials/ChIPSeq/Materials/Peaks/nomodel/"
MACS2 peak calls are written to the user specified output directory. the files have an “_peaks.xls" extension.
In addition this file contains information on samples and parameters used.
R
#set working directory to where you saved peak files
#setwd("/home/participant/Course_Materials/ChIPSeq/Materials/Peaks/")
peakfile <-"/home/participant/Course_Materials/ChIPSeq/Materials/Peaks/tp53_r2.fastq_trimmed.fastq_sorted_standard_peaks.xls"
peaks_DF <- read.delim(peakfile, header=FALSE)
peaks_DF[1:25,]
importing peaks into R
R
peakfile <- "/home/participant/Course_Materials/ChIPSeq/Materials/Peaks/tp53_r2.fastq_trimmed.fastq_sorted_standard_peaks.xls"
peaks_DF <- read.delim2(peakfile, comment.char="#")
peaks_DF[1:3,]
Create GRanges object made of chromosome names and intervals stored as IRanges.
R
#method1
library(GenomicRanges)
peaks_GR <- GRanges(
seqnames=peaks_DF[,"chr"],
IRanges(peaks_DF[,"start"],
peaks_DF[,"end"]
)
)
peaks_GR
#method2
library("rtracklayer")
file_narrowPeak = "/home/participant/Course_Materials/ChIPSeq/Materials/Peaks/tp53_r2.fastq_trimmed.fastq_sorted_standard_peaks.narrowPeak"
extraCols_narrowPeak <- c(signalValue = "numeric", pValue = "numeric",
qValue = "numeric", peak = "integer")
gr_narrowPeak <- import(file_narrowPeak, format = "BED",
extraCols = extraCols_narrowPeak)
Explore the GRanges object
R
seqnames(peaks_GR)
R
ranges(peaks_GR)
Compare peak sets
R
peakfile1 <- "/home/participant/Course_Materials/ChIPSeq/Materials/Peaks/nomodel/tp53_r2.fastq_trimmed.fastq_sorted_extended_peaks.xls"
peaks_DF1 <- read.delim2(peakfile1, comment.char="#")
# load into GenomicRanges
peaks_GR1 <- GRanges (
seqnames=peaks_DF1[,"chr"],
IRanges(peaks_DF1[,"start"],
peaks_DF1[,"end"]
)
)
peakfile2 <- "/home/participant/Course_Materials/ChIPSeq/Materials/Peaks/nomodel/TAp73beta_r2.fastq_trimmed.fastq_sorted_extended_peaks.xls"
peaks_DF2 <- read.delim(peakfile2, comment.char="#")
# load into GenomicRanges
peaks_GR2 <- GRanges(
seqnames=peaks_DF2[,"chr"],
IRanges(peaks_DF2[,"start"],
peaks_DF2[,"end"]
)
)
The number of peaks in first replicate overlapping those in second is different from number of second replicate peaks overlapping first. This is because 2 peaks from one replicate may overlap 1 peak in the other replicate (Transitive
property of Venn operations between genomic ranges).
R
OnlyfirstPeakSet <- peaks_GR1[!peaks_GR1 %over% peaks_GR2]
OnlySecondPeakSet <- peaks_GR2[!peaks_GR2 %over% peaks_GR1]
firstANDsecondPeakSets <- peaks_GR1[peaks_GR1 %over% peaks_GR2]
secondANDfirstPeakSets <- peaks_GR2[peaks_GR2 %over% peaks_GR1]
length(OnlyfirstPeakSet)
length(OnlySecondPeakSet)
length (firstANDsecondPeakSets)
length (secondANDfirstPeakSets)
Finally convert Granges to bed
R
df <- data.frame(seqnames=seqnames(peaks_GR1),
starts=start(peaks_GR1)-1,
ends=end(peaks_GR1))
write.table(df, file="/home/participant/Course_Materials/ChIPSeq/Materials/Peaks/peaks_GR1_converted.bed", quote=F, sep="\t", row.names=F, col.names=F)