The spp R package can be used for strand cross-correlation (not run in this practical). Two cross-correlation peaks are usually observed in a ChIP experiment, one corresponding to the read length (“phantom” peak) and one to the average fragment length of the library.
see Landt et al., 2012, Genome Res. “ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia” for more details.
The absolute and relative height of the two peaks are useful determinants of the success of a ChIP-seq experiment. A high-quality immunoprecipitation is characterized by a ChIP peak that is much higher than the “phantom” peak, while often very small or no such peak is seen in failed experiments.
example: system(“your shell code here”)
Some parameters to consider: * Adjust the sequence tags to better represent the original DNA fragment (by ‘shifting tags in the 3 prime direction’ or by ‘extending tags’ to the estimated length of the original fragment length) * Background model used * Use of strand dependent bimodality * fragment length, read length, replicates, duplication, total read count
bash-script
# Filter duplicates
# Write down the count of reads for each sample after duplicate removal
cd "/home/participant/Course_Materials/ChIPSeq/Preprocessed/Alignment_BWA/"
#p53 sample
macs2 filterdup -i "/home/participant/Course_Materials/ChIPSeq/Preprocessed/Alignment_BWA/tp53_r2.fastq_trimmed.fastq_sorted.bam" -f BAM -g hs --keep-dup=1 --verbose=3 -o "/home/participant/Course_Materials/ChIPSeq/Materials/Results/Peaks/tp53_r2.fastq_trimmed.fastq_sorted_filterdup.bed"
#p73 sample
macs2 filterdup -i "/home/participant/Course_Materials/ChIPSeq/Preprocessed/Alignment_BWA/TAp73beta_r2.fastq_trimmed.fastq_sorted.bam" -f BAM -g hs --keep-dup=1 --verbose=3 -o "/home/participant/Course_Materials/ChIPSeq/Materials/Results/Peaks/TAp73beta_r2.fastq_trimmed.fastq_sorted_filterdup.bed"
#input
macs2 filterdup -i "/home/participant/Course_Materials/ChIPSeq/Preprocessed/Alignment_BWA/input.fastq_trimmed.fastq_sorted.bam" -f BAM -g hs --keep-dup=1 --verbose=3 -o "/home/participant/Course_Materials/ChIPSeq/Materials/Results/Peaks/input.fastq_trimmed.fastq_sorted_filterdup.bed"
Calculate using values from above step
bash-script
# Predict fragment length
# The location of sequenced read may only tell you the end of a DNA fragment that you are interested in (such as TFBS or DNA
# hypersensitive regions), and you have to estimate how long this DNA fragment is in order to recover the actual enrichment.
# the default behavior of macs2 callpeak uses MFOLD of 5, 50
# write down the fragment length predictd calculates for each sample
# p53
macs2 predictd -i "/home/participant/Course_Materials/ChIPSeq/Materials/Results/Peaks/tp53_r2.fastq_trimmed.fastq_sorted_filterdup.bed" -g hs -m 5 20
# p73
macs2 predictd -i "/home/participant/Course_Materials/ChIPSeq/Materials/Results/Peaks/TAp73beta_r2.fastq_trimmed.fastq_sorted_filterdup.bed" -g hs -m 5 20
# input
macs2 predictd -i "/home/participant/Course_Materials/ChIPSeq/Materials/Results/Peaks/input.fastq_trimmed.fastq_sorted_filterdup.bed" -g hs -m 5 20
bash-script
# generate a pileup track in BEDGRAPH format for the p53 ChIP sample
macs2 pileup -i /home/participant/Course_Materials/ChIPSeq/Materials/Results/Peaks/tp53_r2.fastq_trimmed.fastq_sorted_filterdup.bed -o /home/participant/Course_Materials/ChIPSeq/Materials/Results/Peaks/tp53_r2.fastq_trimmed.fastq_sorted_filterdup.pileup.bdg -f BED--extsize 113
bash-script
# MACS2 callpeak options
macs2 callpeak -h
# -t sample -c control -g effective genome size = needs to be empirically computed using a hg38.fa genome file) for hg38 but for this practical use 'hs' which is = 2.6e9 which is the value for hg19
# -f filetype --bdg generate bedgraph
bash-script
#call p53 peaks
macs2 callpeak -t "/home/participant/Course_Materials/ChIPSeq/Preprocessed/Alignment_BWA/tp53_r2.fastq_trimmed.fastq_sorted.bam" -c "/home/participant/Course_Materials/ChIPSeq/Preprocessed/Alignment_BWA/input.fastq_trimmed.fastq_sorted.bam" -g hs -n tp53_r2.fastq_trimmed.fastq_sorted_standard -f BAM --keep-dup auto --bdg --outdir "/home/participant/Course_Materials/ChIPSeq/Materials/Results/Peaks/"
#call p73 peaks
macs2 callpeak -t "/home/participant/Course_Materials/ChIPSeq/Preprocessed/Alignment_BWA/TAp73beta_r2.fastq_trimmed.fastq_sorted.bam" -c "/home/participant/Course_Materials/ChIPSeq/Preprocessed/Alignment_BWA/input.fastq_trimmed.fastq_sorted.bam" -g hs -n TAp73beta_r2.fastq_trimmed.fastq_sorted_standard -f BAM --keep-dup auto --bdg --outdir "/home/participant/Course_Materials/ChIPSeq/Materials/Peaks/"
–keep-dup How should MACS handle duplicate tags at the same location? The default is to keep just one. The auto option (recommended for high-coverage samples) has MACS calculate the expected maximum tags at the same location (based on the binomal distribution).
The ChIPQC or SPP Bioconductor packages (or the MACS2 predictd) can be used to determine the fragment size used as input for (–extsize)
bash-script
# use d calculated above for --extsize
# mkdir "/home/participant/Course_Materials/ChIPSeq/Materials/Results/Peaks/nomodel/"
#p53
macs2 callpeak -t "/home/participant/Course_Materials/ChIPSeq/Preprocessed/Alignment_BWA/tp53_r2.fastq_trimmed.fastq_sorted.bam" -c "/home/participant/Course_Materials/ChIPSeq/Preprocessed/Alignment_BWA/input.fastq_trimmed.fastq_sorted.bam" -g hs -n tp53_r2.fastq_trimmed.fastq_sorted_extended -f BAM --keep-dup auto --bdg --nomodel --extsize 113 --outdir "/home/participant/Course_Materials/ChIPSeq/Materials/Results/Peaks/nomodel/"
#p73
macs2 callpeak -t "/home/participant/Course_Materials/ChIPSeq/Preprocessed/Alignment_BWA/TAp73beta_r2.fastq_trimmed.fastq_sorted.bam" -c "/home/participant/Course_Materials/ChIPSeq/Preprocessed/Alignment_BWA/input.fastq_trimmed.fastq_sorted.bam" -g hs -n TAp73beta_r2.fastq_trimmed.fastq_sorted_extended -f BAM --keep-dup auto --bdg --nomodel --extsize 270 --outdir "/home/participant/Course_Materials/ChIPSeq/Materials/Results/Peaks/nomodel/"
MACS2 peak calls are written to the user specified output directory. the files have an "_peaks.xls" extension.
In addition this file contains information on samples and parameters used.
R-script
#set working directory to where you saved peak files
#setwd("/home/participant/Course_Materials/ChIPSeq/Materials/Results/Peaks/")
peakfile <-"/home/participant/Course_Materials/ChIPSeq/Materials/Results/Peaks/tp53_r2.fastq_trimmed.fastq_sorted_standard_peaks.xls"
peaks_DF <- read.delim(peakfile, header=FALSE)
peaks_DF[1:25,]
importing peaks into R
R-script
peakfile <- "/home/participant/Course_Materials/ChIPSeq/Materials/Results/Peaks/tp53_r2.fastq_trimmed.fastq_sorted_standard_peaks.xls"
peaks_DF <- read.delim2(peakfile, comment.char="#")
peaks_DF[1:3,]
Create GRanges object made of chromosome names and intervals stored as IRanges.
R-script
#method1
library(GenomicRanges)
peaks_GR <- GRanges(
seqnames=peaks_DF[,"chr"],
IRanges(peaks_DF[,"start"],
peaks_DF[,"end"]
)
)
peaks_GR
#method2
library("rtracklayer")
file_narrowPeak = "/home/participant/Course_Materials/ChIPSeq/Materials/Results/Peaks/tp53_r2.fastq_trimmed.fastq_sorted_standard_peaks.narrowPeak"
extraCols_narrowPeak <- c(signalValue = "numeric", pValue = "numeric",
qValue = "numeric", peak = "integer")
gr_narrowPeak <- import(file_narrowPeak, format = "BED",
extraCols = extraCols_narrowPeak)
Explore the GRanges object
R-script
seqnames(peaks_GR)
R-script
ranges(peaks_GR)
Compare peak sets
R-script
peakfile1 <- "/home/participant/Course_Materials/ChIPSeq/Materials/Results/Peaks/nomodel/tp53_r2.fastq_trimmed.fastq_sorted_extended_peaks.xls"
peaks_DF1 <- read.delim2(peakfile1, comment.char="#")
# load into GenomicRanges
peaks_GR1 <- GRanges (
seqnames=peaks_DF1[,"chr"],
IRanges(peaks_DF1[,"start"],
peaks_DF1[,"end"]
)
)
peakfile2 <- "/home/participant/Course_Materials/ChIPSeq/Materials/Results/Peaks/nomodel/TAp73beta_r2.fastq_trimmed.fastq_sorted_extended_peaks.xls"
peaks_DF2 <- read.delim(peakfile2, comment.char="#")
# load into GenomicRanges
peaks_GR2 <- GRanges(
seqnames=peaks_DF2[,"chr"],
IRanges(peaks_DF2[,"start"],
peaks_DF2[,"end"]
)
)
The number of peaks in first replicate overlapping those in second is different from number of second replicate peaks overlapping first.
This is because 2 peaks from one replicate may overlap 1 peak in the other replicate (Transitive
property of Venn operations between genomic ranges).
R-script
OnlyfirstPeakSet <- peaks_GR1[!peaks_GR1 %over% peaks_GR2]
OnlySecondPeakSet <- peaks_GR2[!peaks_GR2 %over% peaks_GR1]
firstANDsecondPeakSets <- peaks_GR1[peaks_GR1 %over% peaks_GR2]
secondANDfirstPeakSets <- peaks_GR2[peaks_GR2 %over% peaks_GR1]
length(OnlyfirstPeakSet)
length(OnlySecondPeakSet)
length (firstANDsecondPeakSets)
length (secondANDfirstPeakSets)
Finally convert Granges to bed
R-script
df <- data.frame(seqnames=seqnames(peaks_GR1),
starts=start(peaks_GR1)-1,
ends=end(peaks_GR1))
write.table(df, file="/home/participant/Course_Materials/ChIPSeq/Materials/Results/Peaks/peaks_GR1_converted.bed", quote=F, sep="\t", row.names=F, col.names=F)