In this practical, we will be going through various downstream analysis of ChiP-seq. This material was adapted from the practical material written by Shamith Samarajiwa in 2019.
During this tutorial you will learn how to:
Part 1
Part 2
During Part1, all the commands can be run by clicking the green arrow or using “Console” tab. During Part 2, we need to type in commands in “Terminal” tab of rstudio.
Summits are the actual binding sites, we would like to make a list of regions with summits in the middle. This is good especially for motif analysis, as you expect to see the motif at the binding site.
Frist, load the summit file and have a look inside. We will use read.table
to load bam file in R.
R
peakfile <-read.table("/home/ubuntu/Course_Materials/ChIPSeq/practicals/data/TAp73beta_r2.fastq_trimmed.fastq_sorted_summits.bed")
head(peakfile)
## V1 V2 V3
## 1 chr3 51424 51425
## 2 chr3 51593 51594
## 3 chr3 111241 111242
## 4 chr3 440975 440976
## 5 chr3 441075 441076
## 6 chr3 710041 710042
## V4
## 1 /home/DB679/Documents/articles_etc/2017/courseSeptember/course_data/GSE15780/macs2/TAp73beta_r2.fastq_trimmed.fastq_sorted_peak_1
## 2 /home/DB679/Documents/articles_etc/2017/courseSeptember/course_data/GSE15780/macs2/TAp73beta_r2.fastq_trimmed.fastq_sorted_peak_2
## 3 /home/DB679/Documents/articles_etc/2017/courseSeptember/course_data/GSE15780/macs2/TAp73beta_r2.fastq_trimmed.fastq_sorted_peak_3
## 4 /home/DB679/Documents/articles_etc/2017/courseSeptember/course_data/GSE15780/macs2/TAp73beta_r2.fastq_trimmed.fastq_sorted_peak_4
## 5 /home/DB679/Documents/articles_etc/2017/courseSeptember/course_data/GSE15780/macs2/TAp73beta_r2.fastq_trimmed.fastq_sorted_peak_5
## 6 /home/DB679/Documents/articles_etc/2017/courseSeptember/course_data/GSE15780/macs2/TAp73beta_r2.fastq_trimmed.fastq_sorted_peak_6
## V5
## 1 8.64697
## 2 6.87494
## 3 12.35170
## 4 13.13554
## 5 9.19477
## 6 4.39078
The regions are only 1bp length.
We would like to extend these genomic regions by 200bp in both directions, and have a set of genomic regions 400bp in length with the binding site in the middle.
R
# get a 400bp window around summit (-200 and +200)
Motif_Peaks <- data.frame((peakfile$V1), (peakfile$V2-200), (peakfile$V2+200))
head(Motif_Peaks)
## X.peakfile.V1. X.peakfile.V2...200. X.peakfile.V2...200..1
## 1 chr3 51224 51624
## 2 chr3 51393 51793
## 3 chr3 111041 111441
## 4 chr3 440775 441175
## 5 chr3 440875 441275
## 6 chr3 709841 710241
Now we would like to write this out into a bed file. We can use write.table
function.
R
?write.table
We have to modify the arguments of write.table
so that it would match bed file format.
R
write.table(Motif_Peaks, "/home/ubuntu/Course_Materials/ChIPSeq/practicals/output/Peaks_for_motif_detection_1stattempt.bed")
Let’s have a look at top 10 rows of the file.
bash
head -n 10 "/home/ubuntu/Course_Materials/ChIPSeq/practicals/output/Peaks_for_motif_detection_1stattempt.bed"
## "X.peakfile.V1." "X.peakfile.V2...200." "X.peakfile.V2...200..1"
## "1" "chr3" 51224 51624
## "2" "chr3" 51393 51793
## "3" "chr3" 111041 111441
## "4" "chr3" 440775 441175
## "5" "chr3" 440875 441275
## "6" "chr3" 709841 710241
## "7" "chr3" 854190 854590
## "8" "chr3" 963331 963731
## "9" "chr3" 984287 984687
Exercise 1
Optimize the arguments of
write.table
so that it would output a properly formatted bed file.
R
# create bed file
options(scipen=999) # This forces R not to use exponential notations (eg. e+10)
write.table(Motif_Peaks,
file= "/home/ubuntu/Course_Materials/ChIPSeq/practicals/output/Peaks_for_motif_detection.bed",
row.names=FALSE,
col.names =FALSE,
sep="\t",
quote =FALSE)
bash
head -n 10 "/home/ubuntu/Course_Materials/ChIPSeq/practicals/output/Peaks_for_motif_detection.bed"
## chr3 51224 51624
## chr3 51393 51793
## chr3 111041 111441
## chr3 440775 441175
## chr3 440875 441275
## chr3 709841 710241
## chr3 854190 854590
## chr3 963331 963731
## chr3 984287 984687
## chr3 1069119 1069519
We will use ChIPseeker to annotate genomic features.
R
library(ChIPseeker)
##
## ChIPseeker v1.24.0 For help: https://guangchuangyu.github.io/software/ChIPseeker
##
## If you use ChIPseeker in published research, please cite:
## Guangchuang Yu, Li-Gen Wang, Qing-Yu He. ChIPseeker: an R/Bioconductor package for ChIP peak annotation, comparison and visualization. Bioinformatics 2015, 31(14):2382-2383
First, let’s load peak files we just created. ChIPseeker provides readPeakFile
to load the peak and store in GRanges object. GRanges object is an object for genomic locations widely used by Bioconductor tools. If you want to find out more about this object, please read GenomicRanges
vignette.
R
peak <- readPeakFile("/home/ubuntu/Course_Materials/ChIPSeq/practicals/data/Peaks_for_motif_detection.bed")
peak
## GRanges object with 2962 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr3 51225-51624 *
## [2] chr3 51394-51793 *
## [3] chr3 111042-111441 *
## [4] chr3 440776-441175 *
## [5] chr3 440876-441275 *
## ... ... ... ...
## [2958] chr3 198089364-198089763 *
## [2959] chr3 198103083-198103482 *
## [2960] chr3 198130136-198130535 *
## [2961] chr3 198150244-198150643 *
## [2962] chr3 198184891-198185290 *
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Peak Annotation is performed by annotatePeak
. User can define TSS (transcription start site) region, by default TSS is defined from -3kb to +3kb.
Let’s first have a look at the documentation of annotatePeak
.
R
?annotatePeak
All the peak information contained in peakfile will be retained in the output of annotatePeak. The position and strand information of nearest genes are reported. The distance from peak to the TSS of its nearest gene is also reported. The genomic region of the peak is reported in annotation column. Since some annotation may overlap, ChIPseeker adopted the following priority in genomic annotation.
N.B. 1. Downstream is defined as the downstream of gene end. 2. ChIPseeker also provides parameter genomicAnnotationPriority for user to prioritize this hierachy.
R
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
## Loading required package: GenomicFeatures
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## Filter, Find, Map, Position, Reduce, anyDuplicated, append,
## as.data.frame, basename, cbind, colnames, dirname, do.call,
## duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
## lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
## pmin.int, rank, rbind, rownames, sapply, setdiff, sort, table,
## tapply, union, unique, unsplit, which, which.max, which.min
## Loading required package: S4Vectors
## Loading required package: stats4
##
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:base':
##
## expand.grid
## Loading required package: IRanges
## Loading required package: GenomeInfoDb
## Loading required package: GenomicRanges
## Loading required package: AnnotationDbi
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene # Annotation of transcripts (transcripts' genomic regions)
peakAnno <- annotatePeak(peak, annoDb="org.Hs.eg.db", TxDb = txdb, tssRegion = c(-3000, 3000))
## >> preparing features information... 2020-07-30 17:47:41
## >> identifying nearest features... 2020-07-30 17:47:43
## >> calculating distance from peak to TSS... 2020-07-30 17:47:43
## >> assigning genomic annotation... 2020-07-30 17:47:43
## >> adding gene annotation... 2020-07-30 17:48:05
## Loading required package: org.Hs.eg.db
##
## 'select()' returned 1:many mapping between keys and columns
## >> assigning chromosome lengths 2020-07-30 17:48:05
## >> done... 2020-07-30 17:48:05
Now, let’s check peakAnno
object we just made.
R
head(as.data.frame(peakAnno))
## seqnames start end width strand
## 1 chr3 51225 51624 400 *
## 2 chr3 51394 51793 400 *
## 3 chr3 111042 111441 400 *
## 4 chr3 440776 441175 400 *
## 5 chr3 440876 441275 400 *
## 6 chr3 709842 710241 400 *
## annotation geneChr geneStart geneEnd
## 1 Distal Intergenic 3 23757 24501
## 2 Distal Intergenic 3 23757 24501
## 3 Exon (ENST00000663345.1/101927174, exon 3 of 3) 3 109707 196721
## 4 Distal Intergenic 3 389445 405739
## 5 Distal Intergenic 3 389445 405739
## 6 Intron (ENST00000661103.1/101927215, intron 2 of 3) 3 546192 845322
## geneLength geneStrand geneId transcriptId distanceToTSS
## 1 745 1 102723448 ENST00000440867.1 27468
## 2 745 1 102723448 ENST00000440867.1 27637
## 3 87015 2 101927174 ENST00000663345.1 85280
## 4 16295 1 10752 ENST00000445697.1 51331
## 5 16295 1 10752 ENST00000445697.1 51431
## 6 299131 1 101927215 ENST00000653731.1 163650
## ENSEMBL SYMBOL GENENAME
## 1 ENSG00000223587 LINC01986 long intergenic non-protein coding RNA 1986
## 2 ENSG00000223587 LINC01986 long intergenic non-protein coding RNA 1986
## 3 ENSG00000224318 CHL1-AS2 CHL1 antisense RNA 2
## 4 ENSG00000134121 CHL1 cell adhesion molecule L1 like
## 5 ENSG00000134121 CHL1 cell adhesion molecule L1 like
## 6 ENSG00000224957 LINC01266 long intergenic non-protein coding RNA 1266
You can see that each of the regions is assigned to a gene by the proximity and feature (annotation).
Let’s make a bar plot.
R
plotAnnoBar(peakAnno)
The distance from the peak (binding site) to the TSS of the nearest gene is calculated by annotatePeak and reported in the output. Function plotDistToTSS
is used to calculate the percentage of binding sites upstream and downstream from the TSS of the nearest genes, and visualize the distribution.
R
plotDistToTSS(peakAnno,
title="Distribution of transcription factor-binding loci\nrelative to TSS")
annotatePeak
function of ChIPseeker assign the nearest gene’s name to each of the genomic regions. Using the assigned gene, ChIPseeker can perform functional enrichment analysis.
Enrichment analysis is widely used to make sense of a list of genes. There are several Bioconductor packages for investigating whether the number of selected genes associated with a particular biological term is larger than expected, including DOSE for Disease Ontology, ReactomePA for reactome pathway, clusterProfiler for Gene Ontology and KEGG enrichment analysis. We will perform reactome pathway analysis as an example.
R
library(ReactomePA)
## ReactomePA v1.32.0 For help: https://guangchuangyu.github.io/ReactomePA
##
## If you use ReactomePA in published research, please cite:
## Guangchuang Yu, Qing-Yu He. ReactomePA: an R/Bioconductor package for reactome pathway analysis and visualization. Molecular BioSystems 2016, 12(2):477-479
If you remember each of the genes are assigned to the nearest gene. We will use these genes to perform function enrichment analysis. (Make sure the method suits your biological questions)
R
head(as.data.frame(peakAnno))
## seqnames start end width strand
## 1 chr3 51225 51624 400 *
## 2 chr3 51394 51793 400 *
## 3 chr3 111042 111441 400 *
## 4 chr3 440776 441175 400 *
## 5 chr3 440876 441275 400 *
## 6 chr3 709842 710241 400 *
## annotation geneChr geneStart geneEnd
## 1 Distal Intergenic 3 23757 24501
## 2 Distal Intergenic 3 23757 24501
## 3 Exon (ENST00000663345.1/101927174, exon 3 of 3) 3 109707 196721
## 4 Distal Intergenic 3 389445 405739
## 5 Distal Intergenic 3 389445 405739
## 6 Intron (ENST00000661103.1/101927215, intron 2 of 3) 3 546192 845322
## geneLength geneStrand geneId transcriptId distanceToTSS
## 1 745 1 102723448 ENST00000440867.1 27468
## 2 745 1 102723448 ENST00000440867.1 27637
## 3 87015 2 101927174 ENST00000663345.1 85280
## 4 16295 1 10752 ENST00000445697.1 51331
## 5 16295 1 10752 ENST00000445697.1 51431
## 6 299131 1 101927215 ENST00000653731.1 163650
## ENSEMBL SYMBOL GENENAME
## 1 ENSG00000223587 LINC01986 long intergenic non-protein coding RNA 1986
## 2 ENSG00000223587 LINC01986 long intergenic non-protein coding RNA 1986
## 3 ENSG00000224318 CHL1-AS2 CHL1 antisense RNA 2
## 4 ENSG00000134121 CHL1 cell adhesion molecule L1 like
## 5 ENSG00000134121 CHL1 cell adhesion molecule L1 like
## 6 ENSG00000224957 LINC01266 long intergenic non-protein coding RNA 1266
R
pathway <- enrichPathway(as.data.frame(peakAnno)$geneId)
pathway
## #
## # over-representation test
## #
## #...@organism human
## #...@ontology Reactome
## #...@keytype ENTREZID
## #...@gene chr [1:833] "102723448" "101927174" "10752" "101927215" "27255" "152330" ...
## #...pvalues adjusted by 'BH' with cutoff <0.05
## #...0 enriched terms found
## #...Citation
## Guangchuang Yu, Qing-Yu He. ReactomePA: an R/Bioconductor package for
## reactome pathway analysis and visualization. Molecular BioSystems
## 2016, 12(2):477-479
Unfortunately, the dataset we are using is a small modified dataset, we cannot find any enrichment here.
If the set of genes are biologically meaningful, we can make a nice plot out of it.
R
gene <- c("11171", "8243", "112464", "2194",
"9318", "79026", "1654", "65003",
"6240", "3476", "6238", "3836",
"4176", "1017", "249")
yy = enrichPathway(gene, pvalueCutoff=0.05)
head(summary(yy))
## Warning in summary(yy): summary method to convert the object to data.frame is
## deprecated, please use as.data.frame instead.
## ID Description
## R-HSA-68962 R-HSA-68962 Activation of the pre-replicative complex
## R-HSA-69242 R-HSA-69242 S Phase
## R-HSA-176187 R-HSA-176187 Activation of ATR in response to replication stress
## GeneRatio BgRatio pvalue p.adjust qvalue
## R-HSA-68962 2/11 33/10654 0.0005028708 0.02510404 0.01709873
## R-HSA-69242 3/11 161/10654 0.0005112494 0.02510404 0.01709873
## R-HSA-176187 2/11 37/10654 0.0006328751 0.02510404 0.01709873
## geneID Count
## R-HSA-68962 4176/1017 2
## R-HSA-69242 8243/4176/1017 3
## R-HSA-176187 4176/1017 2
dotplot(yy)
In this section, we are going to make a heatmap which shows the distribution of peaks around the regions specified. These regions could be regions of the genome with specific annotation (eg. TSS) or the regions which are called as peaks. One of the popular packages used for this purpose is called Deeptools. In this practical, we could like to make a heatmap of input and ChIP seq data using peak regions.
Deeptools offers a function (bamCoverage
) which can convert BAM files into BigWig file with various normalization methods. In this practical, we use BigWig files of input and ChIP seq created from MACS2 callpeak
bedGraph output, as they are already normalized for library sizes.
First, we need to prepare a intermediate file that can be used to plot heatmap. computeMatrix
function takes BigWig file and bed file with the genomic regions of your interest as input, compute the values (score per genome region) needed for heatmaps and summary plots.
For example, we can run the following command.
bash
computeMatrix reference-point -S /home/ubuntu/Course_Materials/ChIPSeq/practicals/data/TAp73beta_r2.fastq_trimmed.fastq_sorted_standard_treat_pileup.bw /home/ubuntu/Course_Materials/ChIPSeq/practicals/data/TAp73beta_r2.fastq_trimmed.fastq_sorted_standard_control_lambda.bw -R /home/ubuntu/Course_Materials/ChIPSeq/practicals/data/Peaks_for_motif_detection.bed -a 2000 -b 2000 --skipZeros --sortRegions descend -o /home/ubuntu/Course_Materials/ChIPSeq/practicals/output/matrix.gz --referencePoint "center"
Exercise 2
What does each argument in the code above mean? (Hint: run
computeMatrix reference-point
in Terminal.)
Answer 2
- -S: bigWig file(s) containing the scores to be plotted.
- -R: bed file(s) containing the regions to plot.
- -a: –afterRegionStartLength. Distance downstream of the reference-point selected
- -b: –beforeRegionStartLength. Distance upstream of the reference-point selected.
- –skipZeros: Whether regions with only scores of zero should be included or not.
- –sortRegions: Whether the output file should present the regions sorted.
- -o: File name to save the gzipped matrix file needed to plot heatmap.
- –referencePoint: The reference point for the plotting.
The function uses the arguments as follows:
--referencePoint
) specified in the BED file are taken.Now, we can visualize the read coverages for genomic regions using the matrix we just made. We use a function called plotHeatmap
.
bash
plotHeatmap --matrixFile /home/ubuntu/Course_Materials/ChIPSeq/practicals/output/matrix.gz --outFileSortedRegions /home/ubuntu/Course_Materials/ChIPSeq/practicals/output/heatmapsortedregions.bed --outFileName /home/ubuntu/Course_Materials/ChIPSeq/practicals/output/heatmap.png
Exercise 3
What does each argument in the code above mean? (Hint: run
plotHeatmap
in Terminal.)
Answer 3
- –matrixFile: Matrix file from the computeMatrix tool.
- –outFileSortedRegions: The name of bed file into which the regions are sorted (The order of genomic regions is the same as the one used in heatmap).
- –outFileName: output file name.
plotHeatmap
can sort the genomic regions using the coverage scores calculated in computeMatrix
and plots heatmap. By default, the regions are sorted in the decsending order.
Let’s have a look at the sorted bed file.
bash
head -n 10 /home/ubuntu/Course_Materials/ChIPSeq/practicals/output/heatmapsortedregions.bed
You can see that the genomic regions are reordered. This order matches the order in heatmap. Let’s have a look at the heatmap.png file. (Use the tab “Files” on the right side of Rstudio.)
Exercise 4
- Make a heatmap with 1000bp regions (500bp upstream, 500bp downstream) with summits at their centre. Use “matrix_1.gz” as the name of the file, and output heatmap as “heatmap_1”.
- Change the color of the heatmap (eg. to ‘copper’) and write it out as “heatmap_2”. (Hint: Have a look at
--colorMap
section ofplotHeatmap
)- Sort the genomic regions in the heatmap in ascending order and write it out as “heatmap_3”. (Hint: Have a look at
--sortRegions
section ofplotHeatmap
)
Answer 4
bash
#1
computeMatrix reference-point -S /home/ubuntu/Course_Materials/ChIPSeq/practicals/data/TAp73beta_r2.fastq_trimmed.fastq_sorted_standard_treat_pileup.bw /home/ubuntu/Course_Materials/ChIPSeq/practicals/data/TAp73beta_r2.fastq_trimmed.fastq_sorted_standard_control_lambda.bw -R /home/ubuntu/Course_Materials/ChIPSeq/practicals/data/Peaks_for_motif_detection.bed -a 500 -b 500 --skipZeros --sortRegions descend -o /home/ubuntu/Course_Materials/ChIPSeq/practicals/output/matrix_1.gz --referencePoint "center"
plotHeatmap --matrixFile /home/ubuntu/Course_Materials/ChIPSeq/practicals/output/matrix_1.gz --outFileSortedRegions /home/ubuntu/Course_Materials/ChIPSeq/practicals/output/heatmapsortedregions.bed --outFileName /home/ubuntu/Course_Materials/ChIPSeq/practicals/output/heatmap_1.png
#2
plotHeatmap --matrixFile /home/ubuntu/Course_Materials/ChIPSeq/practicals/output/matrix.gz --outFileSortedRegions /home/ubuntu/Course_Materials/ChIPSeq/practicals/output/heatmapsortedregions.bed --outFileName /home/ubuntu/Course_Materials/ChIPSeq/practicals/output/heatmap.png --colorMap 'copper'
#3
plotHeatmap --matrixFile /home/ubuntu/Course_Materials/ChIPSeq/practicals/output/matrix.gz --outFileSortedRegions /home/ubuntu/Course_Materials/ChIPSeq/practicals/output/heatmapsortedregions.bed --outFileName /home/ubuntu/Course_Materials/ChIPSeq/practicals/output/heatmap_3.png --sortRegions ascend
Let’s use MEME Suite to do the motif analysis. There are various different tools available.
MEME-ChIP performs several motif analysis steps on a set of user provided DNA sequences in fasta format. It is especially useful for analyzing peaks from ChIP-seq experiments. MEME-ChIP can
First, we have to convert our bed file into fasta file. We can use bedtools for this. (It is a very useful tool especially to edit bed files!)
We use a function called bedtools getfasta
. First, as usual, let’s have a look what arguments the function takes.
bash
bedtools getfasta -h
You need the bed file (“Peaks_for_motif_detection.bed”) and the actual sequence of hg38 (“hg38_chr3.fa”) as input to convert genomic regions specified in bed file into the actual sequences of those regions. We will take a look what the fasta file “hg38_chr3.fa” looks like. You can download this kind of data from gencode or UCSC websites. Ideally, you use the same fasta file as you used for the upstream analysis.
bash
# If strand information is known use -s option as well
bedtools getfasta -fi "/home/ubuntu/Course_Materials/ChIPSeq/practicals/data/hg38_chr3.fa" -bed "/home/ubuntu/Course_Materials/ChIPSeq/practicals/data/Peaks_for_motif_detection.bed" -fo "/home/ubuntu/Course_Materials/ChIPSeq/practicals/output/Tp73_peaks.fa"
Now, let’s have a look at the first 10 lines of this fasta file.
bash
head -n 10 /home/ubuntu/Course_Materials/ChIPSeq/practicals/output/Tp73_peaks.fa
With this fasta file, we are ready to run meme-chip!
bash
meme-chip
First, it will detect de-novo motifs using MEME
and DREME
. These two use different algorhythms to detect motifs (MEME: Expectation Maximisation (EM) algorithm, DREME: pattern matching). MEME is good at finding relatively long motif, whereas DREME is better at finding short motifs.
Ideally the sequences should be all the same length, between 100 and 500 base-pairs long and centrally enriched for motifs. (Ours is 400bp, centrally enriched.)
The de-novo motifs discovered by MEME
and DREME
are compared against known motifs from PWM databases using TOMTOM
. The motifs found are also checked by CentriMo
for central enrichment (enrichment of the motif in the central 100bp regions compared to more distal regions of 400bp).
SpaMo
uses each of the discovered motifs as the “primary” motif, and all the motifs in the database as potential “secondary” motifs and reports the secondary motifs whose occurrences are enriched at particular distances relative to the primary motif’s occurrences in the input sequences.
bash
# This takes a while to run (30-45 mins)
meme-chip -oc /home/ubuntu/Course_Materials/ChIPSeq/practicals/output/meme -db /home/ubuntu/Course_Materials/ChIPSeq/practicals/data/JASPAR/JASPAR_CORE_2016_vertebrates.meme /home/ubuntu/Course_Materials/ChIPSeq/practicals/data/Tp73_peaks.fa
The function above takes a while to run, so let’s have a look at the results first.