In this practical, we will be going through various downstream analysis of ChIP-seq.
During this tutorial you will learn how to:
Part 1
Part 2
During Part1, all the commands can be run by clicking the green arrow or using “Console” tab. During Part 2, we need to type in commands in the main docker “Terminal”.
Summits are the actual binding sites, we would like to make a list of regions with summits in the middle. This is good especially for motif analysis, as you expect to see the motif at the binding site.
Frist, load the summit file and have a look inside. We will use read.table
to load bam file in R.
R
peakfile <-read.table("/home/ubuntu/Course_Materials/ChIPSeq/practicals/data/TAp73beta_r2.fastq_trimmed.fastq_sorted_summits.bed")
head(peakfile)
The summit regions are only 1bp length.
We would like to extend these genomic regions by 100bp in both directions, and have a set of genomic regions 200bp in length with the binding site in the middle.
R
# get a 200bp window around summit (-100 and +100)
Motif_Peaks <- data.frame((peakfile$V1), (peakfile$V2-100), (peakfile$V2+100))
head(Motif_Peaks)
Now we would like to write this out into a bed file. We can use write.table
function.
R
?write.table
We have to modify the arguments of write.table
so that it would match bed file format.
R
write.table(Motif_Peaks, "/home/ubuntu/Course_Materials/ChIPSeq/practicals/output/Peaks_for_motif_detection_1stattempt.bed")
Let’s have a look at top 10 rows of the file.
bash
head -n 10 "/home/ubuntu/Course_Materials/ChIPSeq/practicals/output/Peaks_for_motif_detection_1stattempt.bed"
Exercise 1
Optimize the arguments of
write.table
so that it would output a properly formatted bed file.
bash
head -n 10 "/home/ubuntu/Course_Materials/ChIPSeq/practicals/output/Peaks_for_motif_detection.bed"
We will use ChIPseeker to annotate genomic features.
R
library(ChIPseeker)
First, let’s load peak files we just created. ChIPseeker provides readPeakFile
to load the peak and store in GRanges object. GRanges object is an object for storing genomic locations widely used by Bioconductor tools. If you want to find out more about this object, please read GenomicRanges
vignette.
R
#The peak file was created in exercise
peak <- readPeakFile("/home/ubuntu/Course_Materials/ChIPSeq/practicals/output/Peaks_for_motif_detection.bed")
peak
Peak Annotation is performed by annotatePeak
. User can define TSS (transcription start site) region, by default TSS is defined from -3kb to +3kb.
Let’s first have a look at the documentation of annotatePeak
.
R
?annotatePeak
All the peak information contained in peakfile will be retained in the output of annotatePeak. The position and strand information of nearest genes are reported. The distance from peak to the TSS of its nearest gene is also reported. The genomic region of the peak is reported in annotation column. Since some annotation may overlap, ChIPseeker adopted the following priority in genomic annotation.
N.B. 1. Downstream is defined as the downstream of gene end. 2. ChIPseeker also provides parameter genomicAnnotationPriority for user to prioritize this hierarchy.
R
library(org.Hs.eg.db) #Ensembl and Entrez gene ID conversions
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene # Annotation of transcripts (transcripts' genomic regions)
peakAnno <- annotatePeak(peak, annoDb="org.Hs.eg.db", TxDb = txdb, tssRegion = c(-3000, 3000))
Now, let’s check peakAnno
object we just made.
R
head(as.data.frame(peakAnno))
You can see that each of the regions is assigned to a gene by the proximity and feature (annotation).
Let’s make a bar plot.
R
plotAnnoBar(peakAnno)
The distance from the peak (binding site) to the TSS of the nearest gene is calculated by annotatePeak and reported in the output. Function plotDistToTSS
is used to calculate the percentage of binding sites upstream and downstream from the TSS of the nearest genes, and visualize the distribution.
R
plotDistToTSS(peakAnno, title="Distribution of transcription factor-binding loci\nrelative to TSS")
annotatePeak
function of ChIPseeker assign the nearest gene’s name to each of the genomic regions. Using the assigned gene, ChIPseeker can perform functional enrichment analysis.
Enrichment analysis is widely used to make sense of a list of genes. There are several Bioconductor packages for investigating whether the number of selected genes associated with a particular biological term is larger than expected, including DOSE for Disease Ontology, ReactomePA for reactome pathway, clusterProfiler for Gene Ontology and KEGG enrichment analysis. We will perform reactome pathway analysis as an example.
R
library(ReactomePA)
If you remember each of the genes are assigned to the nearest gene. We will use these genes to perform function enrichment analysis. (Make sure the method suits your biological questions)
R
head(as.data.frame(peakAnno))
R
pathway <- enrichPathway(as.data.frame(peakAnno)$geneId)
pathway
Unfortunately, the dataset we are using is a small modified (single chromosome) dataset, we cannot find any enrichment here.
If the set of genes are biologically meaningful, we can make a nice plot out of it.
R
gene <- c("11171", "8243", "112464", "2194",
"9318", "79026", "1654", "65003",
"6240", "3476", "6238", "3836",
"4176", "1017", "249")
yy = enrichPathway(gene, pvalueCutoff=0.05)
head(summary(yy))
dotplot(yy)
Let’s switch to terminal.
Docker
docker exec -ti -u 0 docker /bin/bash
In this section, we are going to make a heatmap which shows the distribution of peaks around the regions specified. These regions could be regions of the genome with specific annotation (eg. TSS) or the regions which are called as peaks. One of the popular packages used for this purpose is called Deeptools. In this practical, we could like to make a heatmap of input and ChIP seq data using peak regions.
Deeptools offers a function (bamCoverage
) which can convert BAM files into BigWig file with various normalization methods. In this practical, we use BigWig files of input and ChIP-seq created from MACS2 callpeak
bedGraph output, as they are already normalized for library sizes.
First, we need to prepare an intermediate file that can be used to plot heatmap. computeMatrix
function takes BigWig file and bed file with the genomic regions of your interest as input, compute the values (score per genome region) needed for heatmaps and summary plots.
For example, we can run the following command.
bash
computeMatrix reference-point -S /home/ubuntu/Course_Materials/ChIPSeq/practicals/output/TAp73beta_r2.fastq_trimmed.fastq_sorted_standard_treat_pileup.bw /home/ubuntu/Course_Materials/ChIPSeq/practicals/output/TAp73beta_r2.fastq_trimmed.fastq_sorted_standard_control_lambda.bw -R /home/ubuntu/Course_Materials/ChIPSeq/practicals/output/Peaks_for_motif_detection.bed -a 2000 -b 2000 --skipZeros --sortRegions descend -o /home/ubuntu/Course_Materials/ChIPSeq/practicals/output/matrix.gz --referencePoint "center"
Exercise 2
What does each argument in the code above mean? (Hint: run
computeMatrix reference-point
in Terminal.)
The function uses the arguments as follows:
--referencePoint
) specified in the BED file are taken.Now, we can visualize the read coverages for genomic regions using the matrix we just made. We use a function called plotHeatmap
.
bash
plotHeatmap --matrixFile /home/ubuntu/Course_Materials/ChIPSeq/practicals/output/matrix.gz --outFileSortedRegions /home/ubuntu/Course_Materials/ChIPSeq/practicals/output/heatmapsortedregions.bed --outFileName /home/ubuntu/Course_Materials/ChIPSeq/practicals/output/heatmap.png
Exercise 3
What does each argument in the code above mean? (Hint: run
plotHeatmap
in Terminal.)
plotHeatmap
can sort the genomic regions using the coverage scores calculated in computeMatrix
and plots heatmap. By default, the regions are sorted in the decsending order.
Let’s have a look at the sorted bed file.
bash
head -n 10 /home/ubuntu/Course_Materials/ChIPSeq/practicals/output/heatmapsortedregions.bed
You can see that the genomic regions are reordered. This order matches the order in heatmap. Let’s have a look at the heatmap.png file. (Use the tab “Files” on the right side of Rstudio.)
Exercise 4
- Make a heatmap with 1000bp regions (500bp upstream, 500bp downstream) with summits at their centre. Use “matrix_1.gz” as the name of the file, and output heatmap as “heatmap_1”.
- Change the color of the heatmap (eg. to ‘copper’) and write it out as “heatmap_2”. (Hint: Have a look at
--colorMap
section ofplotHeatmap
)- Sort the genomic regions in the heatmap in ascending order and write it out as “heatmap_3”. (Hint: Have a look at
--sortRegions
section ofplotHeatmap
)
Let’s use MEME Suite to do the motif analysis. There are various different tools available.
MEME-ChIP performs several motif analysis steps on a set of user provided DNA sequences in fasta format. It is especially useful for analyzing peaks from ChIP-seq experiments. MEME-ChIP can
First, we have to convert our bed file into fasta file. We can use bedtools for this. (It is a very useful tool especially to edit bed files!)
We use a function called bedtools getfasta
. First, as usual, let’s have a look what arguments the function takes.
bash
bedtools getfasta -h
You need the bed file (“Peaks_for_motif_detection.bed”) and the actual sequence of hg38 (“hg38_chr3.fa”) as input to convert genomic regions specified in bed file into the actual sequences of those regions. We will take a look what the fasta file “hg38_chr3.fa” looks like. You can download this kind of data from gencode or UCSC websites. Ideally, you use the same fasta file as you used for the upstream analysis.
bash
# If strand information is known use -s option as well
bedtools getfasta -fi "/home/ubuntu/Course_Materials/ChIPSeq/practicals/data/hg38_chr3.fa" -bed "/home/ubuntu/Course_Materials/ChIPSeq/practicals/output/Peaks_for_motif_detection.bed" -fo "/home/ubuntu/Course_Materials/ChIPSeq/practicals/output/Tp73_peaks.fa"
Now, let’s have a look at the first 10 lines of this fasta file.
bash
head -n 10 /home/ubuntu/Course_Materials/ChIPSeq/practicals/output/Tp73_peaks.fa
With this fasta file, we are ready to run meme-chip! Please open up your terminal to taste the meme-chip!
bash
meme-chip
First, it will detect de-novo motifs using MEME
and DREME
. These two use different algorithms to detect motifs (MEME: Expectation Maximisation (EM) algorithm, DREME: pattern matching). MEME is good at finding relatively long motif, whereas DREME is better at finding short motifs.
Ideally the sequences should be all the same length, between 100 and 500 base-pairs long and centrally enriched for motifs. (Ours is 200bp, centrally enriched.)
The de-novo motifs discovered by MEME
and DREME
are compared against known motifs from PWM databases using TOMTOM
. The motifs found are also checked by CentriMo
for central enrichment (enrichment of the motif in the central 100bp regions compared to more distal regions of 200bp).
SpaMo
uses each of the discovered motifs as the “primary” motif, and all the motifs in the database as potential “secondary” motifs and reports the secondary motifs whose occurrences are enriched at particular distances relative to the primary motif’s occurrences in the input sequences.
bash
# This takes a while to run (30-45 mins)
meme-chip -oc /home/ubuntu/Course_Materials/ChIPSeq/practicals/output/meme -db /home/ubuntu/Course_Materials/ChIPSeq/practicals/data/JASPAR2020_CORE_vertebrates_non-redundant_pfms.meme /home/ubuntu/Course_Materials/ChIPSeq/practicals/output/Tp73_peaks.fa
The function above takes a while to run, so let’s have a look at the results first.
A few examples from the vignette.
R
#load package
library("STRINGdb")
string_db <- STRINGdb$new( version="11", species=9606,
score_threshold=0, input_directory="" )
STRINGdb$methods() # To list all the methods available.
STRINGdb$help("get_graph") # To visualize their documentation.
data(diff_exp_example1)
head(diff_exp_example1)
example1_mapped <- string_db$map( diff_exp_example1, "gene", removeUnmappedRows = TRUE )
options(SweaveHooks=list(fig=function()
par(mar=c(2.1, 0.1, 4.1, 2.1))))
#par(mar=c(1.1, 0.1, 4.1, 2.1))))
hits <- example1_mapped$STRING_id[1:200]
#plot network
getOption("SweaveHooks")[["fig"]]()
string_db$plot_network( hits )
# filter by p-value and add a colour column
# (i.e. green down-regulated genes and red for up-regulated genes)
example1_mapped_pval05 <- string_db$add_diff_exp_color( subset(example1_mapped, pvalue<0.05),
logFcColStr="logFC" )
# post payload information to the STRING server
payload_id <- string_db$post_payload( example1_mapped_pval05$STRING_id,
colors=example1_mapped_pval05$color)
getOption("SweaveHooks")[["fig"]]()
# display a STRING network png with the "halo"
string_db$plot_network( hits, payload_id=payload_id )
options(SweaveHooks=list(fig=function()
par(mar=c(2.1, 2.1, 4.1, 2.1))))
#plot enrichment
getOption("SweaveHooks")[["fig"]]()
# plot the enrichment for the best 1000 genes
string_db$plot_ppi_enrichment(example1_mapped$STRING_id[1:1000], quiet=TRUE )
GeneMANIA (web app and Cytoscape module) finds connectivity between a genelist or other genes that are related to a set of input genes, using a very large set of functional association data. Association data include protein and genetic interactions, pathways, co-expression, co-localization and protein domain similarity. You can use GeneMANIA to find new members of a pathway or complex, find additional genes you may have missed in your screen or find new genes with a specific function, such as protein kinases. Your question is defined by the set of genes you input.
Genes co-regulated by TFs or associated with certain epigenomic modifications may belong to common functional groups. Gene Ontology enrichment analysis, GSEA, pathway and interaction analysis helps identify these enriched biological features and their connectivity.
Ontology enrichment analysis may be performed on the sets of genes with peaks associated to them. ChIPpeakAnno can map peaks to genes. In this example we will consider genes with peaks within 5000bp of a gene’s TSS.
How to associate genomic regions to genes: *basalPlusExt: mode ‘Basal plus extension’. Gene regulatory domain definition: Each gene is assigned a basal regulatory domain of a minimum distance upstream and downstream of the TSS (regardless of other nearby genes). The gene regulatory domain is extended in both directions to the nearest gene’s basal domain but no more than the maximum extension in one direction.
If using the older GREAT ver=3 and you want to use one of the hg38 peak files you generated use the LiftOver tool to convert your hg38 peak file to hg19. you will need a chain file and the liftover tool from UCSC and the linux version of the liftover tool in “/home/ubuntu/Desktop/Course_Materials/ChIPSeq/practicals/data/liftOver”
Not necessary for this practical: You can also download liftover tools (for your operating system) from here liftover
Usage: liftOver input.bed hg38ToHg19.over.chain.gz output.bed unlifted.bed
Terminal
bash
# DO NOT RUN!
# A hg19 converted lifted over file is available at "6/home/ubuntu/Desktop/Course_Materials/ChIPSeq/practicals/data/liftOver/STAT1_hg19_peaks_student.bed
#liftOver tool converts genome coordinates and genome annotation files between assemblies (reference genomes). hg38 to hg19 in this case:
#This is an example of how to convert hg38 coordinates to hg19 without re-aligning
cd "/home/ubuntu/Course_Materials/ChIPSeq/practicals/data/liftOver"
# Example of hg38 to hg19 conversion of a STAT1 dataset (You can use TP73 if you'd like to be consistant)
liftOver SRR502363_trim_merged.bl_peaks.bed hg38ToHg19.over.chain.gz STAT1_hg19_peaks_student.bed unlifted.bed
Now run rGREAT example
R
library("circlize")
library("rGREAT")
# Toy example on randomly generated bed file
set.seed(123)
bed = circlize::generateRandomBed(nr = 1000, nc = 0)
bed[1:2, ]
# submit job (set for once every 300seconds)
job <- submitGreatJob(bed,request_interval = 300, species="hg19", version = "3.0.0")
# list of dataframes, each one a table of single ontology
tb <- getEnrichmentTables(job)
names(tb)
tb[[1]][1:2, ]
job
# list available ontologies or categories
availableOntologies(job)
availableCategories(job)
availableOntologies(job, category = "GO")
res = plotRegionGeneAssociationGraphs(job)
# You can get results by either specifying the ontologies or by the pre-defined categories (categories already contains pre-defined sets of ontologies)
tb1 = getEnrichmentTables(job, ontology = c("GO Molecular Function", "Ensembl Genes"))
head(tb1)
tb2 = getEnrichmentTables(job, category = c("GO"))
head(tb2)
R
par(mfrow = c(1, 3))
#res is a GenomicRanges object
res = plotRegionGeneAssociationGraphs(job)
R
res[1:2, ]
By specifying ontology and term ID, you can get the association in a certain term. Here the term ID is from the first column of the data frame which is returned by getEnrichmentTables().
R
par(mfrow = c(1, 3))
res = plotRegionGeneAssociationGraphs(job, ontology = "GO Molecular Function",
termID = "GO:0004984")
Exercise 5
Convert “TAp73beta_r1.fastq_trimmed.fastq_sorted_peaks.narrowPeak” file to a BED file or GRange object and run rGREAT. The rGREAT Vignette has worked examples.
Joanna A. Krupka MRC Cancer Unit, University of Cambridge, UK
Shoko Hirosue MRC Cancer Unit, University of Cambridge, UK