Contents

1 Introduction

In this practical, we will be going through various downstream analysis of ChIP-seq.

1.1 Learning objectives

During this tutorial you will learn how to:

Part 1

  • Preprocess the summit files so that it is ready to be used by downstream analysis
  • Annotate peaks
  • perform functional enrichment analysis

Part 2

  • make heatmaps
  • perform motif enrichment analysis
  • Network analysis using STRINGdb and GeneMania
  • Ontology enrichment using rGREAT
  • Using liftOver tool to convert between reference genomes

During Part1, all the commands can be run by clicking the green arrow or using “Console” tab. During Part 2, we need to type in commands in the main docker “Terminal”.

1.2 Extract regions around peak summits

Summits are the actual binding sites, we would like to make a list of regions with summits in the middle. This is good especially for motif analysis, as you expect to see the motif at the binding site.

Frist, load the summit file and have a look inside. We will use read.table to load bam file in R.

R

peakfile <-read.table("/home/ubuntu/Course_Materials/ChIPSeq/practicals/data/TAp73beta_r2.fastq_trimmed.fastq_sorted_summits.bed")
head(peakfile)

The summit regions are only 1bp length.

We would like to extend these genomic regions by 100bp in both directions, and have a set of genomic regions 200bp in length with the binding site in the middle.

R

# get a 200bp window around summit (-100 and +100)

Motif_Peaks <- data.frame((peakfile$V1), (peakfile$V2-100), (peakfile$V2+100))
head(Motif_Peaks)

Now we would like to write this out into a bed file. We can use write.table function.

R

?write.table

We have to modify the arguments of write.table so that it would match bed file format.

R

write.table(Motif_Peaks, "/home/ubuntu/Course_Materials/ChIPSeq/practicals/output/Peaks_for_motif_detection_1stattempt.bed")

Let’s have a look at top 10 rows of the file.

bash


head -n 10 "/home/ubuntu/Course_Materials/ChIPSeq/practicals/output/Peaks_for_motif_detection_1stattempt.bed"

Exercise 1

Optimize the arguments of write.table so that it would output a properly formatted bed file.

bash


head -n 10 "/home/ubuntu/Course_Materials/ChIPSeq/practicals/output/Peaks_for_motif_detection.bed"

2 Downstream Analysis Part 1

2.1 Annotation of genomic features to peaks using ChIPseeker

We will use ChIPseeker to annotate genomic features.

R

library(ChIPseeker)

First, let’s load peak files we just created. ChIPseeker provides readPeakFile to load the peak and store in GRanges object. GRanges object is an object for storing genomic locations widely used by Bioconductor tools. If you want to find out more about this object, please read GenomicRanges vignette.

R

#The peak file was created in exercise
peak <- readPeakFile("/home/ubuntu/Course_Materials/ChIPSeq/practicals/output/Peaks_for_motif_detection.bed")
peak

Peak Annotation is performed by annotatePeak. User can define TSS (transcription start site) region, by default TSS is defined from -3kb to +3kb. Let’s first have a look at the documentation of annotatePeak.

R

?annotatePeak

All the peak information contained in peakfile will be retained in the output of annotatePeak. The position and strand information of nearest genes are reported. The distance from peak to the TSS of its nearest gene is also reported. The genomic region of the peak is reported in annotation column. Since some annotation may overlap, ChIPseeker adopted the following priority in genomic annotation.

  • Promoter
  • 5’ UTR
  • 3’ UTR
  • Exon
  • Intron
  • Downstream
  • Intergenic

N.B. 1. Downstream is defined as the downstream of gene end. 2. ChIPseeker also provides parameter genomicAnnotationPriority for user to prioritize this hierarchy.

R

library(org.Hs.eg.db) #Ensembl and Entrez gene ID conversions
library(TxDb.Hsapiens.UCSC.hg38.knownGene)

txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene # Annotation of transcripts (transcripts' genomic regions)

peakAnno <- annotatePeak(peak, annoDb="org.Hs.eg.db", TxDb = txdb, tssRegion = c(-3000, 3000))

Now, let’s check peakAnno object we just made.

R

head(as.data.frame(peakAnno))

You can see that each of the regions is assigned to a gene by the proximity and feature (annotation).

Let’s make a bar plot.

R

plotAnnoBar(peakAnno)

The distance from the peak (binding site) to the TSS of the nearest gene is calculated by annotatePeak and reported in the output. Function plotDistToTSS is used to calculate the percentage of binding sites upstream and downstream from the TSS of the nearest genes, and visualize the distribution.

R

plotDistToTSS(peakAnno, title="Distribution of transcription factor-binding loci\nrelative to TSS")

2.2 Functional enrichment analysis using ChIPseeker

annotatePeak function of ChIPseeker assign the nearest gene’s name to each of the genomic regions. Using the assigned gene, ChIPseeker can perform functional enrichment analysis.

Enrichment analysis is widely used to make sense of a list of genes. There are several Bioconductor packages for investigating whether the number of selected genes associated with a particular biological term is larger than expected, including DOSE for Disease Ontology, ReactomePA for reactome pathway, clusterProfiler for Gene Ontology and KEGG enrichment analysis. We will perform reactome pathway analysis as an example.

R

library(ReactomePA)

If you remember each of the genes are assigned to the nearest gene. We will use these genes to perform function enrichment analysis. (Make sure the method suits your biological questions)

R

head(as.data.frame(peakAnno))

R

pathway <- enrichPathway(as.data.frame(peakAnno)$geneId)
pathway

Unfortunately, the dataset we are using is a small modified (single chromosome) dataset, we cannot find any enrichment here.

If the set of genes are biologically meaningful, we can make a nice plot out of it.

R

gene <- c("11171", "8243", "112464", "2194",
            "9318", "79026", "1654", "65003",
            "6240", "3476", "6238", "3836",
            "4176", "1017", "249")
yy = enrichPathway(gene, pvalueCutoff=0.05)
head(summary(yy))
dotplot(yy)

3 Downstream Analysis Part 2

Let’s switch to terminal.

Docker


docker exec -ti -u 0 docker /bin/bash 

3.1 Normalization and Visualization using Deeptools

In this section, we are going to make a heatmap which shows the distribution of peaks around the regions specified. These regions could be regions of the genome with specific annotation (eg. TSS) or the regions which are called as peaks. One of the popular packages used for this purpose is called Deeptools. In this practical, we could like to make a heatmap of input and ChIP seq data using peak regions.

Deeptools offers a function (bamCoverage) which can convert BAM files into BigWig file with various normalization methods. In this practical, we use BigWig files of input and ChIP-seq created from MACS2 callpeak bedGraph output, as they are already normalized for library sizes.

3.2 Deeptools

First, we need to prepare an intermediate file that can be used to plot heatmap. computeMatrix function takes BigWig file and bed file with the genomic regions of your interest as input, compute the values (score per genome region) needed for heatmaps and summary plots.

For example, we can run the following command.

bash

computeMatrix reference-point -S /home/ubuntu/Course_Materials/ChIPSeq/practicals/output/TAp73beta_r2.fastq_trimmed.fastq_sorted_standard_treat_pileup.bw /home/ubuntu/Course_Materials/ChIPSeq/practicals/output/TAp73beta_r2.fastq_trimmed.fastq_sorted_standard_control_lambda.bw -R /home/ubuntu/Course_Materials/ChIPSeq/practicals/output/Peaks_for_motif_detection.bed -a 2000 -b 2000 --skipZeros --sortRegions descend -o /home/ubuntu/Course_Materials/ChIPSeq/practicals/output/matrix.gz --referencePoint "center"

Exercise 2

What does each argument in the code above mean? (Hint: run computeMatrix reference-point in Terminal.)

The function uses the arguments as follows:

  1. The “center” of the regions (--referencePoint) specified in the BED file are taken.
  2. Values indicated with –beforeRegionStartLength (-b) and –afterRegionStartLength (-a) (=2000bp) are added.
  3. The resulting regions are split up into 50 bp bins (can be changed via (–binSize))
  4. The mean score based on the scores given in the bigWig file is calculated (the kind of score can be changed via –averageTypeBins).

Now, we can visualize the read coverages for genomic regions using the matrix we just made. We use a function called plotHeatmap.

bash


plotHeatmap --matrixFile /home/ubuntu/Course_Materials/ChIPSeq/practicals/output/matrix.gz --outFileSortedRegions /home/ubuntu/Course_Materials/ChIPSeq/practicals/output/heatmapsortedregions.bed --outFileName /home/ubuntu/Course_Materials/ChIPSeq/practicals/output/heatmap.png

Exercise 3

What does each argument in the code above mean? (Hint: run plotHeatmap in Terminal.)

plotHeatmap can sort the genomic regions using the coverage scores calculated in computeMatrix and plots heatmap. By default, the regions are sorted in the decsending order.

Let’s have a look at the sorted bed file.

bash


head -n 10 /home/ubuntu/Course_Materials/ChIPSeq/practicals/output/heatmapsortedregions.bed

You can see that the genomic regions are reordered. This order matches the order in heatmap. Let’s have a look at the heatmap.png file. (Use the tab “Files” on the right side of Rstudio.)

Exercise 4

  1. Make a heatmap with 1000bp regions (500bp upstream, 500bp downstream) with summits at their centre. Use “matrix_1.gz” as the name of the file, and output heatmap as “heatmap_1”.
  2. Change the color of the heatmap (eg. to ‘copper’) and write it out as “heatmap_2”. (Hint: Have a look at --colorMap section of plotHeatmap)
  3. Sort the genomic regions in the heatmap in ascending order and write it out as “heatmap_3”. (Hint: Have a look at --sortRegions section of plotHeatmap)

3.3 Motif Analysis using MEME Suite

Let’s use MEME Suite to do the motif analysis. There are various different tools available.

MEME-ChIP performs several motif analysis steps on a set of user provided DNA sequences in fasta format. It is especially useful for analyzing peaks from ChIP-seq experiments. MEME-ChIP can

  1. discover novel DNA-binding motifs in peak summits (100 characters by default) of the input sequences (with MEME and DREME),
  2. determine which motifs are most centrally enriched (with CENTRIMO),
  3. analyse them for similarity to known binding motifs (with TOMTOM), and
  4. automatically group significant motifs by similarity,
  5. perform a motif spacing analysis (with SpaMo), and,
  6. create a GFF file for viewing each motif’s predicted sites in a genome browser. (quoted from http://meme-suite.org/doc/meme-chip.html?man_type=web)

First, we have to convert our bed file into fasta file. We can use bedtools for this. (It is a very useful tool especially to edit bed files!)

We use a function called bedtools getfasta. First, as usual, let’s have a look what arguments the function takes.

bash

bedtools getfasta -h

You need the bed file (“Peaks_for_motif_detection.bed”) and the actual sequence of hg38 (“hg38_chr3.fa”) as input to convert genomic regions specified in bed file into the actual sequences of those regions. We will take a look what the fasta file “hg38_chr3.fa” looks like. You can download this kind of data from gencode or UCSC websites. Ideally, you use the same fasta file as you used for the upstream analysis.

bash

# If strand information is known use -s option as well

bedtools getfasta -fi "/home/ubuntu/Course_Materials/ChIPSeq/practicals/data/hg38_chr3.fa" -bed "/home/ubuntu/Course_Materials/ChIPSeq/practicals/output/Peaks_for_motif_detection.bed" -fo "/home/ubuntu/Course_Materials/ChIPSeq/practicals/output/Tp73_peaks.fa"

Now, let’s have a look at the first 10 lines of this fasta file.

bash

head -n 10 /home/ubuntu/Course_Materials/ChIPSeq/practicals/output/Tp73_peaks.fa

With this fasta file, we are ready to run meme-chip! Please open up your terminal to taste the meme-chip!

bash


meme-chip

First, it will detect de-novo motifs using MEME and DREME. These two use different algorithms to detect motifs (MEME: Expectation Maximisation (EM) algorithm, DREME: pattern matching). MEME is good at finding relatively long motif, whereas DREME is better at finding short motifs.

Ideally the sequences should be all the same length, between 100 and 500 base-pairs long and centrally enriched for motifs. (Ours is 200bp, centrally enriched.)

The de-novo motifs discovered by MEME and DREME are compared against known motifs from PWM databases using TOMTOM. The motifs found are also checked by CentriMo for central enrichment (enrichment of the motif in the central 100bp regions compared to more distal regions of 200bp).

SpaMo uses each of the discovered motifs as the “primary” motif, and all the motifs in the database as potential “secondary” motifs and reports the secondary motifs whose occurrences are enriched at particular distances relative to the primary motif’s occurrences in the input sequences.

bash

# This takes a while to run (30-45 mins)

meme-chip -oc /home/ubuntu/Course_Materials/ChIPSeq/practicals/output/meme -db /home/ubuntu/Course_Materials/ChIPSeq/practicals/data/JASPAR2020_CORE_vertebrates_non-redundant_pfms.meme /home/ubuntu/Course_Materials/ChIPSeq/practicals/output/Tp73_peaks.fa

The function above takes a while to run, so let’s have a look at the results first.

3.3.1 PScanChIP

PScanChIP website

  • Upload your centred bed file (Peaks_for_motif_detection.bed)

4 Network Analysis

4.1 STRING

STRING website

A few examples from the vignette.

R

#load package

library("STRINGdb")
string_db <- STRINGdb$new( version="11", species=9606, 
                           score_threshold=0, input_directory="" )
                           
STRINGdb$methods()              # To list all the methods available.
STRINGdb$help("get_graph")      # To visualize their documentation.

data(diff_exp_example1)
head(diff_exp_example1)


example1_mapped <- string_db$map( diff_exp_example1, "gene", removeUnmappedRows = TRUE )

options(SweaveHooks=list(fig=function()
par(mar=c(2.1, 0.1, 4.1, 2.1))))
#par(mar=c(1.1, 0.1, 4.1, 2.1))))

hits <- example1_mapped$STRING_id[1:200]

#plot network

getOption("SweaveHooks")[["fig"]]()
string_db$plot_network( hits )

# filter by p-value and add a colour column 
# (i.e. green down-regulated genes and red for up-regulated genes)
example1_mapped_pval05 <- string_db$add_diff_exp_color( subset(example1_mapped, pvalue<0.05), 
                                                            logFcColStr="logFC" ) 
                                                            
# post payload information to the STRING server
payload_id <- string_db$post_payload( example1_mapped_pval05$STRING_id, 
                                        colors=example1_mapped_pval05$color)     
getOption("SweaveHooks")[["fig"]]()
# display a STRING network png with the "halo"
string_db$plot_network( hits, payload_id=payload_id )

options(SweaveHooks=list(fig=function()
par(mar=c(2.1, 2.1, 4.1, 2.1))))

#plot enrichment
getOption("SweaveHooks")[["fig"]]()
# plot the enrichment for the best 1000 genes
string_db$plot_ppi_enrichment(example1_mapped$STRING_id[1:1000], quiet=TRUE ) 

4.2 GeneMANIA

GeneMANIA website

GeneMANIA (web app and Cytoscape module) finds connectivity between a genelist or other genes that are related to a set of input genes, using a very large set of functional association data. Association data include protein and genetic interactions, pathways, co-expression, co-localization and protein domain similarity. You can use GeneMANIA to find new members of a pathway or complex, find additional genes you may have missed in your screen or find new genes with a specific function, such as protein kinases. Your question is defined by the set of genes you input.

5 Ontology Enrichment Analysis

5.1 GREAT (Genomic Regions Enrichment of Annotations Tool)

Genes co-regulated by TFs or associated with certain epigenomic modifications may belong to common functional groups. Gene Ontology enrichment analysis, GSEA, pathway and interaction analysis helps identify these enriched biological features and their connectivity.

GREAT website

Ontology enrichment analysis may be performed on the sets of genes with peaks associated to them. ChIPpeakAnno can map peaks to genes. In this example we will consider genes with peaks within 5000bp of a gene’s TSS.

How to associate genomic regions to genes: *basalPlusExt: mode ‘Basal plus extension’. Gene regulatory domain definition: Each gene is assigned a basal regulatory domain of a minimum distance upstream and downstream of the TSS (regardless of other nearby genes). The gene regulatory domain is extended in both directions to the nearest gene’s basal domain but no more than the maximum extension in one direction.

  • adv_upstream: proximal extension to upstream (unit: kb)
  • adv_downstream: proximal extension to downstream (unit: kb)
  • adv_span: maximum extension (unit: kb)
  • twoClosest: mode ‘Two nearest genes’. Gene regulatory domain definition: Each gene is assigned a regulatory domain that extends in both directions to the nearest gene’s TSS but no more than the maximum extension in one direction.
  • adv_twoDistance: maximum extension (unit: kb)
  • oneClosest: mode ‘Single nearest gene’. Gene regulatory domain definition: Each gene is assigned a regulatory domain that extends in both directions to the midpoint between the gene’s TSS and the nearest gene’s TSS but no more than the maximum extension in one direction.
  • adv_oneDistance: maximum extension (unit: kb)

If using the older GREAT ver=3 and you want to use one of the hg38 peak files you generated use the LiftOver tool to convert your hg38 peak file to hg19. you will need a chain file and the liftover tool from UCSC and the linux version of the liftover tool in “/home/ubuntu/Desktop/Course_Materials/ChIPSeq/practicals/data/liftOver”

Not necessary for this practical: You can also download liftover tools (for your operating system) from here liftover

Usage: liftOver input.bed hg38ToHg19.over.chain.gz output.bed unlifted.bed

5.1.1 Using the liftOver tool

Terminal

bash


# DO NOT RUN!
# A hg19 converted lifted over file is available at "6/home/ubuntu/Desktop/Course_Materials/ChIPSeq/practicals/data/liftOver/STAT1_hg19_peaks_student.bed


#liftOver tool converts genome coordinates and genome annotation files between assemblies (reference genomes). hg38 to hg19 in this case:
#This is an example of how to convert hg38 coordinates to hg19 without re-aligning 

cd "/home/ubuntu/Course_Materials/ChIPSeq/practicals/data/liftOver"

# Example of hg38 to hg19 conversion of a STAT1 dataset (You can use TP73 if you'd like to be consistant)
liftOver SRR502363_trim_merged.bl_peaks.bed hg38ToHg19.over.chain.gz STAT1_hg19_peaks_student.bed unlifted.bed

Now run rGREAT example

R

library("circlize")
library("rGREAT")

# Toy example on randomly generated bed file
set.seed(123)
bed = circlize::generateRandomBed(nr = 1000, nc = 0)
bed[1:2, ]


# submit job (set for once every 300seconds)
job <-  submitGreatJob(bed,request_interval = 300, species="hg19", version = "3.0.0")

# list of dataframes, each one a table of single ontology
tb  <-  getEnrichmentTables(job)
names(tb)
tb[[1]][1:2, ]
job

# list available ontologies or categories

availableOntologies(job)
availableCategories(job)

availableOntologies(job, category = "GO")
res = plotRegionGeneAssociationGraphs(job)

# You can get results by either specifying the ontologies or by the pre-defined categories (categories already contains pre-defined sets of ontologies)

tb1 = getEnrichmentTables(job, ontology = c("GO Molecular Function", "Ensembl Genes"))

head(tb1)

tb2 = getEnrichmentTables(job, category = c("GO"))

head(tb2)

5.1.2 Association between genomic regions and genes

R

par(mfrow = c(1, 3))

#res is a GenomicRanges object
res = plotRegionGeneAssociationGraphs(job)

R

res[1:2, ]

By specifying ontology and term ID, you can get the association in a certain term. Here the term ID is from the first column of the data frame which is returned by getEnrichmentTables().

R

par(mfrow = c(1, 3))
res = plotRegionGeneAssociationGraphs(job, ontology = "GO Molecular Function",
    termID = "GO:0004984")

Exercise 5

Convert “TAp73beta_r1.fastq_trimmed.fastq_sorted_peaks.narrowPeak” file to a BED file or GRange object and run rGREAT. The rGREAT Vignette has worked examples.

6 Acknowledgements

Joanna A. Krupka MRC Cancer Unit, University of Cambridge, UK

Shoko Hirosue MRC Cancer Unit, University of Cambridge, UK

7 References