Downstream Analysis of ChIP-seq data

0.1 Peak Analysis
0.2 Motif detection
- 0.2.1 Get fasta for peak regions
- 0.2.2 Meme and Dreme
0.3 Motif Enrichment Analysis
- 0.3.1 MemeChIP
- 0.3.2 PScanChIP
0.4 Network Analysis
- 0.4.1 STRING
- 0.4.2 GeneMANIA
0.5 Ontology Enrichment
- 0.5.1 GREAT (Genomic Regions Enrichment of Annotations Tool)

0.1 Peak Analysis

0.1.1 Set working directory

bash

cd /home/participant/Course_Materials/ChIPSeq/Materials/

#mkdir motifs if it doesn't exist

cd motifs

0.1.2 Extract region around peak summit

# read in the peak summit file
 
setwd("/home/participant/Course_Materials/ChIPSeq/Materials/motifs")
peakfile <-read.table("/home/participant/Course_Materials/ChIPSeq/Materials/Peaks/TAp73beta_r2.fastq_trimmed.fastq_sorted_filterdup.bed")


# get a 400bp window around summit (-200 and +200)
Motif_Peaks <- data.frame((peakfile$V1), (peakfile$V2-250), (peakfile$V2+250), stringsAsFactors = FALSE)

# create bed file
options(scipen=999)

write.table(Motif_Peaks, file= "/home/participant/Course_Materials/ChIPSeq/Materials/motifs/Peaks_for_motif_detection.bed", 
            row.names=FALSE,col.names = FALSE, sep="\t", quote =FALSE)

dir()

0.1.3 ChIP peak annotation

library("ChIPpeakAnno")
library("GenomicRanges")

#library("rtracklayer")
#bed <- import("/home/participant/Course_Materials/ChIPSeq/Materials/motifs/Peaks_for_motif_detection.bed")
#gr1 <- GRanges(seqnames=bed$V1, ranges=IRanges(start=bed$V2, end=bed$V3)) 


df1<-read.table("Peaks_for_motif_detection.bed", header=FALSE)

gr1 <- GRanges(seqnames=df1$V1, ranges=IRanges(start=df1$V2, end=df1$V3))



##annotation


#source("https://bioconductor.org/biocLite.R")
#biocLite("EnsDb.Hsapiens.v86")

library("EnsDb.Hsapiens.v86")
library("ChIPpeakAnno")

## create annotation file from EnsDb or TxDb

annoData <- toGRanges(EnsDb.Hsapiens.v86, feature="gene")
annoData[1:2]

anno.gr1 <- annotatePeakInBatch(gr1, 
AnnotationData=annoData, 
output="nearestBiDirectionalPromoters",
bindingRegion=c(-2000, 500))


head(anno.gr1)

#annotate with Entrez IDs

library("org.Hs.eg.db")
anno.gr1 <- addGeneIDs(anno.gr1,
                            "org.Hs.eg.db",
                            IDs2Add = "entrez_id")
head(anno.gr1)

0.2 Motif detection

MEME-ChIP performs several motif analysis steps on a set of user provided DNA sequences in fasta format. It is especially useful for analyzing peaks from ChIP-seq experiments. MEME-ChIP can

Discover novel DNA-binding motifs (MEME and DREME)
Analyze them for similarity to known binding motifs (TOMTOM)
Visualize the arrangement of the predicted motif sites in your input sequences (SPAMO, Fimo)
Identify motif enrichment relative to a background model (Centrimo)

0.2.1 Get fasta for peak regions

bash

# If strand information is known use -s option as well
bedtools getfasta -fi /home/participant/Course_Materials/Introduction/SS_DB/Reference/hg38_chr3.fa \
-bed /home/participant/Course_Materials/ChIPSeq/Materials/motifs/Peaks_for_motif_detection.bed -fo /home/participant/Course_Materials/ChIPSeq/Materials/motifs/Tp73_peaks.fa




# find a line (3rd line in this case) and see its contents
# sed -n '3p'  filename

# delete a line (DON'T RUN unless there is a need to remove a line)
# sed -i".bak" '3d' filename

0.2.2 Meme and Dreme

bash


#look at the options in meme-chip, meme and dreme

meme-chip

meme

dreme

Meme is an EM algorithm and Dreme uses pattern matching to finds de novo motifs.

The fasta file should be 500bp for each peak, Meme looks at a 100bp window around the peak center for motifs. Repeat masking is recommended.

TOMTOM compares motifs discovered with known motifs from PWM databases.
Centrimo looks for motif enrichment.
MAST can visualize the location of the discovered motifs within a sequence.
AMA measures how strongly a motif is associated with each sequence
AME discovers subtly enriched known bindng motifs in the sequences.
FIMO detects motif locations

0.3 Motif Enrichment Analysis

Motif database for meme analysis has been setup. If it’s NOT there, follow the instructions below:

Download meme motif PWM databases onto your working folder PWM databases

bash


cd "/home/participant/Course_Materials/ChIPSeq/Materials/motifs"

wget "http://meme-suite.org/meme-software/Databases/motifs/motif_databases.12.17.tgz"
chmod 755 motif_databases.12.17.tgz

#unzip them

tar -zxvf motif_databases.12.17.tgz

0.3.1 MemeChIP

bash


# If the meme directory is missing create it
# mkdir meme

meme-chip -oc /home/participant/Course_Materials/ChIPSeq/Materials/motifs/meme/ \
-db /home/participant/Course_Materials/ChIPSeq/Materials/motif_databases/EUKARYOTE/jolma2013.meme \
-db /home/participant/Course_Materials/ChIPSeq/Materials/motif_databases/JASPAR/JASPAR_CORE_2016_vertebrates.meme \
-meme-mod zoops -meme-minw 6 -order 1 \
-meme-maxw 30 -meme-nmotifs 10 -dreme-e 0.05 -centrimo-score 5.0 -centrimo-ethresh 10.0 /home/participant/Course_Materials/ChIPSeq/Materials/motifs/Tp73_peaks.fa

or upload fasta file meme-chip

0.3.2 PScanChIP

Demonstration on web version of PScanChIP

PScanChIP website

Upload your bed file

0.4 Network Analysis

0.4.1 STRING

STRING website

STRING

0.4.2 GeneMANIA

GeneMANIA website

GeneMANIA (web app and Cytoscape module) finds connectivity between a genelist or other genes that are related to a set of input genes, using a very large set of functional association data. Association data include protein and genetic interactions, pathways, co-expression, co-localization and protein domain similarity. You can use GeneMANIA to find new members of a pathway or complex, find additional genes you may have missed in your screen or find new genes with a specific function, such as protein kinases. Your question is defined by the set of genes you input.

GeneMANIA

0.5 Ontology Enrichment

0.5.1 GREAT (Genomic Regions Enrichment of Annotations Tool)

Genes co-regulated by TFs or associated with certain epigenomic modifications may belong to common functional groups. Gene Ontology enrichment analysis, GSEA, pathway and interaction analysis helps identify these enriched biological features and their connectivity.

GREAT website

Ontology enrichment analysis may be performed on the sets of genes with peaks associated to them. ChIPpeakAnno can map peaks to genes. In this example we will consider genes with peaks within 5000bp of a gene’s TSS.

How to associate genomic regions to genes: * basalPlusExt: mode ‘Basal plus extension’. Gene regulatory domain definition: Each gene is assigned a basal regulatory domain of a minimum distance upstream and downstream of the TSS (regardless of other nearby genes). The gene regulatory domain is extended in both directions to the nearest gene’s basal domain but no more than the maximum extension in one direction. * adv_upstream: proximal extension to upstream (unit: kb) * adv_downstream: proximal extension to downstream (unit: kb) * adv_span: maximum extension (unit: kb) * twoClosest: mode ‘Two nearest genes’. Gene regulatory domain definition: Each gene is assigned a regulatory domain that extends in both directions to the nearest gene’s TSS but no more than the maximum extension in one direction. adv_twoDistance: maximum extension (unit: kb) * oneClosest: mode ‘Single nearest gene’. Gene regulatory domain definition: Each gene is assigned a regulatory domain that extends in both directions to the midpoint between the gene’s TSS and the nearest gene’s TSS but no more than the maximum extension in one direction. * adv_oneDistance: maximum extension (unit: kb)

library("rGREAT")
library("circlize")

#setwd(THE/FOLDER/WHERE/YOUR/PEAKFILES/ARE)


# If you want to use one of the peak files you generated use the LiftOver tool to conver your hg38
# peak file to hg19.

liftOver input.bed /home/participant/Course_Materials/ChIPSeq/Materials/RawData/hg38ToHg19.over.chain.gz output.bed unlifted.bed

# Toy example on randomly generated bed file
set.seed(123)
bed = circlize::generateRandomBed(nr = 1000, nc = 0)
bed[1:2, ]


# submit job (set for once every 300seconds)
job <-  submitGreatJob(bed,request_interval = 300, species="hg19", version = "default")

## job <-  submitGreatJob(bed, rule = "oneClosest", adv_oneDistance=5000, version = "3.0")


# list of dataframes, each one a table of single ontology
tb  <-  getEnrichmentTables(job)
names(tb)
tb[[1]][1:2, ]
job

# list available ontologies or categories

availableOntologies(job)
availableCategories(job)

availableOntologies(job, category = "Pathway Data")

# You can get results by either specifying the ontologies or by the pre-defined categories (categories already contains pre-defined sets of ontologies)

tb = getEnrichmentTables(job, ontology = c("GO Molecular Function", "BioCyc Pathway"))
tb = getEnrichmentTables(job, category = c("GO"))

Association between genomic regions and genes

par(mfrow = c(1, 3))
res = plotRegionGeneAssociationGraphs(job)

res is a GenomicRanges object

res[1:2, ]

By specifying ontology and term ID, you can get the association in a certain term. Here the term ID is from the first column of the data frame which is returned by getEnrichmentTables().

par(mfrow = c(1, 3))
res = plotRegionGeneAssociationGraphs(job, ontology = "GO Molecular Function",
    termID = "GO:0004984")

Alternatively:

Use liftOver tool at UCSC convert your hg38 bed file to hg19 coordinates.
Download the converted file and name it “hg19_Peaks_for_motif_detection.bed”
upload this to web version of GREAT

Downstream Analysis of ChIP-seq data

July 2018

Contents

0.1 Peak Analysis

0.1.1 Set working directory

0.1.2 Extract region around peak summit

0.1.3 ChIP peak annotation

0.2 Motif detection

0.2.1 Get fasta for peak regions

0.2.2 Meme and Dreme

0.3 Motif Enrichment Analysis

0.3.1 MemeChIP

0.3.2 PScanChIP

0.4 Network Analysis

0.4.1 STRING

0.4.2 GeneMANIA

0.5 Ontology Enrichment

0.5.1 GREAT (Genomic Regions Enrichment of Annotations Tool)