Contents

1 Introduction

In this practical, we will be going through various downstream analysis of ChiP-seq. This material was adapted from the practical material written by Shamith Samarajiwa in 2019.

1.1 Learning objectives

During this tutorial you will learn how to:

Part 1

  • Preprocess the summit files so that it is ready to be used by downstream analysis
  • Annotate peaks
  • perform functional enrichment analysis

Part 2

  • make heatmaps
  • perform motif enrichment analysis

During Part1, all the commands can be run by clicking the green arrow or using “Console” tab. During Part 2, we need to type in commands in “Terminal” tab of rstudio.

1.2 Extract regions around peak summits

Summits are the actual binding sites, we would like to make a list of regions with summits in the middle. This is good especially for motif analysis, as you expect to see the motif at the binding site.

Frist, load the summit file and have a look inside. We will use read.table to load bam file in R.

R

peakfile <-read.table("/home/ubuntu/Course_Materials/ChIPSeq/practicals/data/TAp73beta_r2.fastq_trimmed.fastq_sorted_summits.bed")
head(peakfile)
##     V1     V2     V3
## 1 chr3  51424  51425
## 2 chr3  51593  51594
## 3 chr3 111241 111242
## 4 chr3 440975 440976
## 5 chr3 441075 441076
## 6 chr3 710041 710042
##                                                                                                                                  V4
## 1 /home/DB679/Documents/articles_etc/2017/courseSeptember/course_data/GSE15780/macs2/TAp73beta_r2.fastq_trimmed.fastq_sorted_peak_1
## 2 /home/DB679/Documents/articles_etc/2017/courseSeptember/course_data/GSE15780/macs2/TAp73beta_r2.fastq_trimmed.fastq_sorted_peak_2
## 3 /home/DB679/Documents/articles_etc/2017/courseSeptember/course_data/GSE15780/macs2/TAp73beta_r2.fastq_trimmed.fastq_sorted_peak_3
## 4 /home/DB679/Documents/articles_etc/2017/courseSeptember/course_data/GSE15780/macs2/TAp73beta_r2.fastq_trimmed.fastq_sorted_peak_4
## 5 /home/DB679/Documents/articles_etc/2017/courseSeptember/course_data/GSE15780/macs2/TAp73beta_r2.fastq_trimmed.fastq_sorted_peak_5
## 6 /home/DB679/Documents/articles_etc/2017/courseSeptember/course_data/GSE15780/macs2/TAp73beta_r2.fastq_trimmed.fastq_sorted_peak_6
##         V5
## 1  8.64697
## 2  6.87494
## 3 12.35170
## 4 13.13554
## 5  9.19477
## 6  4.39078

The regions are only 1bp length.

We would like to extend these genomic regions by 200bp in both directions, and have a set of genomic regions 400bp in length with the binding site in the middle.

R

# get a 400bp window around summit (-200 and +200)
Motif_Peaks <- data.frame((peakfile$V1), (peakfile$V2-200), (peakfile$V2+200))
head(Motif_Peaks)
##   X.peakfile.V1. X.peakfile.V2...200. X.peakfile.V2...200..1
## 1           chr3                51224                  51624
## 2           chr3                51393                  51793
## 3           chr3               111041                 111441
## 4           chr3               440775                 441175
## 5           chr3               440875                 441275
## 6           chr3               709841                 710241

Now we would like to write this out into a bed file. We can use write.table function.

R

?write.table

We have to modify the arguments of write.table so that it would match bed file format.

R

write.table(Motif_Peaks, "/home/ubuntu/Course_Materials/ChIPSeq/practicals/output/Peaks_for_motif_detection_1stattempt.bed")

Let’s have a look at top 10 rows of the file.

bash

head -n 10 "/home/ubuntu/Course_Materials/ChIPSeq/practicals/output/Peaks_for_motif_detection_1stattempt.bed"
## "X.peakfile.V1." "X.peakfile.V2...200." "X.peakfile.V2...200..1"
## "1" "chr3" 51224 51624
## "2" "chr3" 51393 51793
## "3" "chr3" 111041 111441
## "4" "chr3" 440775 441175
## "5" "chr3" 440875 441275
## "6" "chr3" 709841 710241
## "7" "chr3" 854190 854590
## "8" "chr3" 963331 963731
## "9" "chr3" 984287 984687

Exercise 1

Optimize the arguments of write.table so that it would output a properly formatted bed file.

R

# create bed file
options(scipen=999) # This forces R not to use exponential notations (eg. e+10)

write.table(Motif_Peaks, 
            file= "/home/ubuntu/Course_Materials/ChIPSeq/practicals/output/Peaks_for_motif_detection.bed", 
            row.names=FALSE,
            col.names =FALSE,
            sep="\t",
            quote =FALSE)

bash

head -n 10 "/home/ubuntu/Course_Materials/ChIPSeq/practicals/output/Peaks_for_motif_detection.bed"
## chr3 51224   51624
## chr3 51393   51793
## chr3 111041  111441
## chr3 440775  441175
## chr3 440875  441275
## chr3 709841  710241
## chr3 854190  854590
## chr3 963331  963731
## chr3 984287  984687
## chr3 1069119 1069519

2 Downstream Analysis Part 1

2.1 Annotation of genomic features to peaks using ChIPseeker

We will use ChIPseeker to annotate genomic features.

R

library(ChIPseeker)
## 
## ChIPseeker v1.24.0  For help: https://guangchuangyu.github.io/software/ChIPseeker
## 
## If you use ChIPseeker in published research, please cite:
## Guangchuang Yu, Li-Gen Wang, Qing-Yu He. ChIPseeker: an R/Bioconductor package for ChIP peak annotation, comparison and visualization. Bioinformatics 2015, 31(14):2382-2383

First, let’s load peak files we just created. ChIPseeker provides readPeakFile to load the peak and store in GRanges object. GRanges object is an object for genomic locations widely used by Bioconductor tools. If you want to find out more about this object, please read GenomicRanges vignette.

R

peak <- readPeakFile("/home/ubuntu/Course_Materials/ChIPSeq/practicals/data/Peaks_for_motif_detection.bed")
peak
## GRanges object with 2962 ranges and 0 metadata columns:
##          seqnames              ranges strand
##             <Rle>           <IRanges>  <Rle>
##      [1]     chr3         51225-51624      *
##      [2]     chr3         51394-51793      *
##      [3]     chr3       111042-111441      *
##      [4]     chr3       440776-441175      *
##      [5]     chr3       440876-441275      *
##      ...      ...                 ...    ...
##   [2958]     chr3 198089364-198089763      *
##   [2959]     chr3 198103083-198103482      *
##   [2960]     chr3 198130136-198130535      *
##   [2961]     chr3 198150244-198150643      *
##   [2962]     chr3 198184891-198185290      *
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Peak Annotation is performed by annotatePeak. User can define TSS (transcription start site) region, by default TSS is defined from -3kb to +3kb. Let’s first have a look at the documentation of annotatePeak.

R

?annotatePeak

All the peak information contained in peakfile will be retained in the output of annotatePeak. The position and strand information of nearest genes are reported. The distance from peak to the TSS of its nearest gene is also reported. The genomic region of the peak is reported in annotation column. Since some annotation may overlap, ChIPseeker adopted the following priority in genomic annotation.

  • Promoter
  • 5’ UTR
  • 3’ UTR
  • Exon
  • Intron
  • Downstream
  • Intergenic

N.B. 1. Downstream is defined as the downstream of gene end. 2. ChIPseeker also provides parameter genomicAnnotationPriority for user to prioritize this hierachy.

R

library(TxDb.Hsapiens.UCSC.hg38.knownGene)
## Loading required package: GenomicFeatures
## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     Filter, Find, Map, Position, Reduce, anyDuplicated, append,
##     as.data.frame, basename, cbind, colnames, dirname, do.call,
##     duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
##     lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
##     pmin.int, rank, rbind, rownames, sapply, setdiff, sort, table,
##     tapply, union, unique, unsplit, which, which.max, which.min
## Loading required package: S4Vectors
## Loading required package: stats4
## 
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:base':
## 
##     expand.grid
## Loading required package: IRanges
## Loading required package: GenomeInfoDb
## Loading required package: GenomicRanges
## Loading required package: AnnotationDbi
## Loading required package: Biobase
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene # Annotation of transcripts (transcripts' genomic regions)


peakAnno <- annotatePeak(peak, annoDb="org.Hs.eg.db", TxDb = txdb, tssRegion = c(-3000, 3000))
## >> preparing features information...      2020-07-30 17:47:41 
## >> identifying nearest features...        2020-07-30 17:47:43 
## >> calculating distance from peak to TSS...   2020-07-30 17:47:43 
## >> assigning genomic annotation...        2020-07-30 17:47:43 
## >> adding gene annotation...          2020-07-30 17:48:05
## Loading required package: org.Hs.eg.db
## 
## 'select()' returned 1:many mapping between keys and columns
## >> assigning chromosome lengths           2020-07-30 17:48:05 
## >> done...                    2020-07-30 17:48:05

Now, let’s check peakAnno object we just made.

R

head(as.data.frame(peakAnno))
##   seqnames  start    end width strand
## 1     chr3  51225  51624   400      *
## 2     chr3  51394  51793   400      *
## 3     chr3 111042 111441   400      *
## 4     chr3 440776 441175   400      *
## 5     chr3 440876 441275   400      *
## 6     chr3 709842 710241   400      *
##                                            annotation geneChr geneStart geneEnd
## 1                                   Distal Intergenic       3     23757   24501
## 2                                   Distal Intergenic       3     23757   24501
## 3     Exon (ENST00000663345.1/101927174, exon 3 of 3)       3    109707  196721
## 4                                   Distal Intergenic       3    389445  405739
## 5                                   Distal Intergenic       3    389445  405739
## 6 Intron (ENST00000661103.1/101927215, intron 2 of 3)       3    546192  845322
##   geneLength geneStrand    geneId      transcriptId distanceToTSS
## 1        745          1 102723448 ENST00000440867.1         27468
## 2        745          1 102723448 ENST00000440867.1         27637
## 3      87015          2 101927174 ENST00000663345.1         85280
## 4      16295          1     10752 ENST00000445697.1         51331
## 5      16295          1     10752 ENST00000445697.1         51431
## 6     299131          1 101927215 ENST00000653731.1        163650
##           ENSEMBL    SYMBOL                                    GENENAME
## 1 ENSG00000223587 LINC01986 long intergenic non-protein coding RNA 1986
## 2 ENSG00000223587 LINC01986 long intergenic non-protein coding RNA 1986
## 3 ENSG00000224318  CHL1-AS2                        CHL1 antisense RNA 2
## 4 ENSG00000134121      CHL1              cell adhesion molecule L1 like
## 5 ENSG00000134121      CHL1              cell adhesion molecule L1 like
## 6 ENSG00000224957 LINC01266 long intergenic non-protein coding RNA 1266

You can see that each of the regions is assigned to a gene by the proximity and feature (annotation).

Let’s make a bar plot.

R

plotAnnoBar(peakAnno)

The distance from the peak (binding site) to the TSS of the nearest gene is calculated by annotatePeak and reported in the output. Function plotDistToTSS is used to calculate the percentage of binding sites upstream and downstream from the TSS of the nearest genes, and visualize the distribution.

R

plotDistToTSS(peakAnno,
              title="Distribution of transcription factor-binding loci\nrelative to TSS")

2.2 Functional enrichment analysis using ChIPseeker

annotatePeak function of ChIPseeker assign the nearest gene’s name to each of the genomic regions. Using the assigned gene, ChIPseeker can perform functional enrichment analysis.

Enrichment analysis is widely used to make sense of a list of genes. There are several Bioconductor packages for investigating whether the number of selected genes associated with a particular biological term is larger than expected, including DOSE for Disease Ontology, ReactomePA for reactome pathway, clusterProfiler for Gene Ontology and KEGG enrichment analysis. We will perform reactome pathway analysis as an example.

R

library(ReactomePA)
## ReactomePA v1.32.0  For help: https://guangchuangyu.github.io/ReactomePA
## 
## If you use ReactomePA in published research, please cite:
## Guangchuang Yu, Qing-Yu He. ReactomePA: an R/Bioconductor package for reactome pathway analysis and visualization. Molecular BioSystems 2016, 12(2):477-479

If you remember each of the genes are assigned to the nearest gene. We will use these genes to perform function enrichment analysis. (Make sure the method suits your biological questions)

R

head(as.data.frame(peakAnno))
##   seqnames  start    end width strand
## 1     chr3  51225  51624   400      *
## 2     chr3  51394  51793   400      *
## 3     chr3 111042 111441   400      *
## 4     chr3 440776 441175   400      *
## 5     chr3 440876 441275   400      *
## 6     chr3 709842 710241   400      *
##                                            annotation geneChr geneStart geneEnd
## 1                                   Distal Intergenic       3     23757   24501
## 2                                   Distal Intergenic       3     23757   24501
## 3     Exon (ENST00000663345.1/101927174, exon 3 of 3)       3    109707  196721
## 4                                   Distal Intergenic       3    389445  405739
## 5                                   Distal Intergenic       3    389445  405739
## 6 Intron (ENST00000661103.1/101927215, intron 2 of 3)       3    546192  845322
##   geneLength geneStrand    geneId      transcriptId distanceToTSS
## 1        745          1 102723448 ENST00000440867.1         27468
## 2        745          1 102723448 ENST00000440867.1         27637
## 3      87015          2 101927174 ENST00000663345.1         85280
## 4      16295          1     10752 ENST00000445697.1         51331
## 5      16295          1     10752 ENST00000445697.1         51431
## 6     299131          1 101927215 ENST00000653731.1        163650
##           ENSEMBL    SYMBOL                                    GENENAME
## 1 ENSG00000223587 LINC01986 long intergenic non-protein coding RNA 1986
## 2 ENSG00000223587 LINC01986 long intergenic non-protein coding RNA 1986
## 3 ENSG00000224318  CHL1-AS2                        CHL1 antisense RNA 2
## 4 ENSG00000134121      CHL1              cell adhesion molecule L1 like
## 5 ENSG00000134121      CHL1              cell adhesion molecule L1 like
## 6 ENSG00000224957 LINC01266 long intergenic non-protein coding RNA 1266

R

pathway <- enrichPathway(as.data.frame(peakAnno)$geneId)
pathway
## #
## # over-representation test
## #
## #...@organism     human 
## #...@ontology     Reactome 
## #...@keytype      ENTREZID 
## #...@gene     chr [1:833] "102723448" "101927174" "10752" "101927215" "27255" "152330" ...
## #...pvalues adjusted by 'BH' with cutoff <0.05 
## #...0 enriched terms found
## #...Citation
##   Guangchuang Yu, Qing-Yu He. ReactomePA: an R/Bioconductor package for
##   reactome pathway analysis and visualization. Molecular BioSystems
##   2016, 12(2):477-479

Unfortunately, the dataset we are using is a small modified dataset, we cannot find any enrichment here.

If the set of genes are biologically meaningful, we can make a nice plot out of it.

R

gene <- c("11171", "8243", "112464", "2194",
            "9318", "79026", "1654", "65003",
            "6240", "3476", "6238", "3836",
            "4176", "1017", "249")
yy = enrichPathway(gene, pvalueCutoff=0.05)
head(summary(yy))
## Warning in summary(yy): summary method to convert the object to data.frame is
## deprecated, please use as.data.frame instead.
##                        ID                                         Description
## R-HSA-68962   R-HSA-68962           Activation of the pre-replicative complex
## R-HSA-69242   R-HSA-69242                                             S Phase
## R-HSA-176187 R-HSA-176187 Activation of ATR in response to replication stress
##              GeneRatio   BgRatio       pvalue   p.adjust     qvalue
## R-HSA-68962       2/11  33/10654 0.0005028708 0.02510404 0.01709873
## R-HSA-69242       3/11 161/10654 0.0005112494 0.02510404 0.01709873
## R-HSA-176187      2/11  37/10654 0.0006328751 0.02510404 0.01709873
##                      geneID Count
## R-HSA-68962       4176/1017     2
## R-HSA-69242  8243/4176/1017     3
## R-HSA-176187      4176/1017     2
dotplot(yy)

3 Downstream Analysis Part 2

3.1 Normalization and Visualization using Deeptools

In this section, we are going to make a heatmap which shows the distribution of peaks around the regions specified. These regions could be regions of the genome with specific annotation (eg. TSS) or the regions which are called as peaks. One of the popular packages used for this purpose is called Deeptools. In this practical, we could like to make a heatmap of input and ChIP seq data using peak regions.

Deeptools offers a function (bamCoverage) which can convert BAM files into BigWig file with various normalization methods. In this practical, we use BigWig files of input and ChIP seq created from MACS2 callpeak bedGraph output, as they are already normalized for library sizes.

3.2 Deeptools

First, we need to prepare a intermediate file that can be used to plot heatmap. computeMatrix function takes BigWig file and bed file with the genomic regions of your interest as input, compute the values (score per genome region) needed for heatmaps and summary plots.

For example, we can run the following command.

bash

computeMatrix reference-point -S /home/ubuntu/Course_Materials/ChIPSeq/practicals/data/TAp73beta_r2.fastq_trimmed.fastq_sorted_standard_treat_pileup.bw /home/ubuntu/Course_Materials/ChIPSeq/practicals/data/TAp73beta_r2.fastq_trimmed.fastq_sorted_standard_control_lambda.bw -R /home/ubuntu/Course_Materials/ChIPSeq/practicals/data/Peaks_for_motif_detection.bed -a 2000 -b 2000 --skipZeros --sortRegions descend -o /home/ubuntu/Course_Materials/ChIPSeq/practicals/output/matrix.gz --referencePoint "center"

Exercise 2

What does each argument in the code above mean? (Hint: run computeMatrix reference-point in Terminal.)

Answer 2

  • -S: bigWig file(s) containing the scores to be plotted.
  • -R: bed file(s) containing the regions to plot.
  • -a: –afterRegionStartLength. Distance downstream of the reference-point selected
  • -b: –beforeRegionStartLength. Distance upstream of the reference-point selected.
  • –skipZeros: Whether regions with only scores of zero should be included or not.
  • –sortRegions: Whether the output file should present the regions sorted.
  • -o: File name to save the gzipped matrix file needed to plot heatmap.
  • –referencePoint: The reference point for the plotting.

The function uses the arguments as follows:

  1. The “center” of the regions (--referencePoint) specified in the BED file are taken.
  2. Values indicated with –beforeRegionStartLength (-b) and –afterRegionStartLength (-a) (=2000bp) are added.
  3. The resulting regions are split up into 50 bp bins (can be changed via (–binSize))
  4. The mean score based on the scores given in the bigWig file is calculated (the kind of score can be changed via –averageTypeBins).

Now, we can visualize the read coverages for genomic regions using the matrix we just made. We use a function called plotHeatmap.

bash

plotHeatmap --matrixFile /home/ubuntu/Course_Materials/ChIPSeq/practicals/output/matrix.gz --outFileSortedRegions /home/ubuntu/Course_Materials/ChIPSeq/practicals/output/heatmapsortedregions.bed --outFileName /home/ubuntu/Course_Materials/ChIPSeq/practicals/output/heatmap.png

Exercise 3

What does each argument in the code above mean? (Hint: run plotHeatmap in Terminal.)

Answer 3

  • –matrixFile: Matrix file from the computeMatrix tool.
  • –outFileSortedRegions: The name of bed file into which the regions are sorted (The order of genomic regions is the same as the one used in heatmap).
  • –outFileName: output file name.

plotHeatmap can sort the genomic regions using the coverage scores calculated in computeMatrix and plots heatmap. By default, the regions are sorted in the decsending order.

Let’s have a look at the sorted bed file.

bash

head -n 10 /home/ubuntu/Course_Materials/ChIPSeq/practicals/output/heatmapsortedregions.bed

You can see that the genomic regions are reordered. This order matches the order in heatmap. Let’s have a look at the heatmap.png file. (Use the tab “Files” on the right side of Rstudio.)

Exercise 4

  1. Make a heatmap with 1000bp regions (500bp upstream, 500bp downstream) with summits at their centre. Use “matrix_1.gz” as the name of the file, and output heatmap as “heatmap_1”.
  2. Change the color of the heatmap (eg. to ‘copper’) and write it out as “heatmap_2”. (Hint: Have a look at --colorMap section of plotHeatmap)
  3. Sort the genomic regions in the heatmap in ascending order and write it out as “heatmap_3”. (Hint: Have a look at --sortRegions section of plotHeatmap)

Answer 4

bash

#1 
computeMatrix reference-point -S /home/ubuntu/Course_Materials/ChIPSeq/practicals/data/TAp73beta_r2.fastq_trimmed.fastq_sorted_standard_treat_pileup.bw /home/ubuntu/Course_Materials/ChIPSeq/practicals/data/TAp73beta_r2.fastq_trimmed.fastq_sorted_standard_control_lambda.bw -R /home/ubuntu/Course_Materials/ChIPSeq/practicals/data/Peaks_for_motif_detection.bed -a 500 -b 500 --skipZeros --sortRegions descend -o /home/ubuntu/Course_Materials/ChIPSeq/practicals/output/matrix_1.gz --referencePoint "center"

plotHeatmap --matrixFile /home/ubuntu/Course_Materials/ChIPSeq/practicals/output/matrix_1.gz --outFileSortedRegions /home/ubuntu/Course_Materials/ChIPSeq/practicals/output/heatmapsortedregions.bed --outFileName /home/ubuntu/Course_Materials/ChIPSeq/practicals/output/heatmap_1.png

#2
plotHeatmap --matrixFile /home/ubuntu/Course_Materials/ChIPSeq/practicals/output/matrix.gz --outFileSortedRegions /home/ubuntu/Course_Materials/ChIPSeq/practicals/output/heatmapsortedregions.bed --outFileName /home/ubuntu/Course_Materials/ChIPSeq/practicals/output/heatmap.png --colorMap 'copper'

#3
plotHeatmap --matrixFile /home/ubuntu/Course_Materials/ChIPSeq/practicals/output/matrix.gz --outFileSortedRegions /home/ubuntu/Course_Materials/ChIPSeq/practicals/output/heatmapsortedregions.bed --outFileName /home/ubuntu/Course_Materials/ChIPSeq/practicals/output/heatmap_3.png --sortRegions ascend

3.3 Motif Analysis using MEME Suite

Let’s use MEME Suite to do the motif analysis. There are various different tools available.

MEME-ChIP performs several motif analysis steps on a set of user provided DNA sequences in fasta format. It is especially useful for analyzing peaks from ChIP-seq experiments. MEME-ChIP can

  1. discover novel DNA-binding motifs in the central regions (100 characters by default) of the input sequences (with MEME and DREME),
  2. determine which motifs are most centrally enriched (with CentriMo),
  3. analyze them for similarity to known binding motifs (with Tomtom), and
  4. automatically group significant motifs by similarity,
  5. perform a motif spacing analysis (with SpaMo), and,
  6. create a GFF file for viewing each motif’s predicted sites in a genome browser. (quoted from http://meme-suite.org/doc/meme-chip.html?man_type=web)

First, we have to convert our bed file into fasta file. We can use bedtools for this. (It is a very useful tool especially to edit bed files!)

We use a function called bedtools getfasta. First, as usual, let’s have a look what arguments the function takes.

bash

bedtools getfasta -h

You need the bed file (“Peaks_for_motif_detection.bed”) and the actual sequence of hg38 (“hg38_chr3.fa”) as input to convert genomic regions specified in bed file into the actual sequences of those regions. We will take a look what the fasta file “hg38_chr3.fa” looks like. You can download this kind of data from gencode or UCSC websites. Ideally, you use the same fasta file as you used for the upstream analysis.

bash

# If strand information is known use -s option as well
bedtools getfasta -fi "/home/ubuntu/Course_Materials/ChIPSeq/practicals/data/hg38_chr3.fa" -bed "/home/ubuntu/Course_Materials/ChIPSeq/practicals/data/Peaks_for_motif_detection.bed" -fo "/home/ubuntu/Course_Materials/ChIPSeq/practicals/output/Tp73_peaks.fa"

Now, let’s have a look at the first 10 lines of this fasta file.

bash

head -n 10 /home/ubuntu/Course_Materials/ChIPSeq/practicals/output/Tp73_peaks.fa

With this fasta file, we are ready to run meme-chip!

bash

meme-chip

First, it will detect de-novo motifs using MEME and DREME. These two use different algorhythms to detect motifs (MEME: Expectation Maximisation (EM) algorithm, DREME: pattern matching). MEME is good at finding relatively long motif, whereas DREME is better at finding short motifs.

Ideally the sequences should be all the same length, between 100 and 500 base-pairs long and centrally enriched for motifs. (Ours is 400bp, centrally enriched.)

The de-novo motifs discovered by MEME and DREME are compared against known motifs from PWM databases using TOMTOM. The motifs found are also checked by CentriMo for central enrichment (enrichment of the motif in the central 100bp regions compared to more distal regions of 400bp).

SpaMo uses each of the discovered motifs as the “primary” motif, and all the motifs in the database as potential “secondary” motifs and reports the secondary motifs whose occurrences are enriched at particular distances relative to the primary motif’s occurrences in the input sequences.

bash

# This takes a while to run (30-45 mins)

meme-chip -oc /home/ubuntu/Course_Materials/ChIPSeq/practicals/output/meme -db /home/ubuntu/Course_Materials/ChIPSeq/practicals/data/JASPAR/JASPAR_CORE_2016_vertebrates.meme /home/ubuntu/Course_Materials/ChIPSeq/practicals/data/Tp73_peaks.fa

The function above takes a while to run, so let’s have a look at the results first.

4 References