Course etiquette
Please read the course etiquette, if you haven’t read that yet.
Shared document
We are using shared GoogleDocs documents for each of the main topics covered during the summer school. The document for this section can be found here.
Prerequisites
If you want to follow this tutorial using your own machine, you need to install the following R packages by running this code:
install.packages("tidyverse")
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("DiffBind")
The dataset we will use for this tutorial is not a part of the DiffBind package. You can download this sample datasets from here.
The primary aim of the DiffBind package is to identify differentially bound regions between two sample groups. It includes functions to support the processing of peak sets, including overlapping and merging peak sets, counting sequencing reads overlapping intervals in peak sets, and identifying statistically significantly differentially bound sites.
Before we start let’s load required packages:
library(DiffBind)
library(tidyverse)
library(rtracklayer)
During this tutorial you will learn how to:
DiffBind
DiffBind works primarily with peaksets, which are sets of genomic intervals representing candidate protein binding sites. Each interval consists of:
Additionally, files containing mapped sequencing reads (generally .bam files) can be associated with each peakset.
In order to run the command smoothly, let’s first set the working directory by running the following command in the console:
setwd("/home/ubuntu/Course_Materials/ChIPSeq/practicals/DiffBind")
You should now see 2 folders: peaks
and reads
and a tamoxifen.csv
, that contains all metadata for the experiment.
ls .
## peaks
## reads
## solutions.R
## tamoxifen.csv
## tamoxifen_report.csv
Peaksets are derived either from ChIP-Seq peak callers, such as MACS or using some other criterion (e.g. genomic windows, or all the promoter regions in a genome).
The easiest way is to generate a .csv
file or .xls
/.xlsx
file with one line for each peakset.
More than 1 peakset per sample
E.g. if multiple peak callers are used for comparison purposes each sample would have more than one line in the sample sheet. A merging function will generate a consensus peakset for the experiment.
Let’s read an example sample sheet into a data.frame
.
# Read a csv file
samples <- read.csv("tamoxifen.csv")
# Look at the loaded metadata
names(samples)
## [1] "SampleID" "Tissue" "Factor" "Condition" "Treatment"
## [6] "Replicate" "bamReads" "ControlID" "bamControl" "Peaks"
## [11] "PeakCaller"
samples
## SampleID Tissue Factor Condition Treatment Replicate
## 1 BT4741 BT474 ER Resistant Full-Media 1
## 2 BT4742 BT474 ER Resistant Full-Media 2
## 3 MCF71 MCF7 ER Responsive Full-Media 1
## 4 MCF72 MCF7 ER Responsive Full-Media 2
## 5 MCF73 MCF7 ER Responsive Full-Media 3
## 6 T47D1 T47D ER Responsive Full-Media 1
## 7 T47D2 T47D ER Responsive Full-Media 2
## 8 MCF7r1 MCF7 ER Resistant Full-Media 1
## 9 MCF7r2 MCF7 ER Resistant Full-Media 2
## 10 ZR751 ZR75 ER Responsive Full-Media 1
## 11 ZR752 ZR75 ER Responsive Full-Media 2
## bamReads ControlID bamControl
## 1 reads/Chr18_BT474_ER_1.bam BT474c reads/Chr18_BT474_input.bam
## 2 reads/Chr18_BT474_ER_2.bam BT474c reads/Chr18_BT474_input.bam
## 3 reads/Chr18_MCF7_ER_1.bam MCF7c reads/Chr18_MCF7_input.bam
## 4 reads/Chr18_MCF7_ER_2.bam MCF7c reads/Chr18_MCF7_input.bam
## 5 reads/Chr18_MCF7_ER_3.bam MCF7c reads/Chr18_MCF7_input.bam
## 6 reads/Chr18_T47D_ER_1.bam T47Dc reads/Chr18_T47D_input.bam
## 7 reads/Chr18_T47D_ER_2.bam T47Dc reads/Chr18_T47D_input.bam
## 8 reads/Chr18_TAMR_ER_1.bam TAMRc reads/Chr18_TAMR_input.bam
## 9 reads/Chr18_TAMR_ER_2.bam TAMRc reads/Chr18_TAMR_input.bam
## 10 reads/Chr18_ZR75_ER_1.bam ZR75c reads/Chr18_ZR75_input.bam
## 11 reads/Chr18_ZR75_ER_2.bam ZR75c reads/Chr18_ZR75_input.bam
## Peaks PeakCaller
## 1 peaks/BT474_ER_1.bed.gz bed
## 2 peaks/BT474_ER_2.bed.gz bed
## 3 peaks/MCF7_ER_1.bed.gz bed
## 4 peaks/MCF7_ER_2.bed.gz bed
## 5 peaks/MCF7_ER_3.bed.gz bed
## 6 peaks/T47D_ER_1.bed.gz bed
## 7 peaks/T47D_ER_2.bed.gz bed
## 8 peaks/TAMR_ER_1.bed.gz bed
## 9 peaks/TAMR_ER_2.bed.gz bed
## 10 peaks/ZR75_ER_1.bed.gz bed
## 11 peaks/ZR75_ER_2.bed.gz bed
A sample sheet should have a stucture that is compatible with DiffBind loading function. You can explore all the requirements by checking the documentation for dba
function:
?DiffBind::dba
The peaksets are read in using the following DiffBind function, that will construct a new DBA object from the sample sheet.
tamoxifen <- dba(sampleSheet="tamoxifen.csv")
## BT4741 BT474 ER Resistant Full-Media 1 bed
## BT4742 BT474 ER Resistant Full-Media 2 bed
## MCF71 MCF7 ER Responsive Full-Media 1 bed
## MCF72 MCF7 ER Responsive Full-Media 2 bed
## MCF73 MCF7 ER Responsive Full-Media 3 bed
## T47D1 T47D ER Responsive Full-Media 1 bed
## T47D2 T47D ER Responsive Full-Media 2 bed
## MCF7r1 MCF7 ER Resistant Full-Media 1 bed
## MCF7r2 MCF7 ER Resistant Full-Media 2 bed
## ZR751 ZR75 ER Responsive Full-Media 1 bed
## ZR752 ZR75 ER Responsive Full-Media 2 bed
A metadata of the DBA can be viewed simply by:
tamoxifen
## 11 Samples, 2845 sites in matrix (3795 total):
## ID Tissue Factor Condition Treatment Replicate Caller Intervals
## 1 BT4741 BT474 ER Resistant Full-Media 1 bed 1080
## 2 BT4742 BT474 ER Resistant Full-Media 2 bed 1122
## 3 MCF71 MCF7 ER Responsive Full-Media 1 bed 1556
## 4 MCF72 MCF7 ER Responsive Full-Media 2 bed 1046
## 5 MCF73 MCF7 ER Responsive Full-Media 3 bed 1339
## 6 T47D1 T47D ER Responsive Full-Media 1 bed 527
## 7 T47D2 T47D ER Responsive Full-Media 2 bed 373
## 8 MCF7r1 MCF7 ER Resistant Full-Media 1 bed 1438
## 9 MCF7r2 MCF7 ER Resistant Full-Media 2 bed 930
## 10 ZR751 ZR75 ER Responsive Full-Media 1 bed 2346
## 11 ZR752 ZR75 ER Responsive Full-Media 2 bed 2345
First line printed into a console shows the total number of unique peaks after merging overlapping ones (3795), and the dimensions of the default binding matrix of 11 samples by the 2845 sites that overlap in at least two of the samples. You may change the minimal number of overlaping samples by modyfying a parameter minOverlap
in dba
function while loading the peaksets.
Peaksets provide insight into the potential occupancy of the ChIPed protein at specific genomic regions. After the peaksets have been loaded you may access individual peaks in each sample by subsetting the main dba
object. For example, the following code will show you genomic location of the first 6 peaks of the first sample:
head(tamoxifen$peaks[[1]])
## V1 V2 V3 V5
## 1 chr18 215562 216063 0.01742633
## 2 chr18 311530 312105 0.06909971
## 3 chr18 356656 357315 0.02859148
## 4 chr18 371110 372092 0.03846775
## 5 chr18 395116 396464 0.47995863
## 6 chr18 399014 400382 0.56996310
It is also useful to perform some exploratory visualisation to investigate how how these occupancy maps agree with each other, eg. between experimental conditions, biological models or peak callers.
dba.plotHeatmap(tamoxifen)
We generated a correlation heatmap with initial sample clustering. The plot shows that the replicates for each cell line cluster together, however the cell lines do not cluster accodring to their sensitivity to tamoxifen treatment. It also shows that the two most highly correlated cell lines are the two MCF7-based ones, even though they respond differently to tamoxifen treatment.
Beyond quality control, the product of an occupancy analysis may be a consensus peakset, representing an overall set of candidate binding sites to be used in further analysis.
The next step is to take the alignment files (BAM) and compute count information for each of the peaks/regions in the consensus set. For each of the consensus regions DiffBind takes the number of uniquelu aligned reads to compute a normalized read count for each sample at every potential binding site. The peaks in the consensus peakset may be re-centered and trimmed based on calculating their summits (point of greatest read overlap) in order to provide more standardized peak intervals.
These reads are obtained using the dba.count function
. As this example is based on a transcription factor that binds to the DNA, resulting in “punctate”, narrow peaks, it is advisable to use the “summits” option to re-center each peak around the point of greatest enrichment. This keeps the peaks at a consistent width (in this case, with summits=250
, the peaks will be 500bp
, extending 250bp
up and downstream of the summit). The final result of counting is a binding affinity matrix containing a (normalized) read count for each sample at every potential binding site.
tamoxifen.counted <- dba.count(tamoxifen, summits=250)
## Re-centering peaks...
Take a look at the DBA object again. You should now see a column that contains the FRiP values for each sample. This is the proportion of reads for that sample that overlap a peak in the consensus peakset, and can be used to indicate which samples show more enrichment overall.
tamoxifen.counted
## 11 Samples, 2845 sites in matrix:
## ID Tissue Factor Condition Treatment Replicate Caller Intervals FRiP
## 1 BT4741 BT474 ER Resistant Full-Media 1 counts 2845 0.18
## 2 BT4742 BT474 ER Resistant Full-Media 2 counts 2845 0.17
## 3 MCF71 MCF7 ER Responsive Full-Media 1 counts 2845 0.34
## 4 MCF72 MCF7 ER Responsive Full-Media 2 counts 2845 0.21
## 5 MCF73 MCF7 ER Responsive Full-Media 3 counts 2845 0.27
## 6 T47D1 T47D ER Responsive Full-Media 1 counts 2845 0.12
## 7 T47D2 T47D ER Responsive Full-Media 2 counts 2845 0.07
## 8 MCF7r1 MCF7 ER Resistant Full-Media 1 counts 2845 0.24
## 9 MCF7r2 MCF7 ER Resistant Full-Media 2 counts 2845 0.15
## 10 ZR751 ZR75 ER Responsive Full-Media 1 counts 2845 0.35
## 11 ZR752 ZR75 ER Responsive Full-Media 2 counts 2845 0.24
With this matrix, the samples can be re-clustered using affinity, rather than occupancy data:
dba.plotHeatmap(tamoxifen.counted)
To see how well the samples cluster with one another, we can draw a PCA plot using all 2845 consensus sites.
dba.plotPCA(tamoxifen.counted, attributes=DBA_TISSUE, label=DBA_ID)
You should see that samples from the same cell line cluster together. How about their sensitivity to tamoxifen? To color samples by cell lines we used a parameter attributes=DBA_TISSUE
, which utilizes a constant variable indicating which column in the metadata file stands for the source tissue.
DBA_TISSUE
## [1] 2
samples[,DBA_TISSUE]
## [1] "BT474" "BT474" "MCF7" "MCF7" "MCF7" "T47D" "T47D" "MCF7" "MCF7"
## [10] "ZR75" "ZR75"
Challenge 1
1. Look at the documentation of the constant variables used in DiffBind package (Hint:?DBA_TISSUE
) 2. Which column contains the informations about cell lines sensitivity to tamoxifen?
3. Plot PCA coloring samples by their sensitivity to tamoxifen.
Next we have to let DiffBind know how we want to group our samples. In our case we will group based on condition. This is done using the function, as follows:
tamoxifen.counted <- dba.contrast(tamoxifen.counted, categories=DBA_CONDITION)
You can see almost the same summary table as before except that now the contrast is added at the end.
tamoxifen.counted
## 11 Samples, 2845 sites in matrix:
## ID Tissue Factor Condition Treatment Replicate Caller Intervals FRiP
## 1 BT4741 BT474 ER Resistant Full-Media 1 counts 2845 0.18
## 2 BT4742 BT474 ER Resistant Full-Media 2 counts 2845 0.17
## 3 MCF71 MCF7 ER Responsive Full-Media 1 counts 2845 0.34
## 4 MCF72 MCF7 ER Responsive Full-Media 2 counts 2845 0.21
## 5 MCF73 MCF7 ER Responsive Full-Media 3 counts 2845 0.27
## 6 T47D1 T47D ER Responsive Full-Media 1 counts 2845 0.12
## 7 T47D2 T47D ER Responsive Full-Media 2 counts 2845 0.07
## 8 MCF7r1 MCF7 ER Resistant Full-Media 1 counts 2845 0.24
## 9 MCF7r2 MCF7 ER Resistant Full-Media 2 counts 2845 0.15
## 10 ZR751 ZR75 ER Responsive Full-Media 1 counts 2845 0.35
## 11 ZR752 ZR75 ER Responsive Full-Media 2 counts 2845 0.24
##
## 1 Contrast:
## Group1 Members1 Group2 Members2
## 1 Resistant 4 Responsive 7
The main differential analysis function is invoked as follows:
# DESeq2
tamoxifen.analysed <- dba.analyze(tamoxifen.counted)
## converting counts to integer mode
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
# edgeR
tamoxifen.analysed <- dba.analyze(tamoxifen.counted, method=DBA_EDGER)
# All methods simulanously
tamoxifen.analysed <- dba.analyze(tamoxifen.counted, method=DBA_ALL_METHODS)
## converting counts to integer mode
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
The method uses existing methods created to perform differential expression analysis. The default method is DESeq2
but you can try edgeR
as well. At the end of the summary displayed you can see the amount of differentially bound sites found by each method.
dba.show(tamoxifen.analysed, bContrasts=T)
## Group1 Members1 Group2 Members2 DB.edgeR DB.DESeq2
## 1 Resistant 4 Responsive 7 326 631
This means that out of the 2845 regions DESeq2
identified 631 and edgeR 326 as significantly differentially bound using the default threshold of FDR <= 0.05
.
We can see that edgeR identifies alot fewer peaks, but it would be good to see if those peaks are a subset of the DESeq2 results. For a quick look at the overlapping peaks identified by the two different tools (DESeq2 and edgeR) we can plot a Venn diagram:
dba.plotVenn(tamoxifen.analysed,contrast=1,method=DBA_ALL_METHODS)
Challenge 2
The default threshold is padj < 0.05. 1. How many regions are differentially bound between resistant and sensitive cell lines with a more stringent threshold of 0.01? Use both methods: DESeq2 and edgeR. (Hint: useth = 0.01
).
2. Plot a Venn Diagram of overlapping peaks identified by the two different tools.
We can now plot the same type of heatmap as at the beginning of our analysis using only the differentially bound sites. This will strengthen a bit the differeneces between the conditions as expected.
dba.plotHeatmap(tamoxifen.analysed, contrast=1)
We can also display the binding affinity heatmap to see the binding patterns across the identified regions. You can control which method’s result you wish to see by setting method=DBA_EDGER
or method=DBA_DESEQ2
(default). We will use a parameter ColAttributes = DBA_CONDITION
to display only exprerimental conditions as column annotations.
dba.plotHeatmap(tamoxifen.analysed, ColAttributes = DBA_CONDITION, contrast=1, correlations=FALSE)
To further analyse the results we can use built in functions to generate MA plots
, volcano plots
, PCA plots
and boxplots
. All these functions can be called using either one of the differential binding methods (eg. DESeq2/edgeR).
MA plots are useful to visualise which data points are differentially bound. Each of these points represents a binding site and red points indicate differentailly bound ones. These points have a log fold change of at least 2.
dba.plotMA(tamoxifen.analysed)
Similarly to MA plots, Volcano plots can show the significantly differentailly bound sites and show their fold enrichment (or p-values).
dba.plotVolcano(tamoxifen.analysed)
PCA can give us different representation of how the samples are associated. We see that samples of the different conditions cluster separately. This command calculates principal components based on the differentially bound sites only.
dba.plotPCA(tamoxifen.analysed, contrast = 1)
Boxplots can give us an idea about the read distribution differences between the classes - in our case the two conditions. The first two boxes show distribution of reads over all differentially bound sites; the middle two show differences on those sites where the affinity increases in Responsive and the two boxes on the right show differences where the affinity increases in Resistant samples.
dba.plotBox(tamoxifen.analysed)
And finally we can report the differentially bound peak regions, identified by either method (DESeq2/edgeR). These results files contain the genomic coordinates for all consensus sites and statistics for differential enrichment including fold-change, p-value and FDR.
report <- dba.report(tamoxifen.analysed)
report
## GRanges object with 631 ranges and 6 metadata columns:
## seqnames ranges strand | Conc Conc_Resistant
## <Rle> <IRanges> <Rle> | <numeric> <numeric>
## 2452 chr18 64490686-64491186 * | 6.20 1.32
## 1291 chr18 34597713-34598213 * | 5.08 -0.06
## 976 chr18 26860997-26861497 * | 7.14 2.85
## 2338 chr18 60892900-60893400 * | 6.91 1.37
## 339 chr18 9785198-9785698 * | 5.03 0.59
## ... ... ... ... . ... ...
## 1862 chr18 49683288-49683788 * | 1.13 -0.86
## 1397 chr18 38061116-38061616 * | 1.36 -0.86
## 2408 chr18 62888911-62889411 * | 2.41 0.42
## 891 chr18 24305281-24305781 * | 2.92 0.97
## 857 chr18 23291930-23292430 * | 3.09 1.26
## Conc_Responsive Fold p-value FDR
## <numeric> <numeric> <numeric> <numeric>
## 2452 6.84 -5.52 7.70e-10 1.18e-06
## 1291 5.72 -5.78 8.32e-10 1.18e-06
## 976 7.77 -4.92 2.49e-09 1.79e-06
## 2338 7.55 -6.18 2.94e-09 1.79e-06
## 339 5.66 -5.06 3.14e-09 1.79e-06
## ... ... ... ... ...
## 1862 1.64 -2.50 0.0106 0.0482
## 1397 1.89 -2.76 0.0106 0.0482
## 2408 2.92 -2.50 0.0108 0.0488
## 891 3.42 -2.46 0.0108 0.0488
## 857 3.59 -2.33 0.0108 0.0488
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
The value columns are described below:
Before writing to file we need to convert it to a data frame so that genomic coordinates get written as columns and not GRanges.
report.df <- as.data.frame(report)
write.table(report.df, "tamoxifen_report.csv", sep="\t", quote=F, row.names=F)
We will look at the occupancy resistance data in more detail showing what a pure occupancy-based analysis would look like, and comparing it to the results obtained using the affinity data.
One reason to do an occupancy-based analysis is to determine what candidate sites should be used in a subsequent affinity-based analysis. In the example so far, we took all sites that were identified in peaks in at least two of the eleven peaksets, reducing the number of sites from 3795 overall to the 2845 sites used in the differential analysis. We could have used a more stringent criterion, such as only taking sites identified in five or six of the peaksets, or a less stringent one, such as including all 3795 sites.
A global overview of overlap numbers at different threshold can be obtained using the RATE mode of the dba.overlap
function as follows:
olap.rate <- dba.overlap(tamoxifen, mode=DBA_OLAP_RATE)
olap.rate
## [1] 3795 2845 1773 1388 1074 817 653 484 384 202 129
The returned data in olap.rate is a vector containing the number of peaks that appear in at least one, two, three, and so on up to all eleven peaksets.
These values can be plotted to show the overlap rate drop-off curve:
plot(olap.rate,type='b',ylab='# peaks', xlab='Overlap at least this many peaksets')
The presence of a peak in multiple peaksets is an indication that it is a “real” binding site, in the sense of being identifiable in a repeatable manner (by different biological replicates or/and different peak callig algorithms).
Let’s look at tamoxifen responsive MCF7 cell line represented by 3 replicates. The overlap rate for just the positive MCF7 samples can be isolated using a sample mask. A set of sample masks are automatically associated with a DBA
object in the $mask
field:
names(tamoxifen$masks)
## [1] "BT474" "MCF7" "T47D" "ZR75" "ER"
## [6] "Resistant" "Responsive" "Full-Media" "bed" "Replicate.1"
## [11] "Replicate.2" "Replicate.3" "All" "None"
Arbitrary masks can be generated using the dba.mask
function. In this case, a mask that isolates the MCF7 samples can be generated by combining to pre-defined masks (MCF7 and Responsive) and passed into the function:
dba.overlap(tamoxifen, tamoxifen$masks$MCF7 & tamoxifen$masks$Responsive, mode=DBA_OLAP_RATE)
## [1] 1780 1215 885
There are 885 peaks (out of 1780) identified in all three replicates. We can visualise that as a Venn Diagram:
dba.plotVenn(tamoxifen, tamoxifen$masks$MCF7 & tamoxifen$masks$Responsive)
Challenge 3
1. Check overlap between the two replicates of tamoxifen responsiveT47D
cell line.
2. Visualise the overlap as Venn Diagram.
A separate consensus peakset for each of the replicated sample types can be added to the DBA object using dba.peakset:
tamoxifen_consensus <- dba.peakset(tamoxifen, consensus=c(DBA_TISSUE,DBA_CONDITION), minOverlap=0.66)
## Add consensus: BT474 Resistant
## Add consensus: MCF7 Responsive
## Add consensus: T47D Responsive
## Add consensus: MCF7 Resistant
## Add consensus: ZR75 Responsive
This adds a new consensus peakset for each set of samples that share the same Tissue and Condition values. From this, a new DBA
object can be generated consisting of only the five consensus peaksets (the $Consensus mask filters peaksets previously formed using) :
tamoxifen_consensus <- dba(tamoxifen_consensus, mask=tamoxifen_consensus$masks$Consensus, minOverlap=1)
tamoxifen_consensus
## 5 Samples, 2666 sites in matrix:
## ID Tissue Factor Condition Treatment Replicate Caller
## 1 BT474:Resistant BT474 ER Resistant Full-Media 1-2 bed
## 2 MCF7:Responsive MCF7 ER Responsive Full-Media 1-2-3 bed
## 3 T47D:Responsive T47D ER Responsive Full-Media 1-2 bed
## 4 MCF7:Resistant MCF7 ER Resistant Full-Media 1-2 bed
## 5 ZR75:Responsive ZR75 ER Responsive Full-Media 1-2 bed
## Intervals
## 1 896
## 2 1215
## 3 318
## 4 879
## 5 1933
An overall consensus peakset, that includes peaks identified in at least two replicates of at least one sample group, can be identified:
consensus_peaks <- dba.peakset(tamoxifen_consensus, bRetrieve=TRUE)
This consensus peakset could then be used as the basis for the binding matrix used in dba.count
.
Challenge 4
1. Perform differential binding analysis usingtamoxifen
as input (Hint: usepeaks=consensus_peaks
).
2. How many differential binding sites was identified usingDESeq2
with the significance thresholdFDR < 0.01
?
3. Requiring that consensus peaks overlap in at least one third of the samples in each group, check the overlap between tamoxifen sensitive and resistant samples. Hint (useminOverlap=0.33
andconsensus=DBA_CONDITION
)
4. Plot the results as Venn Diagram.
Dora Bihary
VIB Center for Cancer Biology, University of Leuven, BE
MRC Cancer Unit, University of Cambridge, UK
Harvard Chan Bioinformatics Core
DiffBind team