ATAC-seq data analysis

0.1 Important Notes
0.2 Datasets
0.3 Setup Tools and Methods
0.4 Introduction
0.5 Create a reference genome,build genome index and align fastq files to genome.
0.6 Quality control
0.7 Identify nucleosome free, mononucleosome, dinucleosome, and trinucleosome regions
0.8 Heatmap
0.9 Distribution of nucleosome-free and nucleosome-bound regions
0.10 TF Footprints
0.11 Enrichment analysis of TF footprints
- 0.11.1 Sample correlations
- 0.11.2 V plots

0.1 Important Notes

There will be two types of scripts ( shell -run on linux terminal and R - run on R studio)
Code snippets that should not be run in the inetrest of time are also included. (These are there for you to follow the complete analysis)

0.2 Datasets

Reference Genome: GRCh38 from UCSC
GRCh38 blacklist:
Gene models from Bioconductor annotation package - TxDb.Hsapiens.UCSC.hg38.knownGene
We will work with the example dataset in the ATACseqQC package. I’ve also included a single ATAC-seq BAM file from the Original Nature Methods paper describing ATAC-seq published by Will Greenleaf’s lab.

(Ref: Buenrostro JD, Giresi PG, Zaba LC, Chang HY, Greenleaf WJ. Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nat Methods. 2013 Dec;10(12):1213-8. doi: 10.1038/nmeth.2688. Epub 2013 Oct 6. PubMed PMID: 24097267; PubMed Central PMCID: PMC3959825.)

SRA_ID of ATACseq_50k_Rep3 sample: SRR891270 (see in “/home/participant/Course_Materials/ChIPSeq/Materials/RawData/ATAC”vplo folder for fastq and bam files). In the interest of time and available compute power, We will work on a single chromosome in this tutorial.

0.3 Setup Tools and Methods

We will use both R/Bioconductor packages as well as other software packages for ATAC-seq analysis.

# Install only if missing

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")


BiocManager::install(c("ATACseqQC", "BiFET", "ChIPpeakAnno", "MotifDb", "GenomicAlignments",
           "BSgenome.Hsapiens.UCSC.hg19", "TxDb.Hsapiens.UCSC.hg19.knownGene",
           "phastCons100way.UCSC.hg19"))


#hg38 versions
BiocManager::install(c("BSgenome.Hsapiens.UCSC.hg38", "TxDb.Hsapiens.UCSC.hg38.knownGene", "phastCons100way.UCSC.hg38"))

0.4 Introduction

This tutorial will demonstrate the computational processing and analyse of ATAC-seq data. The quality control, artefact removal and alignment of Fastq files to a reference genome are the same as for ChIP-seq and will not be repeated.

Quality control, Artefact Removal and Alignment to a reference genome
Quality filtering, blacklist removal, shifting alignment
Fragment length and insert size detection
Peak Calling
ATAC QC
TF Footprinting

0.5 Create a reference genome,build genome index and align fastq files to genome.

type all bash code in terminal / shell and R code in RStudio

bash-script


# DON'T RUN! 
# Example of How To Build a Reference Genome

# Download reference genome from UCSC and make BWA index
rsync -avzP rsync://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/ .
ls |grep "alt" | rm -rf
gunzip *.gz

# cat * > genome.fa will lose the numerical ordering (for lexicographical ordering instead) so you get: 1, 11-19, 2, 20-22, 3-9, M, other, X, Y. Some tools want the proper ordering (1-22, X, Y, M, other), which is possible using a line of bash :

# concatenate the fasta files into a single reference fasta file
echo "$(ls chr*.fa | sort -V | grep -vP 'chr[^X|Y|\d]'; ls chr*.fa | sort -V | grep -vP 'chr[\d|X|Y]')" | xargs cat > hg38.fa

# see if the chromosoem order is preserved
grep chr hg38.fa

# make bwa index
bwa index -a bwtsw hg38.fa

bash-script

# QC and artefact removal

# Similar to previous practicals (use FASTQC and cutadapt)

bash-script


# DON'T RUN!

# align to hg38
bwa mem -M -t 40 hg38 SRR891270_1_trim.fastq.gz SRR891270_2_trim.fastq.gz > SRR891270.sam

bash-script

# DON'T RUN!
# sort and index BAM file of ATAC sample

samtools view -S -b -h  -T hg38.fa SRR891270.sam | samtools sort -@ 8 -O bam -T SRR891270.tmp -o SRR891270.bam -

bash-script

# DON'T RUN!
# Peak calling for ATAC-seq is different to ChIP-seq!


# Peaks representing Nucleosome Free Regions (NFRs)

MACS2 callpeak -t SRR891270.bam --nomodel --shift -100 --extsize 200 --format BAM -g hg38

# Optionally, for paired end data

MACS2 callpeak -t SRR891270.bam --format BAMPE  -g hg38

# For nucleosome occupancy shift and extension can centre the signal on nucleosomes (147 bp DNA is wrapped in a nucleosome)

MACS2 callpeak -t SRR891270.bam --nomodel --shift 37 --extsize 73 --format BAM -g hg38



--nomodel: don’t build shifting model
--shift: when this value is negative, ends will be moved toward 3'->5' direction
--extsize: extend reads in 5’->3’ direction to fix-sized fragments

R-script

# plot the number of mapped reads per chromosome
library("Rsubread")
library("Rsamtools")
library("ggplot2")
library("devtools")
library("magrittr")

0.6 Quality control

## load the library
library("ATACseqQC")

# input the bamFile from the ATACseqQC package 
bamfile <- system.file("extdata", "GL1.bam", package="ATACseqQC", mustWork=TRUE)

# BAM file name
bamfile.labels <- gsub(".bam", "", basename(bamfile))

0.6.1 Fragement size distribution

R-script

# generate fragement size distribution
fragSize <- fragSizeDist(bamfile, bamfile.labels)

0.6.2 Estimate library complexity

#bamQC(bamfile, outPath=NULL)
estimateLibComplexity(readsDupFreq(bamfile, index=bamfile))

0.6.3 Adjust read start position

Tn5 transposase has been shown to bind as a dimer and inserts two adaptors into accessible DNA locations separated by 9 bp2.

Therefore, for downstream analysis, such as peak-calling and footprinting, all reads in input bamfile need to be shifted. The function shiftGAlignmentsList can be used to shift the reads. By default, all reads aligning to the positive strand are offset by +4bp, and all reads aligning to the negative strand are offset by -5bp1.

The adjusted reads will be written into a new bamfile for peak calling or footprinting.

## bamfile tags to be read in
tags <- c("AS", "XN", "XM", "XO", "XG", "NM", "MD", "YS", "YT")

## files will be output into outPath
outPath <- "splited"
dir.create(outPath)

## shift the coordinates of 5'ends of alignments in the bam file
library(BSgenome.Hsapiens.UCSC.hg19)

seqlev <- "chr1" ## subsample data for quick run
which <- as(seqinfo(Hsapiens)[seqlev], "GRanges")
gal <- readBamFile(bamfile, tag=tags, which=which, asMates=TRUE)

#shift the GAlignmentsLists by 5' ends. All reads aligning to the positive strand will be offset by +4bp, and all reads aligning to the negative strand will be offset -5bp by default.
gal1 <- shiftGAlignmentsList(gal)
shiftedBamfile <- file.path(outPath, "shifted.bam")
export(gal1, shiftedBamfile)

0.6.4 Promoter/ Transcript body score

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txs <- transcripts(TxDb.Hsapiens.UCSC.hg19.knownGene)

#PT score is calculated for coverage of promoter divided by the coverage of transcripts body. PT score will show if the signal is enriched in promoters.
pt <- PTscore(gal1, txs)
plot(pt$log2meanCoverage, pt$PT_score, 
     xlab="log2 mean coverage",
     ylab="Promoter vs Transcript")

0.7 Identify nucleosome free, mononucleosome, dinucleosome, and trinucleosome regions

The shifted reads will be split into bins consisting of nucleosome free, mononucleosome, dinucleosome, and trinucleosome. Shifted reads that do not fit into any of the above bins will be discarded. Splitting reads is a time-consuming step because we are using random forest to classify the fragments based on fragment length, GC content and conservation scores.

By default, we assign the top 10% of short reads (reads below 100_bp) as nucleosome-free regions and the top 10% of intermediate length reads as (reads between 180 and 247 bp) mononucleosome. This serves as the training set to classify the rest of the fragments using random forest. The number of the tree will be set to 2 times of square root of the length of the training set.

R-script

library(phastCons100way.UCSC.hg19)
## run program for chromosome 1 only
txs <- txs[seqnames(txs) %in% "chr1"]
genome <- Hsapiens
## split the reads into NucleosomeFree, mononucleosome, 
## dinucleosome and trinucleosome.
objs <- splitGAlignmentsByCut(gal1, txs=txs, genome=genome,
                              conservation=phastCons100way.UCSC.hg19)

### Save the binned alignments into bam files.

null <- writeListOfGAlignments(objs, outPath)

## list the files generated by splitBam.
dir(outPath)

0.8 Heatmap

R-script

library(ChIPpeakAnno)
bamfiles <- file.path(outPath,
                     c("NucleosomeFree.bam",
                     "mononucleosome.bam",
                     "dinucleosome.bam",
                     "trinucleosome.bam"))
## Plot the cumulative percentage of tag allocation in nucleosome-free 
## and mononucleosome bam files.
cumulativePercentage(bamfiles[1:2], as(seqinfo(Hsapiens)["chr1"], "GRanges"))


#
TSS <- promoters(txs, upstream=0, downstream=1)
TSS <- unique(TSS)
## estimate the library size for normalization
(librarySize <- estLibSize(bamfiles))
#
## calculate the signals around TSSs.
NTILE <- 101
dws <- ups <- 1010
sigs <- enrichedFragments(gal=objs[c("NucleosomeFree", 
                                     "mononucleosome",
                                     "dinucleosome",
                                     "trinucleosome")], 
                          TSS=TSS,
                          librarySize=librarySize,
                          seqlev=seqlev,
                          TSS.filter=0.5,
                          n.tile = NTILE,
                          upstream = ups,
                          downstream = dws)

## log2 transformed signals
sigs.log2 <- lapply(sigs, function(.ele) log2(.ele+1))
#plot heatmap
featureAlignedHeatmap(sigs.log2, reCenterPeaks(TSS, width=ups+dws),
                      zeroAt=.5, n.tile=NTILE)

0.9 Distribution of nucleosome-free and nucleosome-bound regions

## get signals normalized for nucleosome-free and nucleosome-bound regions.
out <- featureAlignedDistribution(sigs, 
                                  reCenterPeaks(TSS, width=ups+dws),
                                  zeroAt=.5, n.tile=NTILE, type="l", 
                                  ylab="Averaged coverage")
## rescale the nucleosome-free and nucleosome signals to 0~1
range01 <- function(x){(x-min(x))/(max(x)-min(x))}
out <- apply(out, 2, range01)
matplot(out, type="l", xaxt="n", 
        xlab="Position (bp)", 
        ylab="Fraction of signal")
axis(1, at=seq(0, 100, by=10)+1, 
     labels=c("-1K", seq(-800, 800, by=200), "1K"), las=2)
abline(v=seq(0, 100, by=10)+1, lty=2, col="gray")

0.10 TF Footprints

ATAC-seq footprints infer factor occupancy genome-wide. The factorFootprints function uses matchPWM to predict the binding sites using the input position weight matrix (PWM). Then it calculates and plots the accumulated coverage for those binding sites to show the status of the occupancy genome-wide. Unlike CENTIPEDE4, the footprints generated here do not take the conservation (PhyloP) into consideration. factorFootprints function could also accept the binding sites as a GRanges object.

R-script

## foot prints
library(MotifDb)
CTCF <- query(MotifDb, c("CTCF"))
CTCF <- as.list(CTCF)
print(CTCF[[1]], digits=2)


sigs <- factorFootprints(shiftedBamfile, pfm=CTCF[[1]], 
                         genome=genome,
                         min.score="90%", seqlev=seqlev,
                         upstream=100, downstream=100)

0.11 Enrichment analysis of TF footprints

In order to use BiFET, the user will need an input file with two GRanges objects:

The first object GRpeaks represents a matrix of ATAC-seq or DNase-seq peaks in GRanges class where each row represents the location of each peak. In addition there must be 3 metadata columns called “reads” (representing read counts in each peak), “GC” (representing the GC content), and lastly “peaktype” which designates each peak as either (“target”,“background”,“no”).
The second object GRmotif represents footprint calls from any footprint algorithms as a GRanges class where each row represents the location of each PWM occurrence. The footprint calls in the forward strand and those in the backward strand from the same PWM are not differentiated. The row names of GRmotif are the motif IDs (e.g. MA01371 STAT1).

R-script

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("BiFET")

library(BiFET)
library(GenomicRanges)

#BiFET (Bias-free Footprint Enrichment Test) aims to identify TFs whose footprints are over-represented in target regions compared to background regions after correcting for differences in read counts and GC contents between target and background regions, where regions represent ATAC-seq or DNAse-seq peaks.

#In this example, the file input_peak_motif.Rdata was obtained as follows: we used ATAC-seq data obtained from five human PBMC (Ucar, et al., 2017) and five human islet samples (Khetan, et al., 2017) and called peaks using MACS version 2.1.0 (Zhang, et al., 2008) with parameters -nomodel -f BAMPE. The peak sets from all samples were merged to generate one consensus peak set (N = 57,108 peaks) by using R package DiffBind_2.2.5. (Ross-Innes, et al., 2012), where only the peaks present at least in any two samples were included in the analysis. We used the summits option to re-center each peak around the point of greatest enrichment and obtained consensus peaks of same width (200bp).

peak_file <- system.file("extdata", "input_peak_motif.Rdata",
                         package = "BiFET")
load(peak_file)

# Display the first few rows and columns of the peak file
head(GRpeaks)


# For each peak, GC content was obtained using peak annotation program annotatePeaks.pl from the HOMER software (Heinz. et al., 2010).


#In this example, TF footprints were called using PIQ algorithm (Sherwood, et al., 2014) using the pooled islet samples and pooled PBMC samples to increase the detection power for TF footprints. 
head(GRmotif)

#enrichment p-value
result <- calculate_enrich_p(GRpeaks, GRmotif)
head(result)

0.11.1 Sample correlations

#plot sample correlations
path <- system.file("extdata", package="ATACseqQC", mustWork=TRUE)
bamfiles <- dir(path, "*.bam$", full.name=TRUE)
gals <- lapply(bamfiles, function(bamfile){
               readBamFile(bamFile=bamfile, tag=character(0), 
                          which=GRanges("chr1", IRanges(1, 1e6)), 
                          asMates=FALSE)
         })
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txs <- transcripts(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(GenomicAlignments)
plotCorrelation(GAlignmentsList(gals), txs, seqlev="chr1")

0.11.2 V plots

vp <- vPlot(shiftedBamfile, pfm=CTCF[[1]], 
            genome=genome, min.score="90%", seqlev=seqlev,
            upstream=200, downstream=200, 
            ylim=c(30, 250), bandwidth=c(2, 1))

V-plot for the entire dataset:

ATAC-seq data analysis

July 2018

Contents

0.1 Important Notes

0.2 Datasets

0.3 Setup Tools and Methods

0.4 Introduction

0.5 Create a reference genome,build genome index and align fastq files to genome.

0.6 Quality control

0.6.1 Fragement size distribution

0.6.2 Estimate library complexity

0.6.3 Adjust read start position

0.6.4 Promoter/ Transcript body score

0.7 Identify nucleosome free, mononucleosome, dinucleosome, and trinucleosome regions

0.8 Heatmap

0.9 Distribution of nucleosome-free and nucleosome-bound regions

0.10 TF Footprints

0.11 Enrichment analysis of TF footprints

0.11.1 Sample correlations

0.11.2 V plots