We’ll be using ChIP-seq and RNA-seq datasets to demonstrate how to align ChIP-seq and RNA-seq data to the GRCh38
reference genome. GRCh38/hg38 is the latest assembly of the human genome released December of 2013, that greatly expanded alternate (ALT) contigs. These alternate haplotypes includes highly variable HLA loci and represent common complex variation of the human genome. You can read more about GRCh38/hg38 here.
The dataset for this practical is a publicly available dataset downloaded from the NCBI GEO
repository with the accession GSE15780. These data were already used in the paper: “Crosstalk between c-Jun and TAp73alpha/beta contributes to the apoptosis-survival balance” by Koeppel et al. to explores genome wide binding of two transcription factors (TP53
and TP73
splice variants) and gene expression in a human cancer cell line.
BWA
alignment of ChIP-seq to GRCh38
referenceBowtie2
alignmnet of ChIP-seq to GRCh38
referenceSAM Tools
SAM Stat
STAR
alignment of RNA-seq dataA sequence mapping (alignment) is a way of arranging biological sequence (DNA or RNA), so that it reflects a given order, for example similarity to a reference genome. Reference genomes can be downloaded from UCSC, Ensembl or NCBI genome resources. Here’s the UCSC genome browser url for the human reference FASTA file, that can be used to align your sequencing reads (hg38.fa). Detailed download instructions are on that page.
Whole genomic alignments can be time consuming and not realistic to do in the short time we have. Therefore, we downloaded and preprocessed a single chromosome (chr3.fa) from UCSC to save time.
Despite significant improvement in the accuracy and completeness of the human genome, presence of repeat-rich sequences in centromeric and acrocentric short arms regions may affect mapping of short-reads datasets. The preprocessing step includes aligning to a GRCh38 genome with a sponge-database (which removes artefacts and non-chromosomal sequences) and then regenerating the chr3 fastq file. Because we’ve done this for you there is no need for you to use a sponge database again when you do the tutorial below. You can read more about this problem here: Miga K. et al. (2015)
Mapping of milions of short reads to a very large reference sequence is a challenging task. In order to accelerate short reads mapping, most of the modern alignment tools use a strategy of ‘indexing’ (think about it as indexing of a book), see: Trapnell C. and Salzberg SL (2009). Indexing is specific to an aligner and reference sequence used. All thr detailed informations about required genome indexes can be found in a software documentation.
We’ll use BWA to align a fastq ChIP-seq sample to the GRCh38 reference genome. Before running the command we need to make sure that we are in the correct directory:
cd ~/Course_Materials/Introduction/SS_DB
pwd
First, we need to create a BWA hg38chr3 index. You can access BWA docummentation here or just simply type in the terminal window:
bwa index
This command will display all the parameters that you can use when creating the reference genome (in today’s example - chromosome) index. Now we will run:
bwa index -p Reference/BWA/hg38_chr3_BWA_idx -a bwtsw Reference/BWA/hg38_chr3.fa
The command should take couple of minutes to finish, so let’s explore parameters used:
Normally you would use a complete genome build fasta (eg. hg38.fa) file to build a bwa index. In this case we’re using only chromosome 3: hg38_chr3.fa.
Once we created the reference index we can run main mapping command:
bwa mem -M -t 8 Reference/BWA/hg38_chr3_BWA_idx QC_Trimming/tp53_r2.fastq_trimmed.fastq.gz > Alignment/BWA/tp53_r2.sam
Parameters we used:
Samtools is an open source toolkit for next generation sequence data manipulation. It is patriculairly useful to modify and reformat sequence alignment files (SAM/BAM) for downstream processing. We’ll demonstrate only a few examples of samtools utilities, full documentation can be accessed here.
Generate BAM file from the alignment SAM file.
A BAM file (.bam) is the binary version of a SAM file. BAM occupies less disk space than SAM and is a default input required by some bioinformatic tools.
# Let's first look at the first 10 lines of your SAM file
head -n 10 Alignment/BWA/tp53_r2.sam
# Convert SAM to BAM
samtools view -bT Reference/BWA/hg38_chr3.fa Alignment/BWA/tp53_r2.sam > Alignment/BWA/tp53_r2.bam
# Now in order to look at the first few lines of BAM file we need to use `samtools view`
samtools view Alignment/BWA/tp53_r2.bam | head -n 10
Sort BAM file.
Many programs that use alignment files need sorted BAM files.
samtools sort Alignment/BWA/tp53_r2.bam -o Alignment/BWA/tp53_r2_sorted.bam
Generate BAM index.
This makes accessing these large files faster and again, is usually required by programs that process alingment files.
samtools index Alignment/BWA/tp53_r2_sorted.bam
Convert BAM to SAM
samtools view -h Alignment/BWA/tp53_r2_sorted.bam > Alignment/BWA/tp53_r2_sorted.sam
Filter unmapped reads in BAM
During running the alignment command, each read from the input has been assigned flags according to its properties (see: SAM flags). We can filter BAM/SAM files by including/excluding reads with certain flags. For example, in order to remove all the unmapped reads we can run:
samtools view -F 4 Alignment/BWA/tp53_r2_sorted.sam > Alignment/BWA/tp53_r2_sorted_mapped.bam
If we want all reads mapping within given genomic coordinates, eg: chr3:200000-500000, we can run:
samtools view Alignment/BWA/tp53_r2_sorted.bam chr3:200000-500000 > Alignment/BWA/tp53_r2_sorted_slice.bam
Simple statistics using SAM Tools flagstat
samtools flagstat Alignment/BWA/tp53_r2_sorted.bam
Create a FASTQ
file from a BAM
file
samtools bam2fq Alignment/BWA/tp53_r2_sorted.bam > Alignment/BWA/tp53_r2_sorted.fastq
# Look at the FASTQ file
head -n 10 Alignment/BWA/tp53_r2_sorted.fastq
In order to examine BAM/SAM files mapping quality we can run:
samstat Alignment/BWA/tp53_r2_sorted.bam
Generate a tdf (tile data format) file for viewing in IGV
browser.
igvtools count -z 5 -w 25 -e 250 Alignment/BWA/tp53_r2_sorted.bam Alignment/BWA/tp53_r2_sorted.tdf hg38
First, let’s explore bowtie2 options:
bowtie2 -h
Similairly to BWA workflow, we need to index a reference sequence before running main mapping algorithm:
bowtie2-build -f Reference/Bowtie2/hg38_chr3.fa Reference/Bowtie2/hg38_chr3_bowtie2_idx
Then, ChIP-Seq reads alignment:
bowtie2 -x Reference/Bowtie2/hg38_chr3_bowtie2_idx -U QC_Trimming/tp53_r2.fastq_trimmed.fastq.gz -S Alignment/Bowtie2/tp53_r2.sam
Now it is your turn to type a commands that first convert a SAM to BAM, then sort and finally index generated BAM file. All output files should be saved in Alignment/Bowtie2
directory.
We downloaded GTF file from GENCODE
Generate genome indices:
STAR --runThreadN 4 --runMode genomeGenerate --genomeDir Reference/STAR/ --genomeFastaFiles Reference/STAR/hg38_chr3.fa --sjdbGTFfile Reference/STAR/gencode.v31.basic.annotation.gtf --sjdbOverhang 100
Align reads to the genome:
STAR --runThreadN 4 --genomeDir Reference/STAR/ --readFilesIn RawData/RNAseq/tp53_rep1_trimmed.fastq.gz --readFilesCommand zcat --outFileNamePrefix Alignment/STAR/tp53_rep1_ --outSAMtype BAM SortedByCoordinate --sjdbGTFfile Reference/STAR/gencode.v31.basic.annotation.gtf --sjdbOverhang 100 --twopassMode Basic --outWigType bedGraph --outWigStrand Stranded
While STAR is running, the status messages will be appearing on the screen and the progress of the mapping job can be checked in the Log.progress.out
This should give you basic skills for doing next generation sequence alignment, and this is also the end of this practical!