Introduction to the dataset used in this part of the course
I’ll be using ChIP-seq and RNA-seq datasets to demonstrate how to align ChIP-seq and RNA-seq data to the GRCh38
reference genome.GRCh38/hg38 is the assembly of the human genome released December of 2013, that uses alternate or ALT contigs to represent common complex variation, including HLA loci.
The data-set for this practical is a publicly available dataset downloaded from the NCBI GEO
repository with the accession GSE15780. It’s a study “Crosstalk between c-Jun and TAp73alpha/beta contributes to the apoptosis-survival balance” PMID:21459846 by Koeppel et al. This study explores genome wide binding of transcription factors (TP53
and TP73
splice variants) and gene expression in a human cancer cell line.
BWA
alignment of ChIP-seq to GRCh38
referenceSAM Tools
tutorialBowtie2
alignmnet of ChIP-seq to GRCh38
referenceSTAR
alignment of RNA-seq dataWe downloaded the dataset (fastq files
) from the Sequence Read Archive using the SRA-toolkit
. There are multiple ways of doing this.
Browse the SRA dababase and download the data.
Use SRA toolkit. You need to install and configure this on your computer first. Detailed instructions are here.
The files you need are in /home/participant/Course_Materials/Introduction/SS_DB/Raw_Data/. These are Large files, so do not run this bit of R code below! It’s there just to show you how to download the files from the Sequence Read Archive.
R
print("Don't run me!!")
#setup SRAdb
# This will download a very large (~30 Gb) file!
library(SRAdb)
sqlfile <- 'SRAmetadb.sqlite'
if(!file.exists('SRAmetadb.sqlite')) sqlfile <<- getSRAdbFile()
#establish a connection to the database
sra_con <- dbConnect(SQLite(),sqlfile)
# get SRR runs
# You need to give it the Experiment ID (SRX) of the dataset you would like to download
rs = listSRAfile(c("SRX016980"), sra_con, fileType = 'sra' )
# download the SRR file
getSRAfile(c("SRR036615"), sra_con, fileType='sra')
# convert to fastq using SRA Toolkit (you needs to install the SRA-toolkit on your computer)
[](https://www.ncbi.nlm.nih.gov/sra/docs/toolkitsoft/)
system ("fastq-dump SRR036615.lite.sra")
# or get the fastq file directly from EBI using ftp
getFASTQinfo(c("SRR036615"), sra_con, srcType = 'ftp' )
getSRAfile(c("SRR036615"), sra_con, fileType = 'fastq' )
Reference genomes can be downloaded from UCSC, Ensembl or NCBI genome resources. Here’s the UCSC genome browser url for the human reference GRCh38
Download instructions are on that page.
Whole genomic alignments can be time consuming and not realistic to do in the short time we have. Therefore, we downloaded and preprocessed a single chromosome (chr3)from the above dataset to save time. The preprocessing step included aligning to a GRCh38 genome with a sponge-database (which removes artefacts and non-chromosomal sequences) and then regenerating the chr3 fastq file. Because we’ve done this for you there is no need for you to use a sponge database again when you do the tutorial below.
bash
# Using BWA to align a fastq ChIP-seq sample to the GRCh38 reference genome
# Make sure you're in this folder and run all the commands from this directory. This is important for all the relative file paths to work!
# Use the bash terminal to run these commands!
pwd
create a BWA hg38chr3 index. We will use the bwa index command.
-p option give the index name (you can call this whatever you want, we named it hg38chr3bwaidx) -a option chooses one of the indexing algorithm within bwa
Normally you would use a complete genome build fasta (hg38.fa) file to build a bwa index. In this case we’re using chromosome3 hg38_chr3.fa (again to save time).
bash
# create genome index (chromosome in this case)
bwa index -p /home/participant/Course_Materials/Introduction/SS_DB/Reference/BWA/hg38chr3bwaidx \
-a bwtsw /home/participant/Course_Materials/Introduction/SS_DB/Reference/BWA/hg38_chr3.fa
Align to hg38 and generate the single end alignment SAM file
bash
# uses mem algorithm, -M This option leaves the best (longest) alignment for a read as is
# but marks additional alignments for the read as secondary
# -t = number of processor cores
bwa mem -M -t 8 /home/participant/Course_Materials/Introduction/SS_DB/Reference/BWA/hg38chr3bwaidx \
/home/participant/Course_Materials/Introduction/SS_DB/RawData/ChIPseq/tp53_r2.fastq.gz > \
/home/participant/Course_Materials/Introduction/SS_DB/Alignment/BWA/tp53_r2.fastq.sam
SAMTools is an open source toolkit for next generation sequence data manipulation. These examples shows how to use the Samtools program.
Generate BAM file (binary version of SAM)from the alignment SAM file.
bash
# look at the first 10 lines of your SAM file
head -n 10 /home/participant/Course_Materials/Introduction/SS_DB/Alignment/tp53_r2.fastq.sam
#convert to BAM
samtools view -bT /home/participant/Course_Materials/Introduction/SS_DB/Reference/BWA/hg38_chr3.fa \
/home/participant/Course_Materials/Introduction/SS_DB/Alignment/BWA/tp53_r2.fastq.sam > \
/home/participant/Course_Materials/Introduction/SS_DB/Alignment/BWA/tp53_r2.fastq.bam
Sort BAM file. Many programs that use alignment files need sorted BAM files.
bash
samtools sort -T sorted /home/participant/Course_Materials/Introduction/SS_DB/Alignment/BWA/tp53_r2.fastq.bam -o \
/home/participant/Course_Materials/Introduction/SS_DB/Alignment/BWA/tp53_r2.fastq_sorted.bam
Generate BAM index. This makes accessing these large files faster.
bash
samtools index /home/participant/Course_Materials/Introduction/SS_DB/Alignment/BWA/tp53_r2.fastq_sorted.bam \
/home/participant/Course_Materials/Introduction/SS_DB/Alignment/BWA/tp53_r2.fastq_sorted.bai
Convert BAM to SAM
bash
#-h include header
samtools view -h /home/participant/Course_Materials/Introduction/SS_DB/Alignment/BWA/tp53_r2.fastq_sorted.bam > \
/home/participant/Course_Materials/Introduction/SS_DB/Alignment/BWA/tp53_r2.fastq_sorted_anotherCopy.sam
Filter unmapped reads in BAM
bash
#-F only include reads with none of the FLAGS in INT present
samtools view -h -F 4 /home/participant/Course_Materials/Introduction/SS_DB/Alignment/BWA/tp53_r2.fastq_sorted.bam > \
/home/participant/Course_Materials/Introduction/SS_DB/Alignment/BWA/tp53_r2.fastq_sorted_onlymapped.bam
If you need help decoding SAM flags
If we want all reads mapping within the genomic coordinates chr3:200000-500000
bash
samtools view /home/participant/Course_Materials/Introduction/SS_DB/Alignment/BWA/tp53_r2.fastq_sorted.bam chr3:200000-500000 > \
/home/participant/Course_Materials/Introduction/SS_DB/Alignment/BWA/tp53_r2.fastq_sorted_200-500k.bam
Simple statistics using SAM Tools flagstat
bash
#index the bam file first
samtools flagstat /home/participant/Course_Materials/Introduction/SS_DB/Alignment/BWA/tp53_r2.fastq_sorted.bam
Create a fastq
file from a BAM
file
bash
samtools bam2fq /home/participant/Course_Materials/Introduction/SS_DB/Alignment/BWA/tp53_r2.fastq_sorted.bam > \
/home/participant/Course_Materials/Introduction/SS_DB/Alignment/BWA/tp53_r2.new_allreads.fastq
How to do this using bedtools
bash
# Examples only. DON'T RUN these!
#single end reads
bedtools bamtofastq -i input.bam -fq output.fastq
#paired-end reads:
samtools sort -n input.bam -o input_sorted.bam # sort by read name (-n)
bedtools bamtofastq -i input_sorted.bam -fq output_r1.fastq -fq2 output_r2.fastq
Run SAMStat
to asses BAM QC
bash
samstat /home/participant/Course_Materials/Introduction/SS_DB/Alignment/BWA/tp53_r2.fastq_sorted.bam
Generate a tdf (tile data format) file for viewing in IGV
browser.
bash
igvtools count -z 5 -w 25 -e 250 \
/home/participant/Course_Materials/Introduction/SS_DB/Alignment/BWA/tp53_r2.fastq_sorted.bam \
/home/participant/Course_Materials/Introduction/SS_DB/Alignment/BWA/tp53_r2.fastq_sorted.tdf hg38
To get a list of options
bash
bowtie2 -h
First step is to build a database (index)
bash
# `bowtie2-build -f genome.fa dbname`
bowtie2-build -f /home/participant/Course_Materials/Introduction/SS_DB/Reference/Bowtie2/hg38_chr3.fa \
/home/participant/Course_Materials/Introduction/SS_DB/Reference/Bowtie2/hg38_chr3
Align to chr3
bash
bowtie2 -x /home/participant/Course_Materials/Introduction/SS_DB/Reference/Bowtie2/hg38_chr3 \
-U /home/participant/Course_Materials/Introduction/SS_DB/RawData/ChIPseq/tp53_r2.fastq.gz \
-S /home/participant/Course_Materials/Introduction/SS_DB/Alignment/Bowtie2/tp53_r2.sam
create sorted BAM file
bash
samtools view -Sb /home/participant/Course_Materials/Introduction/SS_DB/Alignment/Bowtie2/tp53_r2.sam > \
/home/participant/Course_Materials/Introduction/SS_DB/Alignment/Bowtie2/tp53_r2.bam
samtools sort /home/participant/Course_Materials/Introduction/SS_DB/Alignment/Bowtie2/tp53_r2.bam > \
/home/participant/Course_Materials/Introduction/SS_DB/Alignment/Bowtie2/tp53_r2_sorted.bam
samtools index /home/participant/Course_Materials/Introduction/SS_DB/Alignment/Bowtie2/tp53_r2_sorted.bam
Generate genome indices:
bash
STAR --runThreadN 4 --runMode genomeGenerate \
--genomeDir /home/participant/Course_Materials/Introduction/SS_DB/Reference/STAR/ \
--genomeFastaFiles /home/participant/Course_Materials/Introduction/SS_DB/Reference/STAR/hg38_chr3.fa \
--sjdbGTFfile /home/participant/Course_Materials/Introduction/SS_DB/Reference/STAR/hg38.gtf \
--sjdbOverhang 100
Download an annotation GTF file and unzip it
bash
# if your reference genome is from Ensembl get GTF file from Ensembl else get from UCSC table
# browser
# wget ftp://ftp.ensembl.org/pub/release-92/gtf/homo_sapiens/Homo_sapiens.GRCh38.92.gtf.gz
# chmod 755 Homo_sapiens.GRCh38.92.gtf.gz
# gunzip Homo_sapiens.GRCh38.92.gtf.gz
# Get gtf from UCSC table browser and name it hg38.gtf
# Instructor will demonstrate this
# GENCODE gtf (best source for gtf file)
# ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/gencode.v28.annotation.gtf.gz
bash
STAR --runThreadN 4 \
--genomeDir /home/participant/Course_Materials/Introduction/SS_DB/Reference/STAR/ \
--readFilesIn /home/participant/Course_Materials/Introduction/SS_DB/RawData/RNAseq/tp53_rep1_trimmed.fastq.gz \
--readFilesCommand zcat \
--outFileNamePrefix /home/participant/Course_Materials/Introduction/SS_DB/Alignment/STAR/star_hg38_ \
--outSAMtype BAM SortedByCoordinate \
--sjdbGTFfile /home/participant/Course_Materials/Introduction/SS_DB/Reference/STAR/hg38.gtf \
--sjdbOverhang 100 \
--twopassMode Basic \
--outWigType bedGraph \
--outWigStrand Stranded
While STAR is running, the status messages will be appearing on the screen and the progress of the mapping job can be checked in the Log.progress.out
This should give you basic skills for doing next generation sequence alignment, and this is also the end of this practical!