Course etiquette
Please read the course etiquette, if you haven’t read that yet.
Shared document
We are using shared GoogleDocs documents for each of the main topics covered during the summer school. The document for this section can be found here.
Docker
docker exec -ti -u 0 docker /bin/bash
Software requirements (if not using Docker)
If you want to follow this tutorial using your own machine, you need to install the following command line tools:
You can install the tools one by one, but a very convenient way to manage installed tools/packages and their dependencies is Conda. If you are new to Conda, please follow this tutorial.
Sample dataset
A dataset for this tutorial is available here. Please keep in mind that, the directories and folder structure in your machine may differ from the one we used during the course.
After successful quality checks, raw sequencing reads are ready to be aligned against the reference genome. The choice of aligner is usually a personal preference that might be directed by the availability of computational resources and running time. Therefore, we will not align any raw .fastq
files today. Instead, we will carefully go through a standard alignment workflow together.
Reference genomes can be downloaded from UCSC, Ensembl or NCBI. We downloaded the most recent version of the human genome (hg38 a.k.a GRCh38)1 Technically, this is not the most recent one. A complete gap-less human reference genome, named T2T-CHM13, has recently been sequenced. This was made possible by long-read sequencing technologies. Whether and when this new reference will be widely used in practice is unknown.
Nurk, S., Koren, S., Rhie, A. et al. The complete sequence of a human genome. bioRxiv. (2021) together with its matching GTF file with annotations from Gencode.
Mapping of millions of short reads to a very large reference sequence is a challenging task. In order to accelerate short read mapping, most of the modern alignment tools use a strategy of ‘indexing’ (think about it as index in a book). Indexing is specific to an aligner and reference sequence/annotation used. The details about the required genome indexes can be found in the software documentation. For this tutorial we’ll use STAR - a splice-aware aligner that is very commonly used to map RNA-Seq reads.
Let’s explore the most important parameters of the STAR
command to generate an indexed reference genome. Full documentation can be accessed here.
--runMode genomeGenerate
: will generate indexed reference genome
--runThreadN
: number of threads (cores)
--genomeDir
: specify the output directory
--genomeFastaFiles
: path to the reference genome FASTA file
--sjdbGTFfile
: path to the GTF file with annotations
--sjdbOverhang
: max(readlength) - 1
So a basic command for GRCh38
genome with gencode.v29
annotations for a sequencing experiment with reads 100 bp long can be:
** DO NOT RUN **
STAR --runThreadN 6 \
--runMode genomeGenerate \
--genomeDir STAR_GRCh38_gencode29 \
--genomeFastaFiles GRCh38.primary_assembly.genome.fa \
--sjdbGTFfile gencode.v29.annotation.gtf \
--sjdbOverhang 99
Once we generated a reference genome we can move to the alignment stage.
** DO NOT RUN **
STAR --runThreadN 6 \
--genomeDir STAR_GRCh38_gencode29 \
--readFilesIn data/raji_rnaseq_rep1_trimmed.fastq.gz \
--outFileNamePrefix aligned/raji_rnaseq_rep1_trimmed \
--outSAMtype BAM SortedByCoordinate \
--outSAMunmapped Within \
--outSAMattributes Standard
Option 1: run multiqc on the aligner log files
cd /home/ubuntu/Course_Materials/Introduction/practicals
multiqc -o data/multiqc_sample1_STAR/ data/sample1_STAR
Option 2: Use Samstat
cd /home/ubuntu/Course_Materials/Introduction/practicals
samstat data/sample1_STAR/Aligned.sortedByCoord.out.bam
Samtools is an open source tool kit for next generation sequence data manipulation. It is particularly useful to modify and reformat sequence alignment files (SAM/BAM) for downstream processing. We’ll demonstrate only a few examples of samtools utilities, full documentation can be accessed here.
A BAM file (.bam) is the binary version of a SAM file. BAM occupies less disk space than SAM and is a default input required by some bioinformatics tools.
# Go to the directory with STAR output
cd data/sample1_STAR
samtools view Aligned.sortedByCoord.out.bam > Aligned.sortedByCoord.out.sam
# SAM file is just a text file, so in order to view first few reads we can do:
head -n 10 Aligned.sortedByCoord.out.sam
# Can view the BAM file in one line by combining the view and head command using the pipe (|)
samtools view Aligned.sortedByCoord.out.bam | head -n 10
Many tools require sorted and indexed BAM/SAM files. In order to sort a BAM file we will use the samtools sort
command:
samtools sort Aligned.sortedByCoord.out.bam -o Aligned.sortedByCoord.out.sorted.bam
For creating a BAM index *.bai
, use the samtools index
command:
samtools index Aligned.sortedByCoord.out.sorted.bam
samtools view -bh -q 255 Aligned.sortedByCoord.out.sorted.bam > Aligned.sortedByCoord.out.sorted.unique.bam
Let’s extract reads mapping to PIK3CA, a gene that is essential for B-cell development and contributes to lymphomagenesis. We used Ensembl resources to know genomic coordinates of PIK3CA.
samtools index Aligned.sortedByCoord.out.sorted.unique.bam # we need to index this before filtering
samtools view -bh Aligned.sortedByCoord.out.sorted.unique.bam "chr3:179148114-179240093" > PIK3CA.bam
- Go to
/home/ubuntu/Course_Materials/Introduction/practicals/data/sample2_STAR/
directory.- Convert
Aligned.sortedByCoord.out.bam
toAligned.sortedByCoord.out.sam
- Compare the size of BAM and SAM file.
- How many reads
Aligned.sortedByCoord.out.bam
out of first 10 reads was mapped uniquely? Hint: mapping quality = 255 for uniquely mapped reads.
- Sort
Aligned.sortedByCoord.out.bam
usingsamtools sort
command, save the output asAligned.sortedByCoord.out.sorted.bam
- Index
Aligned.sortedByCoord.out.sorted.bam
usingsamtools index
command- Extract only uniquely mapped reads from
Aligned.sortedByCoord.out.sorted.bam
and save them asAligned.sortedByCoord.out.sorted.unique.bam
- [ADVANCED] How many reads were mapped uniquely?
- [ADVANCED] How many reads mapped uniquely to PIK3CA?
Benchmarking of the most popular short-read aligners:
Otto C, Stadler PF, Hoffmann S, Lacking alignments? The next-generation sequencing mapper segemehl revisited, Bioinformatics, Volume 30, Issue 13, 1 July 2014, Pages 1837–1843,
Dora Bihary
VIB Center for Cancer Biology, University of Leuven, BE
MRC Cancer Unit, University of Cambridge, UK
Joanna A. Krupka
MRC Cancer Unit, University of Cambridge, UK
Shoko Hirosue
MRC Cancer Unit, University of Cambridge, UK