1 Just a reminder

Course etiquette
Please read the course etiquette, if you haven’t read that yet.

Shared document
We are using shared GoogleDocs documents for each of the main topics covered during the summer school. The document for this section can be found here.

Docker

docker exec -ti -u 0 docker /bin/bash 

Software requirements (if not using Docker)
If you want to follow this tutorial using your own machine, you need to install the following command line tools:

You can install the tools one by one, but a very convenient way to manage installed tools/packages and their dependencies is Conda. If you are new to Conda, please follow this tutorial.

Sample dataset
A dataset for this tutorial is available here. Please keep in mind that, the directories and folder structure in your machine may differ from the one we used during the course.

2 Introduction

After successful quality checks, raw sequencing reads are ready to be aligned against the reference genome. The choice of aligner is usually a personal preference that might be directed by the availability of computational resources and running time. Therefore, we will not align any raw .fastq files today. Instead, we will carefully go through a standard alignment workflow together.

3 Preparing a Reference Genome

Reference genomes can be downloaded from UCSC, Ensembl or NCBI. We downloaded the most recent version of the human genome (hg38 a.k.a GRCh38)1 Technically, this is not the most recent one. A complete gap-less human reference genome, named T2T-CHM13, has recently been sequenced. This was made possible by long-read sequencing technologies. Whether and when this new reference will be widely used in practice is unknown.
Nurk, S., Koren, S., Rhie, A. et al. The complete sequence of a human genome. bioRxiv. (2021)
together with its matching GTF file with annotations from Gencode.

3.1 Reference genome index

Mapping of millions of short reads to a very large reference sequence is a challenging task. In order to accelerate short read mapping, most of the modern alignment tools use a strategy of ‘indexing’ (think about it as index in a book). Indexing is specific to an aligner and reference sequence/annotation used. The details about the required genome indexes can be found in the software documentation. For this tutorial we’ll use STAR - a splice-aware aligner that is very commonly used to map RNA-Seq reads.

Let’s explore the most important parameters of the STAR command to generate an indexed reference genome. Full documentation can be accessed here.

--runMode genomeGenerate : will generate indexed reference genome
--runThreadN : number of threads (cores)
--genomeDir : specify the output directory
--genomeFastaFiles : path to the reference genome FASTA file
--sjdbGTFfile: path to the GTF file with annotations
--sjdbOverhang: max(readlength) - 1

So a basic command for GRCh38 genome with gencode.v29 annotations for a sequencing experiment with reads 100 bp long can be:

** DO NOT RUN **
STAR --runThreadN 6 \
--runMode genomeGenerate \
--genomeDir STAR_GRCh38_gencode29 \
--genomeFastaFiles GRCh38.primary_assembly.genome.fa \
--sjdbGTFfile gencode.v29.annotation.gtf \
--sjdbOverhang 99  

3.2 Alignment to the reference genome

Once we generated a reference genome we can move to the alignment stage.

** DO NOT RUN **
STAR --runThreadN 6 \
--genomeDir STAR_GRCh38_gencode29 \
--readFilesIn data/raji_rnaseq_rep1_trimmed.fastq.gz \
--outFileNamePrefix aligned/raji_rnaseq_rep1_trimmed \
--outSAMtype BAM SortedByCoordinate \
--outSAMunmapped Within \
--outSAMattributes Standard   

4 Quality check of aligned reads

Option 1: run multiqc on the aligner log files

cd /home/ubuntu/Course_Materials/Introduction/practicals
multiqc -o data/multiqc_sample1_STAR/ data/sample1_STAR 

Option 2: Use Samstat

cd /home/ubuntu/Course_Materials/Introduction/practicals
samstat data/sample1_STAR/Aligned.sortedByCoord.out.bam

5 Manipulating aligned sequences

Samtools is an open source tool kit for next generation sequence data manipulation. It is particularly useful to modify and reformat sequence alignment files (SAM/BAM) for downstream processing. We’ll demonstrate only a few examples of samtools utilities, full documentation can be accessed here.

5.1 BAM to SAM

A BAM file (.bam) is the binary version of a SAM file. BAM occupies less disk space than SAM and is a default input required by some bioinformatics tools.

# Go to the directory with STAR output
cd data/sample1_STAR

samtools view Aligned.sortedByCoord.out.bam > Aligned.sortedByCoord.out.sam

# SAM file is just a text file, so in order to view first few reads we can do:
head -n 10 Aligned.sortedByCoord.out.sam  

# Can view the BAM file in one line by combining the view and head command using the pipe (|)
samtools view Aligned.sortedByCoord.out.bam | head -n 10

5.2 Sorting a BAM file

Many tools require sorted and indexed BAM/SAM files. In order to sort a BAM file we will use the samtools sort command:

samtools sort Aligned.sortedByCoord.out.bam -o Aligned.sortedByCoord.out.sorted.bam

For creating a BAM index *.bai, use the samtools index command:

samtools index Aligned.sortedByCoord.out.sorted.bam

5.3 Filter a BAM file to contain only uniquely mapped reads

samtools view -bh -q 255 Aligned.sortedByCoord.out.sorted.bam > Aligned.sortedByCoord.out.sorted.unique.bam

5.4 Filter a BAM files to contain reads mapping to a specific region

Let’s extract reads mapping to PIK3CA, a gene that is essential for B-cell development and contributes to lymphomagenesis. We used Ensembl resources to know genomic coordinates of PIK3CA.

samtools index Aligned.sortedByCoord.out.sorted.unique.bam # we need to index this before filtering
samtools view -bh Aligned.sortedByCoord.out.sorted.unique.bam "chr3:179148114-179240093" > PIK3CA.bam

5.5 Exercise 1

  1. Go to /home/ubuntu/Course_Materials/Introduction/practicals/data/sample2_STAR/ directory.
  2. Convert Aligned.sortedByCoord.out.bam to Aligned.sortedByCoord.out.sam
  3. Compare the size of BAM and SAM file.
  4. How many reads Aligned.sortedByCoord.out.bam out of first 10 reads was mapped uniquely? Hint: mapping quality = 255 for uniquely mapped reads.
  5. Sort Aligned.sortedByCoord.out.bam using samtools sort command, save the output as Aligned.sortedByCoord.out.sorted.bam
  6. Index Aligned.sortedByCoord.out.sorted.bam using samtools index command
  7. Extract only uniquely mapped reads from Aligned.sortedByCoord.out.sorted.bam and save them as Aligned.sortedByCoord.out.sorted.unique.bam
  8. [ADVANCED] How many reads were mapped uniquely?
  9. [ADVANCED] How many reads mapped uniquely to PIK3CA?

7 Acknowledgements

Dora Bihary
VIB Center for Cancer Biology, University of Leuven, BE
MRC Cancer Unit, University of Cambridge, UK

Joanna A. Krupka MRC Cancer Unit, University of Cambridge, UK
Shoko Hirosue MRC Cancer Unit, University of Cambridge, UK

Harvard Chan Bioinformatics Core