Checking the quality of the aligned data

Once we have aligned our reads, it is then possible to run some additional quality checks on our data. In the first instance we will just look at some basic metrics such as the percentage of aligned reads and the duplication rates. We can then use some more sophisticated methods to assess the integrity of our RNA and genomic locations to which our reads are aligning.

1. A quick look at the alignment metrics with samtools

We previously used the samtools package to sort and index our bam file. samtools also includes some simple tools to generate alignment metrics. This can be very handy for a quick first look at our alignment quality. The tool we are going to use is called flagstat

The Usage is:

samtools flagstat <in.bam>

Where <in.bam> is the bam file we wish to QC.

Exercise 1

  1. Check your current working directory and if necessary navigate to the Course_Materials/ directory using the command cd (change directory).
  2. Use the samtools flagstat command to generate alignment metrics for the sorted bam file you created in the previous practical.
  1. What percentage of the reads have aligned to the genome?

2. More detailed metrics with Picard Tools

Picard Tools is a suite of tools for analysing and manipulating sequencing data. It is maintained by the Broad Institute and comprises 88 different tools for doing jobs such as generating QC metrics, modifying bam files in various ways, or converting files between different formats.

We are going to use three QC metrics tools to get a variety of different important metrics.

Picard is java based programme and so to run a particular Picard tool the general format for the command is:

java -jar picard/picard.jar PicardToolName OPTION1=value1 OPTION2=value2...
  • picard.jar is the Java ARchive file that contains all the tools. It can be found in the picard directory under Course_Materials.
  • PicardToolName is the name of the tools that you wish to use.

This is then followed by a series of options/arguments for that particular tool. Each tool has specific options and arguments and you can check these most easily by going to the Picard website, or by using the `–help`` command:

java -jar picard/picard.jar PicardToolName --help

2.1 Duplication metrics

The first tool we are going to use is the MarkDuplicates tools. This tool actually performs two tasks.

First, Picard reads through the bam file and finds any duplicate reads - these are reads with the same 5’ position. For each group of duplicate reads, Picard selects a “primary” read based on the base quality scores and then marks all other reads in the group as “duplicates” by adding 1024 to the sam flag.

Once the duplicate reads are marked, Picard then also generates a metrics file that contains information about the duplication rate.

Exercise 2.1

  1. Run Picard’s MarkDuplicates tool on the sorted bam file using the following command:

    java -jar picard/picard.jar MarkDuplicates \
         INPUT=bam/MCL1.DL.sorted.bam \
         OUTPUT=bam/MCL1.DL.mkdup.bam \
         METRICS_FILE=bam/MCL1.DL.mkdup_metrics.txt \
         CREATE_INDEX=true

    Note: The \ at the end of each line tells the terminal that when you press Enter, you have not yet finished typing the command. You can if you wish, type the whole command on a single line, omitting the \. The command is written across multiple lines here just to make it easier to read.

Q. What is the duplication rate for this bam file?

2.2 Alignment metrics

Next we will collect some detailed alignment metrics using the CollectAlignmentSummaryMetrics tool. In this case we need to provide an input bam, the name of the output metrics file and the fasta reference.

Exercise 2.2

  1. Run Picard’s CollectAlignmentSummaryMetrics tool on the sorted bam providing the following options.
    • INPUT - The sorted bam file
    • OUTPUT - bam/MCL1.DL.alignment_metrics.txt
    • REFERENCE_SEQUENCE - references/Mus_musculus.GRCm38.dna.primary_assembly.fa

Q. What is the mismatch rate? This is the fraction of bases in mapped reads that do not match the base at that position in the reference genome. This is a combination of the number of SNPs and sequencing errors.

2.3 RNA alignment metrics

The CollectRnaSeqMetrics tool produces metrics describing the distribution of the reads across different genomics locations - intronic, exonic, intergenic, UTR - and the distribution of bases within the transcripts.

The CollectRnaSeqMetrics requires four pieces of information to run:

  1. The input bam file
  2. An file name for the output metrics file
  3. A file containing gene annotations in a format called RefFlat that is defined here. This format can be generated from a gtf file. We’ve already generated this file for you at references/Mus_musculus.GRCm38.97.txt
  4. A parameter for strand specificity. Strand specificity is factor of the library prep chemistry. There are three possible options:
    a) Forward stranded library prep, i.e. the reads are on the transcription strand - FIRST_READ_TRANSCRIPTION_STRAND
    b) Reverse stranded library prep, i.e. the reads are on the reverse strand - SECOND_READ_TRANSCRIPTION_STRAND
    c) Unstranded library prep, i.e. the reads may be on either strand - NONE
    With your own data you would need to find this information out from whoever has prepared the library. In this case our library prep is unstranded, so we will use NONE.

Exercise 2.3

  1. Run Picard’s CollectRnaSeqMetrics tool on the sorted bam file providing the following options:
    • INPUT - The sorted bam file
    • OUTPUT - bam/MCL1.DL.RNA_metrics.txt
    • REF_FLAT - the RefFlat reference file
    • STRAND - NONE

The results of this analysis are best viewed graphically, we will do this in the next exercise.

3. Visualising QC results with MultiQC

MultiQC is a tool for collating multiple QC results files into a single report. It’s use is simple, you just run the command multiqc in the directory containing your metrics files.

Exercise 3.1

  1. Move into the bam directory and run multiqc: multiqc . - the . specifies “the current directory”.
  2. Open the html report that was generated by multiqc and inspect the QC plots The easiest way to do this is type xdg-open multiqc_report.html, which will oprn the report in a web browser.

Exercise 3.2

In the metrics directory you should find Picard metrics for all of the bam files.

  1. Run multiqc in the metrics directory.
  2. Open the html report that was generated by multiqc and inspect the QC plots

Q. Are there any bam files that look problematic?