1. A quick look at the alignment metrics with samtools

Exercise 1

  1. Check your current working directory and if necessary navigate to the Course_Materials/ directory using the command cd (change directory).

pwd - to check present wworking directory

cd ~/Course_Materials - if necessary

  1. Use the samtools flagstat command to generate alignment metrics for the sorted bam file you created in the previous practical.
    samtools flagstat bam/MCL1.DL.sorted.bam

    28281754 + 0 in total (QC-passed reads + QC-failed reads)
    2848597 + 0 secondary
    0 + 0 supplementary
    0 + 0 duplicates
    27199402 + 0 mapped (96.17% : N/A)
    0 + 0 paired in sequencing
    0 + 0 read1
    0 + 0 read2
    0 + 0 properly paired (N/A : N/A)
    0 + 0 with itself and mate mapped
    0 + 0 singletons (N/A : N/A)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)
  1. What percentage of the reads have aligned to the genome? - 96.17% of reads have been mapped to the reference genome.

2. More detailed metrics with Picard Tools

2.1 Duplication metrics

Exercise 2.1

  1. Run Picard’s MarkDuplicates tool on the sorted bam file using the following command:
java -jar picard/picard.jar MarkDuplicates \
     INPUT=bam/MCL1.DL.sorted.bam \
     OUTPUT=bam/MCL1.DL.mkdup.bam \
     METRICS_FILE=bam/MCL1.DL.mkdup_metrics.txt \
     CREATE_INDEX=true

Note: The \ at the end of each line tells the terminal that when you press Enter, you have not yet finished typing the command. You can if you wish, type the whole command on a single line, omitting the \ - The command is written across multiple lines here just to make it easier to read.

Q. What is the duplication rate for this bam file? ~55%. Note that although the column headers for Picard say “PERCENT” or “PCT” the number is in fact the decimal fraction and need to be multiplied by 100 for percent. Just an odd quirk of Picard.

2.2 Alignment metrics

Exercise 2.2

  1. Run Picard’s CollectAlignmentSummaryMetrics tool on the sorted bam providing the following options.
    • INPUT - The sorted bam file
    • OUTPUT - “bam/MCL1.DL.alignment_metrics.txt”
    • REFERENCE_SEQUENCE - “references/Mus_musculus.GRCm38.dna.primary_assembly.fa”
java -jar picard/picard.jar CollectAlignmentSummaryMetrics \
     INPUT=bam/MCL1.DL.sorted.bam \
     OUTPUT=bam/MCL1.DL.alignment_metrics.txt \
     REFERENCE_SEQUENCE=references/Mus_musculus.GRCm38.dna.primary_assembly.fa

Q. What is the mismatch rate? This is the fraction of bases in mapped reads that do not match the base at that position in the reference genome. This is a combination of the number of SNPs and sequencing errors. - 0.2308 %

2.3 RNA alignment metrics

Exercise 2.3

  1. Run Picard’s CollectRnaSeqMetrics tool on the sorted bam file providing the following options:
    • INPUT - The sorted bam file
    • OUTPUT - “bam/MCL1.DL.RNA_metrics.txt”
    • REF_FLAT - the RefFlat reference file
    • STRAND - “NONE”
java -jar picard/picard.jar CollectRnaSeqMetrics \
     INPUT=bam/MCL1.DL.sorted.bam \
     REF_FLAT=references/Mus_musculus.GRCm38.97.txt \
     OUTPUT=bam/MCL1.DL.RNA_metrics.txt \
     STRAND=NONE

The results of this analysis are best viewed graphically, we will do this in the next exercise.

3. Visualising QC results with MultiQC

Exercise 3.1

  1. Move into the bam directory and run multiqc: multiqc . - the . specifies “the current directory”.
cd bam
multiqc .
  1. Open the html report that was generated by multiqc and inspect the QC plots
xdg-open mutliqc_report.html

Exercise 3.2

In the metrics directory you should find Picard metrics for all of the bam files.
1. Run multiqc in the metrics directory.
Assuming that you are still in the bam directory:

cd .. # to move back up to the Course_Materials directory
cd metrics
multiqc .
  1. Open the html report that was generated by multiqc and inspect the QC plots

Q. Are there any bam files that look problematic?

From the Alignment Summary plot we can see that the sample MCL1.DM that has only 9.8 million reads. This is not going to be sufficient for a good quality analysis and we should exclude this sample.

From the Gene Coverage plot we can see that there is one sample MCL1.LG that exhibits a pattern of coverage suggesting that the mRNA is degraded.