Once we have aligned our reads, it is then possible to run some additional quality checks on our data. In the first instance we will just look at some basic metrics such as the percentage of aligned reads and the duplication rates. We can then use some more sophisticated methods to assess the integrity of our RNA and genomic locations to which our reads are aligning.
samtools
We previously used the samtools
package to sort and index our bam file. samtools
also includes some simple tools to generate alignment metrics. This can be very handy for a quick first look at our alignment quality. The tool we are going to use is called flagstat
The Usage is:
samtools flagstat <in.bam>
Where <in.bam>
is the bam file we wish to QC.
- Check your current working directory and if necessary navigate to the
Course_Materials/
directory using the commandcd
(change directory).- Use the
samtools flagstat
command to generate alignment metrics for the sorted bam file you created in the previous practical.
- What percentage of the reads have aligned to the genome?
Picard Tools is a suite of tools for analysing and manipulating sequencing data. It is maintained by the Broad Institute and comprises 88 different tools for doing jobs such as generating QC metrics, modifying bam files in various ways, or converting files between different formats.
We are going to use three QC metrics tools to get a variety of different important metrics.
Picard is java based programme and so to run a particular Picard tool the general format for the command is:
java -jar picard/picard.jar PicardToolName OPTION1=value1 OPTION2=value2...
picard.jar
is the Java ARchive file that contains all the tools. It can be found in the picard directory under Course_Materials.PicardToolName
is the name of the tools that you wish to use.This is then followed by a series of options/arguments for that particular tool. Each tool has specific options and arguments and you can check these most easily by going to the Picard website, or by using the `–help`` command:
java -jar picard/picard.jar PicardToolName --help
The first tool we are going to use is the MarkDuplicates
tools. This tool actually performs two tasks.
First, Picard reads through the bam file and finds any duplicate reads - these are reads with the same 5’ position. For each group of duplicate reads, Picard selects a “primary” read based on the base quality scores and then marks all other reads in the group as “duplicates” by adding 1024 to the sam flag.
Once the duplicate reads are marked, Picard then also generates a metrics file that contains information about the duplication rate.
Run Picard’s MarkDuplicates tool on the sorted bam file using the following command:
java -jar picard/picard.jar MarkDuplicates \ INPUT=bam/MCL1.DL.sorted.bam \ OUTPUT=bam/MCL1.DL.mkdup.bam \ METRICS_FILE=bam/MCL1.DL.mkdup_metrics.txt \ CREATE_INDEX=true
Note: The
\
at the end of each line tells the terminal that when you pressEnter
, you have not yet finished typing the command. You can if you wish, type the whole command on a single line, omitting the\
. The command is written across multiple lines here just to make it easier to read.Q. What is the duplication rate for this bam file?
Next we will collect some detailed alignment metrics using the CollectAlignmentSummaryMetrics
tool. In this case we need to provide an input bam, the name of the output metrics file and the fasta reference.
- Run Picard’s
CollectAlignmentSummaryMetrics
tool on the sorted bam providing the following options.
- INPUT - The sorted bam file
- OUTPUT - bam/MCL1.DL.alignment_metrics.txt
- REFERENCE_SEQUENCE - references/Mus_musculus.GRCm38.dna.primary_assembly.fa
Q. What is the mismatch rate? This is the fraction of bases in mapped reads that do not match the base at that position in the reference genome. This is a combination of the number of SNPs and sequencing errors.
The CollectRnaSeqMetrics
tool produces metrics describing the distribution of the reads across different genomics locations - intronic, exonic, intergenic, UTR - and the distribution of bases within the transcripts.
The CollectRnaSeqMetrics
requires four pieces of information to run:
references/Mus_musculus.GRCm38.97.txt
NONE
.
- Run Picard’s
CollectRnaSeqMetrics
tool on the sorted bam file providing the following options:
- INPUT - The sorted bam file
- OUTPUT - bam/MCL1.DL.RNA_metrics.txt
- REF_FLAT - the RefFlat reference file
- STRAND - NONE
The results of this analysis are best viewed graphically, we will do this in the next exercise.
MultiQC is a tool for collating multiple QC results files into a single report. It’s use is simple, you just run the command multiqc
in the directory containing your metrics files.
- Move into the bam directory and run multiqc:
multiqc .
- the.
specifies “the current directory”.- Open the html report that was generated by multiqc and inspect the QC plots The easiest way to do this is type
xdg-open multiqc_report.html
, which will oprn the report in a web browser.
In the
metrics
directory you should find Picard metrics for all of the bam files.
- Run multiqc in the metrics directory.
- Open the html report that was generated by multiqc and inspect the QC plots
Q. Are there any bam files that look problematic?