Introduction to Bulk RNAseq data analysis

More detailed metrics with Picard Tools

Duplication metrics

Exercise 1

Run Picard’s MarkDuplicates tool on the sorted bam file using the following command:
java -jar picard/picard.jar MarkDuplicates \
         INPUT=salmon_qc_demo/SRR7657883/SRR7657883.salmon.sorted.bam \
         OUTPUT=salmon_qc_demo/SRR7657883/SRR7657883.salmon.mkdup.bam \
         METRICS_FILE=salmon_qc_demo/SRR7657883/SRR7657883.mkdup_metrics.txt \
         CREATE_INDEX=true \
         VALIDATION_STRINGENCY=SILENT
\(\Rightarrow\) salmon_qc_demo/SRR7657883/SRR7657883.salmon.mkdup.bam - The new bam file with duplicated marked
\(\Rightarrow\) salmon_qc_demo/SRR7657883/SRR7657883.salmon.mkdup.bai - The index for the new bam file
\(\Rightarrow\) salmon_qc_demo/SRR7657883/SRR7657883.salmon.mkdup_metrics.txt - The duplication metrics

Note: The \ at the end of each line tells the terminal that when you press Enter, you have not yet finished typing the command. You can if you wish, type the whole command on a single line, omitting the \ - The command is written across multiple lines here just to make it easier to read.

Q. What is the duplication rate for this bam file? You’ll need to look at the metrics file. The easiest way is to open in a spreadsheet. On the course machines we have LibreOffice Calc. You can find this in the launcher bar at the bottom or side of the desktop, e.g.:

The duplication rate reported ~5%.

Note that although the column headers for Picard say “PERCENT” or “PCT” the number is in fact the decimal fraction and need to be multiplied by 100 for percent. Just an odd quirk of Picard.

RNA alignment metrics

Generate the refFlat file

./scripts/create_refflat_from_sam_header.py \
  -b salmon_qc_demo/SRR7657883/SRR7657883.salmon.sorted.bam \
  -o references/GRCm38_transriptome_refFlat.txt

\(\Rightarrow\) references/GRCm38_transriptome_refFlat.txt

Exercise 2

Run Picard’s CollectRnaSeqMetrics tool on the sorted bam file providing the following options:

INPUT - The sorted bam file

OUTPUT - salmon_qc_demo/SRR7657883/SRR7657883.salmon.RNA_metrics.txt

REF_FLAT - the RefFlat reference file

STRAND - NONE

java -jar picard/picard.jar CollectRnaSeqMetrics \
     INPUT=salmon_qc_demo/SRR7657883/SRR7657883.salmon.sorted.bam \
     OUTPUT=salmon_qc_demo/SRR7657883/SRR7657883.salmon.RNA_metrics.txt \
     REF_FLAT=references/GRCm38_transriptome_refFlat.txt \
     STRAND=NONE \
     VALIDATION_STRINGENCY=SILENT

\(\Rightarrow\) bam/SRR7657883.chr14.RNA_metrics.txt - The RNAseq metrics

The results of this analysis are best viewed graphically, we will do this in the next exercise.

3. Visualising QC results with MultiQC

Exercise 3

Run multiqc on the bam directory:
multiqc \
    -n Salmon_QC_Report.html \
    -o salmon_qc_demo \
    salmon_qc_demo
-n - a name for the report

-o - the directory in which to place the report

Open the html report that was generated by multiqc and inspect the QC plots The easiest way to do this is type xdg-open salmon_qc_demo/Salmon_QC_Report.html, which will open the report in a web browser.

Exercise 4

In the salmon directory you should find Salmon outputs, duplication metrics and RNAseq metrics for all of the samples from the study.

Run multiqc on the contents of the salmon directory.

multiqc -z -n Salmon_QC_Report.html -o salmon salmon

\(\Rightarrow\) salmon/Salmon_QC_Report.html

Open the html report that was generated by multiqc and inspect the QC plots

Q. Are there any bam files that look problematic?

SRR7657893 has low alignment rate, an insert size profile that is skewed to left with a median at ~180 bp and a transcript coverage profile that shows a strong 3’ bias. This suggests that the RNA in the this sample has been degraded. NOTE: This sample is not real - we have mocked up the metrics files for the purpose of illustrating a poor quality data set.