November 2021

Differential Gene Expression Analysis Workflow


Fastq file format

Fastq file format - Headers

Fastq file format - Sequences

Fastq file format - Third line

Fastq file format - Quality Scores
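The four parts named above (header, sequence, third line, quality scores) can be sketched as a small reader. This is a minimal sketch, not a full parser: it assumes each record spans exactly four lines, that the header starts with `@` and the third line with `+`, and it uses a made-up record (`read1`) for illustration.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    header: str     # line 1: starts with '@', holds the read identifier
    sequence: str   # line 2: the called bases
    separator: str  # line 3: starts with '+'
    quality: str    # line 4: one ASCII character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\r\n")
        if not header:  # end of file (or a blank line) stops the reader
            return
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"Malformed record near: {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"Sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header, sequence, separator, quality)


# Hypothetical record, for illustration only
example = "@read1\nGATTACA\n+\nIIIIII#\n"
records = list(read_fastq(io.StringIO(example)))
print(records[0].sequence, records[0].quality)
```

The length check reflects the fact that every base has exactly one quality character.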

(Phred) Quality Scores

Sequence quality scores are p-values that have been transformed (to Q values) and translated (to ASCII characters)

  • Sequence bases are derived after image processing (base calling)
    • This process is probabilistic in nature
    • Each base in the sequence has a p-value associated with it
    • p-values range from 0 to 1 (e.g. 0.05, 0.01, 1e-30)
    • A p-value of 0.01 means a 1 in 100 chance that the called base is wrong

(Phred) Quality Scores …

How do we assign p-values to bases in a fastq file?

  • A base is one letter, but p-values can be long (e.g. 0.000005)
  • Transform to Phred quality scores Q
  • \(Q = -10 \log_{10} P\) (e.g. P = 0.01 gives Q = 20, P = 0.001 gives Q = 30)
  • Translate Q values to ASCII characters using an offset of 33 (Q value of 0 = !, Q value of 30 = ?)
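The transform and translate steps can be sketched in a few lines of Python. This is a minimal sketch assuming the common Phred+33 encoding (ASCII code = Q + 33). The function names are made up for illustration and do not come from any library.

```python
import math

PHRED_OFFSET = 33  # assumption: Phred+33 encoding


def error_prob_to_phred(p: float) -> int:
    """Transform an error probability P into a Phred score Q = -10 * log10(P)."""
    return round(-10 * math.log10(p))


def phred_to_char(q: int) -> str:
    """Translate a Q value into its single ASCII character."""
    return chr(q + PHRED_OFFSET)


def char_to_phred(c: str) -> int:
    """Recover the Q value from a quality character."""
    return ord(c) - PHRED_OFFSET


def phred_to_error_prob(q: int) -> float:
    """Recover the error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


for p in (0.01, 0.001):
    q = error_prob_to_phred(p)
    print(f"P = {p} -> Q = {q} -> '{phred_to_char(q)}'")

# Decode a hypothetical quality string back into Q values
quality_string = "II?5!"
decoded = [char_to_phred(c) for c in quality_string]
print(decoded)
```

Going in the other direction, `phred_to_error_prob(20)` recovers an error probability of 0.01, so a single character per base is enough to carry the p-value at this resolution.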

QC is important

Check for any problems before we put time and effort into analysing potentially bad data

  • Start with FastQC
    • Quick
    • Outputs an easy-to-read HTML report

We run FastQC from the terminal with the command

fastqc <fastq>

but there are many other options for tailoring your QC, which you can list by typing

fastqc -h
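When there are many files to check, the same command can be assembled from a script. The sketch below only builds and prints the command. It assumes FastQC's `--outdir` and `--threads` options, and the file names and output directory are hypothetical. Actually running it would also require `fastqc` on the PATH.

```python
import shlex
import subprocess
from pathlib import Path
from typing import List, Sequence


def build_fastqc_command(
    fastq_files: Sequence[str],
    outdir: str = "fastqc_reports",  # hypothetical output directory
    threads: int = 2,
) -> List[str]:
    """Assemble a fastqc call as an argument list (nothing is executed here)."""
    cmd = ["fastqc", "--outdir", outdir, "--threads", str(threads)]
    cmd.extend(fastq_files)
    return cmd


def run_fastqc(cmd: List[str], outdir: str = "fastqc_reports") -> None:
    """Run the command; the output directory is created first if missing."""
    Path(outdir).mkdir(parents=True, exist_ok=True)
    subprocess.run(cmd, check=True)


# Hypothetical input files, for illustration only
cmd = build_fastqc_command(["sample_R1.fastq.gz", "sample_R2.fastq.gz"])
print(shlex.join(cmd))
# run_fastqc(cmd)  # uncomment on a machine where FastQC is installed
```

Passing the arguments as a list rather than a single string avoids shell quoting problems with unusual file names.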

Per base sequence quality

Good Data

Bad Data

Per base sequence content

Good Data

Bad Data

Per sequence GC content

Good Data

Bad Data

Adaptor content

Good Data

Bad Data

And now onto the exercise…