October 2024

Differential Gene Expression Analysis Workflow


Fastq file format

Fastq file format - Headers

Fastq file format - Sequences

Fastq file format - Third line

Fastq file format - Quality Scores
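
The four parts named above (header, sequence, third line, quality scores) make up one record per read. As a minimal sketch, assuming each sequence sits on a single line (not wrapped), the record structure can be inspected by line position. The file name, read names and bases below are invented for illustration, and `label_fastq` and `count_reads` are hypothetical helper names.

```shell
# Write a made-up two-read FASTQ file (hypothetical data, for illustration only).
cat > example.fastq <<'EOF'
@read1 hypothetical header
GATTACA
+
IIIIIII
@read2 hypothetical header
ACGTACG
+
IIIH#!!
EOF

# Label the four lines of every record by their position in the file.
label_fastq() {
    awk 'NR % 4 == 1 { print "header:   " $0 }
         NR % 4 == 2 { print "sequence: " $0 }
         NR % 4 == 3 { print "third:    " $0 }
         NR % 4 == 0 { print "quality:  " $0 }' "$1"
}
label_fastq example.fastq

# Count reads: total number of lines divided by four.
count_reads() {
    awk 'END { print NR / 4 }' "$1"
}
count_reads example.fastq
```

Working by line position rather than searching for "@" is deliberate, because "@" is also a valid quality character and can appear at the start of a quality line.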

(Phred) Quality Scores

Sequence quality scores are transformed and translated p-values

  • Sequence bases are called after image processing (base calling)
    • Each base in a sequence has a p-value (the probability that the call is wrong) associated with it
    • p-values range from 0 to 1 (e.g. 0.05, 0.01, 1e-30)
    • A p-value of 0.01 means a 1 in 100 chance that the called base is wrong
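
The link between a quality character and its p-value can be sketched in a few lines. The standard Phred definition is Q = -10 × log10(p), so p = 10^(-Q/10). The sketch below assumes the common Phred+33 encoding, where the score is the character's ASCII code minus 33. The slides do not state which encoding the data use, and `phred_to_prob` is a hypothetical helper name.

```shell
# Convert one quality character to a Phred score and an error probability.
# Assumes Phred+33 encoding (ASCII code minus 33); check this for your own data.
phred_to_prob() {
    local char="$1"
    local ascii
    ascii=$(printf '%d' "'$char")   # ASCII code of the character
    local q=$((ascii - 33))         # Phred score
    awk -v q="$q" 'BEGIN { printf "Q=%d p=%g\n", q, 10^(-q/10) }'
}

phred_to_prob 'I'   # I is ASCII 73, so Q=40 and p=0.0001
phred_to_prob '5'   # 5 is ASCII 53, so Q=20 and p=0.01
```

Each step of 10 in Q corresponds to a tenfold drop in the chance that the base call is wrong.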

QC is important

At every stage we should check for any problems before we put time and effort into analysing potentially bad data

  • Start with FastQC on our sequencing outputs
    • Quick
    • Outputs an easy-to-read HTML report

We run FastQC from the terminal with the command

fastqc <fastq>

but there are many other options for tailoring your QC, which you can list by typing

fastqc -h
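
As a hedged example of using two of those options, the sketch below runs FastQC on every gzipped FASTQ file in the current directory and collects the reports in one folder. The directory name `qc_reports` and the `*.fastq.gz` file pattern are assumptions for illustration. The `-o` (output directory) and `-t` (number of files processed at once) options are listed in the `fastqc -h` help text.

```shell
# Hypothetical output directory and input pattern; substitute your own.
outdir="qc_reports"
mkdir -p "$outdir"   # create the directory first: FastQC expects it to exist

if command -v fastqc >/dev/null 2>&1 && ls ./*.fastq.gz >/dev/null 2>&1; then
    # -o sets the output directory, -t the number of files processed in parallel
    fastqc -o "$outdir" -t 2 ./*.fastq.gz
else
    echo "fastqc or input files not found; skipping" >&2
fi
```

Each input file should produce an HTML report and a zip archive in the output directory.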

Per base sequence quality

Good Data

Bad Data

Per base sequence content

Good Data

Bad Data

Per sequence GC content

Good Data

Bad Data

Adaptor content

Good Data

Bad Data

And now onto the exercise…

A quick intro to the environment

  • The terminal is just a text-based interface to the operating system
  • We will look at an example with a side-by-side GUI and text view of the file system…
  • You use commands instead of mouse clicks - commands are case-sensitive and can be followed by arguments separated by spaces
    • cd (change directory)
    • pwd (print the current working directory)
    • ls (list the contents of a directory)
    • flags modify what a command does - e.g. ls -a also lists hidden files
    • the directory structure is like a tree; you can go back up one level with cd ..
    • Up arrow to step back through your command history
    • Tab completion to avoid typing errors
    • less or more to look at files, and q to exit
    • ctrl-c to cancel a running command
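
A short practice session ties these commands together. This is only a sketch: the directory and file names are invented, and everything is created inside a temporary folder so nothing on your system is changed.

```shell
# Build a small throwaway directory tree (hypothetical names) and explore it.
workdir=$(mktemp -d)
mkdir -p "$workdir/project/data"
touch "$workdir/project/data/reads.fastq" "$workdir/project/.hidden_notes"

cd "$workdir/project"   # move into the project directory
pwd                     # print where we are
ls                      # visible entries only: data
ls -a                   # the -a flag also lists hidden entries such as .hidden_notes
cd data                 # go one level down the tree
ls                      # reads.fastq
cd ..                   # and back up to project
pwd
```

To view a file one page at a time, try less data/reads.fastq and press q to leave.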