Reference genomes and Common file formats

Overview

  • Reference genomes and GRC.

  • Fasta and FastQ (Unaligned sequences).

  • SAM/BAM (Aligned sequences).

  • BED (Genomic Intervals).

  • GFF/GTF (Gene annotation).

  • Wiggle files, bedGraphs and bigWigs (Genomic scores).

Are we there yet?

  • The human genome isn't complete!

  • In fact, most model organisms' reference genomes are being regularly updated.

  • A reference genome consists of a mixture of known chromosomes and unplaced contigs, together called a "Genome Reference Assembly".

  • The latest genome assembly for humans is GRCh38.

    • Patches add information to the assembly without disrupting the chromosome coordinates, e.g. GRCh38.p3

Genome Reference Consortium 

  • The GRC is a collaboration of institutes which curate and maintain the reference genomes for 3 model organisms.
    • Human - GRCh38.p3
    • Mouse - GRCm38.p3
    • Zebrafish - GRCz10
  • Other model organisms are maintained separately.
    •  Drosophila - Berkeley Drosophila Genome Project, BDGP36

Why do we need to know about reference genomes?

  • Allows for genes and genomic features to be evaluated in their linear genomic context.
    • Gene A is close to Gene B
    • Gene A and Gene B are within feature C.
  • Can be used to align shallow, targeted high-throughput sequencing to a pre-built map of an organism's genome.

Aligning to a reference genome

 

Figure: DNA/cDNA → Fragment DNA (PCR amplify) → Sequence DNA → Unaligned sequence → Aligned sequences (aligned against the reference genome)

A reference genome

  • A reference genome is a collection of contigs.
  • A contig is a stretch of DNA sequence encoded as A,G,C,T,N.
  • Typically comes in FASTA format.
    • ">" line contains information on contig
    • Lines following contain contig sequence
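As a minimal sketch of the FASTA layout described above, the following Python reads ">" lines as contig names and joins the lines that follow into one sequence per contig. The function name `read_fasta` and the two example contigs are invented for illustration and are not part of any real assembly.

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Collect contig sequences keyed by the first word of each '>' line."""
    contigs: Dict[str, List[str]] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # The contig name is the first word after '>'; the rest is description.
            name = line[1:].split()[0]
            contigs[name] = []
        elif name is not None:
            contigs[name].append(line.upper())
    return {contig: "".join(parts) for contig, parts in contigs.items()}


# Hypothetical two-contig reference, invented for illustration.
example = """>chr1 example chromosome
ACGTNNAC
GGTA
>chrUn_1 example unplaced contig
TTGACA
""".splitlines()

genome = read_fasta(example)
for contig, sequence in genome.items():
    print(contig, len(sequence))
```

Real reference genomes are far too large to handle comfortably this way; the sketch is only meant to show how the two kinds of line relate.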

High-throughput Sequencing formats

Unaligned sequence files generated from HTS machines are mapped to a reference genome to produce aligned sequence files.

  • FASTQ - Unaligned sequences 

  • SAM - Aligned sequences

Unaligned Sequences

FastQ (FASTA with Qualities)

  • "@" followed by identifier.
  • Sequence information.
  • "+" 
  • Quality scores encoded as ASCII.
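The four-line record structure above can be sketched in Python as follows. This assumes each record occupies exactly four lines (no wrapped sequences); `FastqRecord`, `read_fastq` and the two example reads are hypothetical names and data made up for illustration.

```python
from itertools import islice
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str
    sequence: str
    qualities: str


def read_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield one record per group of four lines: '@' id, sequence, '+', qualities."""
    iterator = iter(lines)
    while True:
        chunk = [line.rstrip("\n") for line in islice(iterator, 4)]
        if len(chunk) < 4:
            return  # end of input (or a truncated final record)
        header, sequence, plus, qualities = chunk
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"Malformed FASTQ record: {header!r}")
        if len(sequence) != len(qualities):
            raise ValueError(f"Sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, qualities)


# Two invented reads, for illustration only.
example = """@read1
ACGTN
+
IIII#
@read2
GGCA
+
!5?I
""".splitlines()

records = list(read_fastq(example))
for record in records:
    print(record.identifier, record.sequence, record.qualities)
```

Counting lines rather than searching for "@" matters because "@" is also a valid quality character.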

Unaligned Sequences

FastQ - Header

  • Header for each read can contain additional information
    • HS2000-887_89 - Machine name.
    • 5 - Flowcell lane.
    • /1 - Read 1 or 2 of pair (here read 1)
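To illustrate how these header fields might be pulled apart, here is a small sketch. It assumes a colon-separated header in which the machine name comes first, the lane second, and the read number follows a "/". The full header string used below is hypothetical: only the machine name, lane and "/1" suffix come from the slide, and the remaining fields are invented.

```python
from typing import Dict, Optional, Union


def parse_header(header: str) -> Dict[str, Union[str, int, None]]:
    """Split a colon-separated read header into machine, lane and read number."""
    header = header.lstrip("@")
    name, _, mate = header.partition("/")  # '/1' or '/2' marks the read of a pair
    fields = name.split(":")
    read_in_pair: Optional[int] = int(mate) if mate else None
    return {
        "machine": fields[0],
        "lane": int(fields[1]),
        "read_in_pair": read_in_pair,
    }


# Hypothetical header: fields after the lane (tile and coordinates) are invented.
info = parse_header("@HS2000-887_89:5:1101:1234:5678/1")
print(info)
```

Header layouts differ between instruments and software versions, so the field positions here should be checked against real data before use.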

Unaligned Sequences

FastQ - Qualities

  • Qualities follow "+" line.
  • Phred score: -10 × log10 of the probability that the base call is wrong.
  • Encoded in ASCII to save space.
  • Used in quality assessment and downstream analysis
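A quality string can be decoded back into scores and error probabilities with a few lines of Python. This sketch assumes the common Phred+33 encoding (ASCII code minus 33), which is an assumption about the data; older files may use a different offset. The function names are illustrative.

```python
from typing import List


def phred_scores(qualities: str, offset: int = 33) -> List[int]:
    """Convert each ASCII quality character to its Phred score (assumes Phred+33)."""
    return [ord(character) - offset for character in qualities]


def error_probability(score: int) -> float:
    """Invert Q = -10 * log10(p) to recover the probability of a wrong base call."""
    return 10 ** (-score / 10)


scores = phred_scores("!5I")
for score in scores:
    print(score, error_probability(score))
```

Under this encoding "!" is the lowest possible quality (score 0, certain error) and a score of 20 corresponds to a 1 in 100 chance that the base is wrong.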

Aligned sequences

SAM format

  • SAM - Sequence Alignment Map.
  • Standard format for aligned sequence data
  • Recognised by majority of software and browsers.

Aligned sequences

SAM - Header

  • SAM header contains information on alignment and contigs used.
  • @HD - Version number and sorting information
  • @SQ - Contig/Chromosome name and length of sequence.
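As a rough sketch of reading these header lines, the Python below collects the @HD tags and the name and length of every @SQ contig from tab-separated SAM header lines. The example header, including contig names and lengths, is invented for illustration.

```python
from typing import Dict, Iterable, List, Tuple


def parse_sam_header(lines: Iterable[str]) -> Dict[str, object]:
    """Extract @HD tags and (name, length) pairs from @SQ lines of a SAM header."""
    hd_tags: Dict[str, str] = {}
    contigs: List[Tuple[str, int]] = []
    for line in lines:
        if not line.startswith("@"):
            break  # header lines come first; stop at the first alignment
        record_type, *fields = line.rstrip("\n").split("\t")
        if record_type == "@HD":
            hd_tags = dict(field.split(":", 1) for field in fields)
        elif record_type == "@SQ":
            tags = dict(field.split(":", 1) for field in fields)
            contigs.append((tags["SN"], int(tags["LN"])))
    return {"HD": hd_tags, "SQ": contigs}


# Invented header and one alignment line, for illustration only.
example = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:1000",
    "@SQ\tSN:chr2\tLN:800",
    "read1\t0\tchr1\t101\t60\t100M\t*\t0\t0\t*\t*",
]

header = parse_sam_header(example)
print(header["HD"])
print(header["SQ"])
```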

Aligned sequences

SAM - Aligned Reads

 

  • Contains read and alignment information and location

Aligned sequences

SAM

  • Read name.
  • Sequence of read.
  • Encoded sequence quality.

Aligned sequences

SAM

  • Chromosome to which read aligns.
  • Leftmost position in the chromosome to which the read aligns (1-based).
  • Alignment information - "Cigar string".
    • 100M - Continuous match of 100 bases
    • 28M1D72M - 28 bases continuously match, 1 deletion from reference, 72 base match
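The two Cigar strings above can be unpacked with a short regular expression. This sketch splits a Cigar string into (length, operation) pairs and works out how many reference bases and how many read bases the alignment covers; the function names are illustrative.

```python
import re
from typing import List, Tuple

CIGAR_PATTERN = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a Cigar string into (length, operation) pairs."""
    operations = [(int(length), op) for length, op in CIGAR_PATTERN.findall(cigar)]
    if "".join(f"{length}{op}" for length, op in operations) != cigar:
        raise ValueError(f"Unrecognised Cigar string: {cigar!r}")
    return operations


def reference_span(cigar: str) -> int:
    """Bases of the reference covered: M, D, N, = and X consume the reference."""
    return sum(length for length, op in parse_cigar(cigar) if op in "MDN=X")


def read_length(cigar: str) -> int:
    """Bases of the read used: M, I, S, = and X consume the read."""
    return sum(length for length, op in parse_cigar(cigar) if op in "MIS=X")


for cigar in ("100M", "28M1D72M"):
    print(cigar, parse_cigar(cigar), reference_span(cigar), read_length(cigar))
```

For 28M1D72M the read is still 100 bases long, but it spans 101 bases of the reference because of the deleted base.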

Aligned sequences

SAM

  • Bit flag - TRUE/FALSE for pre-defined read criteria
    • Paired? Duplicate?
    • https://broadinstitute.github.io/picard/explain-flags.html
  • Paired read position and insert size
  • User-defined tags.
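The bit flag is a single integer in which each bit answers one TRUE/FALSE question. A minimal sketch of decoding it in Python is shown below; the bit values follow the SAM specification, while the short descriptive names are chosen here for readability.

```python
from typing import List

# Bit values from the SAM specification; the names are informal labels.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "fails_qc",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag: int) -> List[str]:
    """List the criteria whose bit is set in a SAM flag."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


for flag in (99, 1123):
    print(flag, decode_flag(flag))
```

The Picard page linked above performs the same decoding interactively.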

Summarised Genomic Features formats

Post alignment, sequence reads are typically summarised into scores over/within genomic intervals.

  • BED - Genomic intervals and information.

  • Wiggle/BedGraph - Genomic intervals and scores.

  • GFF - Genomic annotation with information and scores

Summarising in genomic intervals.

BED format (BED3)

  • Simple format
  • 3 tab-separated columns
  • Chromosome, start, end

Summarising in genomic intervals.

BED format (BED6)

  • Chromosome, start, end
  • Identifier
  • Score
  • Strand ("." for strandless)
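A BED line can be read into a small record as sketched below. The parser accepts BED3 up to BED6 and fills missing optional columns with placeholders. Note that BED starts are 0-based and ends are exclusive, so the interval length is simply end minus start. `BedInterval`, `parse_bed_line` and the example lines are invented for illustration.

```python
from typing import NamedTuple, Optional


class BedInterval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive
    name: str
    score: Optional[float]
    strand: str


def parse_bed_line(line: str) -> BedInterval:
    """Parse one tab-separated BED3 to BED6 line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("A BED line needs at least chromosome, start and end")
    name = fields[3] if len(fields) > 3 else "."
    score = float(fields[4]) if len(fields) > 4 else None
    strand = fields[5] if len(fields) > 5 else "."
    return BedInterval(fields[0], int(fields[1]), int(fields[2]), name, score, strand)


# Invented intervals, for illustration only.
interval = parse_bed_line("chr1\t100\t200\tpeak1\t50\t+")
minimal = parse_bed_line("chr2\t0\t10")
print(interval, interval.end - interval.start)
print(minimal)
```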

Signal at genomic positions

Wiggle

  • Information line
    • Chromosome
    • Step size
  • Step start position
  • Score
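The layout above matches the "fixedStep" flavour of Wiggle, where an information line sets the chromosome, start and step, and each following line holds one score. A minimal sketch of expanding such a block into (chromosome, position, score) triples is given below; it ignores the optional span parameter and the variableStep flavour, and the example data are invented.

```python
from typing import Iterable, Iterator, Tuple


def parse_fixed_step(lines: Iterable[str]) -> Iterator[Tuple[str, int, float]]:
    """Yield (chromosome, 1-based position, score) from fixedStep Wiggle lines."""
    chrom = None
    position = 0
    step = 0
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue  # skip display settings and comments
        if line.startswith("fixedStep"):
            # Information line, e.g. 'fixedStep chrom=chr1 start=101 step=10'
            params = dict(item.split("=", 1) for item in line.split()[1:])
            chrom = params["chrom"]
            position = int(params["start"])
            step = int(params["step"])
            continue
        if chrom is None:
            raise ValueError("Score found before any fixedStep information line")
        yield chrom, position, float(line)
        position += step


# Invented signal, for illustration only.
example = """fixedStep chrom=chr1 start=101 step=10
1.5
2.0
0.0
""".splitlines()

signal = list(parse_fixed_step(example))
print(signal)
```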

Signal at genomic positions

bedGraph

  • BED3 format
    • Chromosome
    • Start 
    • End
  • 4th column - Score

Genomic Annotation

GFF

  • Used for genome annotation.
  • Stores position, feature (exon) and meta-feature (transcript/gene) information.

Genomic Annotation

GFF

  • Chromosome
  • Start of feature
  • End of Feature
  • Strand

Genomic Annotation

GFF

  • Source
  • Feature type
  • Score

Genomic Annotation

GFF

  • Column 9 contains key-value pairs (ID=exon01), separated by semi-colons ";"
  • ID - Feature name.
  • Parent - Meta-feature name.
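Putting the GFF columns together, the sketch below parses one line into its nine tab-separated columns and splits column 9 into a dictionary. It assumes GFF3-style "key=value" attributes (GTF writes attributes differently), and the example exon line is invented for illustration.

```python
from typing import Dict, NamedTuple, Optional
from urllib.parse import unquote


class GffFeature(NamedTuple):
    chrom: str
    source: str
    feature_type: str
    start: int  # 1-based, inclusive
    end: int    # inclusive
    score: Optional[float]
    strand: str
    attributes: Dict[str, str]


def parse_attributes(column9: str) -> Dict[str, str]:
    """Split 'ID=exon01;Parent=transcript01' into a dictionary."""
    attributes = {}
    for pair in column9.strip().split(";"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        attributes[key.strip()] = unquote(value.strip())  # GFF3 uses URL escaping
    return attributes


def parse_gff_line(line: str) -> GffFeature:
    """Parse one nine-column GFF3 line; '.' in the score column means no score."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("A GFF line needs exactly 9 tab-separated columns")
    chrom, source, feature_type, start, end, score, strand, _phase, column9 = fields
    return GffFeature(
        chrom, source, feature_type, int(start), int(end),
        None if score == "." else float(score), strand,
        parse_attributes(column9),
    )


# Invented exon, for illustration only.
feature = parse_gff_line("chr1\texample\texon\t100\t200\t.\t+\t.\tID=exon01;Parent=transcript01")
print(feature)
```

Unlike BED, GFF coordinates are 1-based and include the end position, so this exon is 101 bases long.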

Saving time and space

bigWig, bigBed and Tabix

  • Many programs and browsers deal better with compressed, indexed versions of genomic files
    • SAM -> BAM (.bam and index file of .bai)
    • Wiggle and bedGraph -> bigWig (.bw/.bigWig)
    • BED -> bigBed (.bb)
    • BED and GFF -> bgzip-compressed, Tabix-indexed files (.gz and index file of .tbi)

Getting help and more information

  • UCSC file formats
    • https://genome.ucsc.edu/FAQ/FAQformat.html
  • IGV file formats
    • https://www.broadinstitute.org/igv/FileFormats
  • Sanger (GFF)
    • https://www.sanger.ac.uk/resources/software/gff/spec.html