Reference genomes and Common file formats

Overview

Reference genomes and GRC.
Fasta and FastQ (Unaligned sequences).
SAM/BAM (Aligned sequences).
BED (Genomic Intervals).
GFF/GTF (Gene annotation).
Wiggle files, BEDgraphs and BigWigs (Genomic scores).

Are we there yet?

The human genome isn't complete!
In fact, most model organisms' reference genomes are regularly updated.
A reference genome consists of a mixture of known chromosomes and unplaced contigs, together called a "Genome Reference Assembly".
The latest genome assembly for humans is GRCh38.
- Patches add information to the assembly without disrupting the chromosome coordinates, e.g. GRCh38.p3

Genome Reference Consortium

GRC is a collaboration of institutes which curate and maintain the reference genomes for 3 model organisms.
- Human - GRCh38.p3
- Mouse - GRCm38.p3
- Zebrafish - GRCz10

Other model organisms are maintained separately.
- Drosophila - Berkeley Drosophila Genome Project, BDGP36

Why do we need to know about reference genomes?

Allows for genes and genomic features to be evaluated in their linear genomic context.
- Gene A is close to Gene B
- Gene A and Gene B are within feature C.
Can be used to align shallow, targeted high-throughput sequencing to a pre-built map of an organism's genome.

Aligning to a reference genome

[Figure: DNA/cDNA is fragmented, PCR amplified and sequenced to give unaligned sequences, which are aligned to the reference genome to give aligned sequences.]

A reference genome

A reference genome is a collection of contigs.
A contig is a stretch of DNA sequence encoded as A,G,C,T,N.
Typically comes in FASTA format.
- ">" line contains information on contig
- Lines following contain contig sequence
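The ">" line / sequence line layout above can be read with a few lines of Python. This is a minimal sketch rather than a full parser: the function name read_fasta and the two example contigs are made up for illustration, and the first word after ">" is assumed to be the contig name.

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {contig name: sequence} from FASTA-formatted lines."""
    parts: Dict[str, List[str]] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: the first word after ">" is taken as the contig name.
            name = line[1:].split()[0]
            parts[name] = []
        elif name is not None:
            # Sequence line: belongs to the most recent header.
            parts[name].append(line.upper())
    return {contig: "".join(chunks) for contig, chunks in parts.items()}


# Hypothetical two-contig reference, made up for illustration.
example = """\
>chr1 example chromosome
ACGTNNAC
GGTA
>chrUn_1 example unplaced contig
TTGA
"""

genome = read_fasta(example.splitlines())
for contig, sequence in genome.items():
    print(contig, len(sequence))
```

Sequence lines are joined per contig, so a contig split over many lines comes back as one string.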

High-throughput Sequencing formats

Unaligned sequence files generated from HTS machines are mapped to a reference genome to produce aligned sequence files.

FASTQ - Unaligned sequences
SAM - Aligned sequences

Unaligned Sequences

FastQ (FASTA with Qualities)

"@" followed by identifier.
Sequence information.
"+"
Quality scores encoded as ASCII.
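The four-line record described above ("@" identifier, sequence, "+", qualities) can be sketched as a small reader. The function name read_fastq and the example reads are hypothetical, and the sketch assumes each record occupies exactly four lines, which is the usual layout from sequencing machines.

```python
from typing import Iterable, Iterator, Tuple


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, sequence, quality string) for each four-line record."""
    rows = [line.strip() for line in lines if line.strip()]
    if len(rows) % 4 != 0:
        raise ValueError("FASTQ input should contain four lines per read")
    for i in range(0, len(rows), 4):
        header, sequence, plus, quality = rows[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record number {i // 4 + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence and quality lengths differ for {header}")
        # Drop the leading "@" so only the identifier is returned.
        yield header[1:], sequence, quality


# Made-up reads for illustration only.
example = """\
@read_1/1
ACGTN
+
IIII#
@read_2/1
GGCA
+
FFFF
"""

reads = list(read_fastq(example.splitlines()))
for identifier, sequence, quality in reads:
    print(identifier, sequence, quality)
```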

Unaligned Sequences

FastQ - Header

Header for each read can contain additional information
- HS2000-887_89 - Machine name.
- 5 - Flowcell lane.
- /1 - Read 1 or 2 of pair (here read 1)

Unaligned Sequences

FastQ - Qualities

Qualities follow "+" line.
Phred score: -10 * log10(probability of sequence base being wrong).
Encoded in ASCII to save space.
Used in quality assessment and downstream analysis
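To make the quality encoding concrete, the sketch below converts a quality string back to Phred scores and error probabilities. It assumes the common Phred+33 (Sanger) ASCII offset; older Illumina pipelines used an offset of 64, so the offset is left as a parameter. The function names are made up for illustration.

```python
from typing import List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode an ASCII quality string into Phred scores (offset is an assumption)."""
    return [ord(character) - offset for character in quality]


def error_probability(score: float) -> float:
    """Invert Q = -10 * log10(P) to get the probability that a base is wrong."""
    return 10 ** (-score / 10)


scores = phred_scores("I5#")
print(scores)  # [40, 20, 2]
print([error_probability(score) for score in scores])
```

A score of 20 corresponds to a 1 in 100 chance of a wrong base call, and a score of 30 to 1 in 1000.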

Aligned sequences

SAM format

SAM - Sequence Alignment Map.
Standard format for aligned sequence data.
Recognised by majority of software and browsers.

Aligned sequences

SAM - Header

SAM header contains information on alignment and contigs used.
@HD - Version number and sorting information
@SQ - Contig/Chromosome name and length of sequence.
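As a rough illustration of the header layout, the sketch below pulls the @HD fields and the @SQ contig names and lengths out of tab-separated header lines. The function name parse_sam_header and the example header are made up; the tag names VN, SO, SN and LN are the standard ones from the SAM specification.

```python
from typing import Dict, Iterable, List, Tuple


def parse_sam_header(lines: Iterable[str]) -> Tuple[Dict[str, str], List[Tuple[str, int]]]:
    """Return (@HD tags, [(contig name, contig length), ...]) from SAM lines."""
    hd_tags: Dict[str, str] = {}
    contigs: List[Tuple[str, int]] = []
    for line in lines:
        if not line.startswith("@"):
            break  # Header lines come first; stop at the first alignment line.
        fields = line.rstrip("\n").split("\t")
        record_type = fields[0]
        if record_type not in ("@HD", "@SQ"):
            continue  # Other header records (e.g. @PG, @CO) are skipped here.
        tags = dict(field.split(":", 1) for field in fields[1:])
        if record_type == "@HD":
            hd_tags = tags
        else:
            contigs.append((tags["SN"], int(tags["LN"])))
    return hd_tags, contigs


# Made-up header with illustrative contig lengths.
example = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:1000",
    "@SQ\tSN:chr2\tLN:800",
    "@PG\tID:example_aligner",
]

hd_tags, contigs = parse_sam_header(example)
print(hd_tags)
print(contigs)
```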

Aligned sequences

SAM - Aligned Reads

Contains read and alignment information and location

Aligned sequences

SAM

Read name.
Sequence of read.
Encoded sequence quality.

Aligned sequences

SAM

Chromosome to which read aligns.
Leftmost position (1-based) in chromosome to which read aligns.
Alignment information - "Cigar string".
- 100M - Continuous match of 100 bases
- 28M1D72M - 28 bases continuously match, 1 deletion from reference, 72 base match
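The two Cigar strings above can be unpacked with a short regular expression. This is a sketch with made-up function names; it relies on the SAM specification's rule that M, D, N, = and X consume reference bases while M, I, S, = and X consume read bases.

```python
import re
from typing import List, Tuple

CIGAR_OPERATION = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a Cigar string into (length, operation) pairs."""
    if not re.fullmatch(r"(\d+[MIDNSHP=X])+", cigar):
        raise ValueError(f"not a valid Cigar string: {cigar!r}")
    return [(int(length), op) for length, op in CIGAR_OPERATION.findall(cigar)]


def reference_span(cigar: str) -> int:
    """Number of reference bases covered by the alignment."""
    return sum(length for length, op in parse_cigar(cigar) if op in "MDN=X")


def read_length(cigar: str) -> int:
    """Number of read bases described by the Cigar string."""
    return sum(length for length, op in parse_cigar(cigar) if op in "MIS=X")


for cigar in ("100M", "28M1D72M"):
    print(cigar, parse_cigar(cigar), reference_span(cigar), read_length(cigar))
```

For 28M1D72M the read is still 100 bases long, but it covers 101 bases of the reference because of the deleted base.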

Aligned sequences

SAM

Bit flag - TRUE/FALSE for pre-defined read criteria
- Paired? Duplicate?
- https://broadinstitute.github.io/picard/explain-flags.html
Paired read position and insert size
User-defined tags.
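The bit flag is a single integer in which each bit answers one TRUE/FALSE question. The sketch below decodes it using the bit values from the SAM specification; the short labels and the function name decode_flag are made up for illustration.

```python
from typing import List

# Bit values as defined in the SAM specification; labels are informal.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag: int) -> List[str]:
    """Return the labels of every bit that is set in a SAM flag."""
    return [label for bit, label in FLAG_BITS.items() if flag & bit]


print(decode_flag(99))
print(decode_flag(1024))
```

A flag of 99, for example, is 1 + 2 + 32 + 64: a paired read, in a proper pair, whose mate is on the reverse strand, and which is the first read of the pair.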

Summarised Genomic Features formats

Post alignment, sequence reads are typically summarised into scores over/within genomic intervals.

BED - Genomic intervals and information.
Wiggle/BedGraph - Genomic intervals and scores.
GFF - Genomic annotation with information and scores

Summarising in genomic intervals.

BED format (BED)

Simple format
3 tab separated columns
Chromosome, start, end

Summarising in genomic intervals.

BED format (BED6)

Chromosome, start, end
Identifier
Score
Strand ("." for strandless)
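A BED3 or BED6 line can be read into a small record as sketched below. The class and function names and the example lines are made up. One detail not stated above but taken from the UCSC format description: BED starts are 0-based and ends are exclusive, so interval length is simply end minus start.

```python
from typing import NamedTuple, Optional


class BedInterval(NamedTuple):
    chrom: str
    start: int  # 0-based start
    end: int    # exclusive end
    name: Optional[str] = None
    score: Optional[float] = None
    strand: str = "."

    @property
    def length(self) -> int:
        return self.end - self.start


def parse_bed_line(line: str) -> BedInterval:
    """Parse one tab-separated BED line with 3 to 6 columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("BED lines need at least chromosome, start and end")
    name = fields[3] if len(fields) > 3 else None
    score = float(fields[4]) if len(fields) > 4 else None
    strand = fields[5] if len(fields) > 5 else "."
    return BedInterval(fields[0], int(fields[1]), int(fields[2]), name, score, strand)


# Made-up intervals for illustration.
bed3 = parse_bed_line("chr1\t100\t200")
bed6 = parse_bed_line("chr1\t100\t200\tpeak_1\t960\t+")
print(bed3, bed3.length)
print(bed6, bed6.length)
```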

Signal at genomic positions

Wiggle

Information line
- Chromosome
- Step size
Step start position
Score
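As a sketch of how a Wiggle file is laid out, the code below reads the variableStep flavour, where an information line gives the chromosome and span and each data line gives a start position and a score. Treating the "step size" above as the span setting is an assumption, as are the function name and example data. Wiggle positions are 1-based, so they are shifted to 0-based starts on output.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[str, int, int, float]


def wiggle_to_intervals(lines: Iterable[str]) -> List[Interval]:
    """Convert variableStep Wiggle lines to (chrom, start, end, score) intervals."""
    chrom = None
    span = 1
    intervals: List[Interval] = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue
        if line.startswith("variableStep"):
            # Information line, e.g. "variableStep chrom=chr1 span=10".
            settings = dict(item.split("=", 1) for item in line.split()[1:])
            chrom = settings["chrom"]
            span = int(settings.get("span", 1))
            continue
        if chrom is None:
            raise ValueError("data line found before any variableStep line")
        position, score = line.split()
        start = int(position) - 1  # 1-based Wiggle position to 0-based start
        intervals.append((chrom, start, start + span, float(score)))
    return intervals


# Made-up signal for illustration.
example = [
    "variableStep chrom=chr1 span=10",
    "101 0.5",
    "201 1.5",
]
intervals = wiggle_to_intervals(example)
print(intervals)
```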

Signal at genomic positions

bedGraph

BED 3 format
- Chromosome
- Start
- End
4th column - Score
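Because bedGraph is just BED3 plus a score, summarising signal over a region takes little code. The sketch below parses bedGraph lines and computes a length-weighted mean score over a query region; the function names and example values are made up, and coordinates are assumed to be 0-based with exclusive ends as in BED.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[str, int, int, float]


def parse_bedgraph(lines: Iterable[str]) -> List[Interval]:
    """Parse bedGraph lines into (chrom, start, end, score) tuples."""
    intervals: List[Interval] = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end, score = line.split()[:4]
        intervals.append((chrom, int(start), int(end), float(score)))
    return intervals


def mean_signal(intervals: List[Interval], chrom: str, start: int, end: int) -> float:
    """Length-weighted mean score over the covered part of chrom:start-end."""
    total = 0.0
    covered = 0
    for interval_chrom, interval_start, interval_end, score in intervals:
        if interval_chrom != chrom:
            continue
        overlap = min(end, interval_end) - max(start, interval_start)
        if overlap > 0:
            total += score * overlap
            covered += overlap
    return total / covered if covered else 0.0


# Made-up signal for illustration.
example = [
    "chr1\t0\t100\t1.0",
    "chr1\t100\t200\t3.0",
    "chr2\t0\t50\t5.0",
]
signal = parse_bedgraph(example)
print(mean_signal(signal, "chr1", 50, 150))
```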

Genomic Annotation

GFF

Used for genome annotation.
Stores position, feature (exon) and meta-feature (transcript/gene) information.

Genomic Annotation

GFF

Chromosome
Start of feature
End of Feature
Strand

Genomic Annotation

GFF

Source
Feature type
Score

Genomic Annotation

GFF

Column 9 contains key-value pairs (ID=exon01), separated by semicolons ";"
ID - Feature name.
Parent - Meta-feature name.
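Putting the GFF columns together, the sketch below splits one line into its nine tab-separated columns and turns column 9 into a dictionary. It assumes the GFF3 column order (chromosome, source, feature type, start, end, score, strand, phase, attributes); the function names and the example line are made up for illustration.

```python
from typing import Any, Dict


def parse_attributes(column9: str) -> Dict[str, str]:
    """Turn 'ID=exon01;Parent=transcript01' into a dictionary."""
    attributes: Dict[str, str] = {}
    for pair in column9.strip().split(";"):
        pair = pair.strip()
        if not pair:
            continue
        key, _, value = pair.partition("=")
        attributes[key] = value
    return attributes


def parse_gff_line(line: str) -> Dict[str, Any]:
    """Parse one GFF line, assuming the nine-column GFF3 layout."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("GFF lines should have 9 tab-separated columns")
    chrom, source, feature, start, end, score, strand, phase, column9 = fields
    return {
        "chrom": chrom,
        "source": source,
        "feature": feature,
        "start": int(start),
        "end": int(end),
        "score": None if score == "." else float(score),
        "strand": strand,
        "phase": phase,
        "attributes": parse_attributes(column9),
    }


# Made-up exon record for illustration.
example = "chr1\texample_source\texon\t1000\t1200\t.\t+\t.\tID=exon01;Parent=transcript01"
record = parse_gff_line(example)
print(record["feature"], record["attributes"]["ID"], record["attributes"]["Parent"])
```

Grouping records by their Parent value is then enough to collect the exons of each transcript.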

Saving time and space

bigWig, bigBED and TABIX

Many programs and browsers deal better with compressed, indexed versions of genomic files
- SAM -> BAM (.bam and index file of .bai)
- Wiggle and bedGraph -> bigWig (.bw/.bigWig)
- BED -> bigBed (.bb)
- BED and GFF -> bgzip compressed and TABIX indexed (.gz and index file of .tbi)

Getting help and more information

UCSC file formats
- https://genome.ucsc.edu/FAQ/FAQformat.html
IGV file formats
- https://www.broadinstitute.org/igv/FileFormats
Sanger (GFF)
- https://www.sanger.ac.uk/resources/software/gff/spec.html