Understanding file formats for aligned reads

Unlike most of Bioinfomatics, a single standard file format has emerged for aligned reads. Moreoever, this file format is consistent regardless of whether you have DNA-seq, RNA-seq, ChIP-seq… data.

The `.sam` file

Sequence Alignment/Map (sam)
The output from an aligner such as bwa
Same format regardless of sequencing protocol (i.e. RNA-seq, ChIP-seq, DNA-seq etc)
May contain un-mapped reads
Potentially large size on disk; ~100s of Gb
- Can be manipulated with standard unix tools; e.g. cat, head, grep, more, less….
Official specification can be obtained online: -
We normally work on a compressed version called a .bam file. See later.
Header lines starting with an @ character, followed by tab-delimited lines
- Header gives information about the alignment and references sequences used

The first part of the header lists the names (SN) of the sequences (chromosomes) used in alignment, their length (LN) and a md5sum “digital fingerprint” of the .fasta file used for alignment (M5).

@HD VN:1.5  SO:coordinate   GO:none
@SQ SN:1    LN:249250621    M5:1b22b98cdeb4a9304cb5d48026a85128
@SQ SN:2    LN:243199373    M5:a0d9851da00400dec1098a9255ac712e
@SQ SN:3    LN:198022430    M5:fdfd811849cc2fadebc929bb925902e5
@SQ SN:4    LN:191154276    M5:23dccd106897542ad87d2765d28a19a1
.....
.....

Next we can define the read groups present in the file which we can use to identify which sequencing library, sequencing centre, Lane, sample name etc.

@RG ID:SRR077850    CN:bi   LB:Solexa-42057 PL:illumina PU:ILLUMINA SM:NA06984
@RG ID:SRR081675    CN:bi   LB:Solexa-42316 PL:illumina PU:ILLUMINA SM:NA06984
@RG ID:SRR080818    CN:bi   LB:Solexa-44770 PL:illumina PU:ILLUMINA SM:NA06984
@RG ID:SRR084838    CN:bi   LB:Solexa-42316 PL:illumina PU:ILLUMINA SM:NA06984
@RG ID:SRR081730    CN:bi   LB:Solexa-42316 PL:illumina PU:ILLUMINA SM:NA06984
.....
.....

Finally, we have a section where we can record the processing steps used to derive the file

@PG ID:MosaikAligner    CL:/share/home/wardag/programs/MOSAIK/bin/MosaikAligner -in /scratch/wardag/NA06984.SRR077850.mapped.illumina.mosaik.CEU.SINGLE.20111114/NA06984.SRR077850.mapped.illumina.mosaik.CEU.SINGLE.20111114.mkb -out
....
....

Next is a tab-delimited section that describes the alignment of each sequence in detail.

SRR081708.237649    163 1   10003   6   1S67M   =   10041   105 GACCCTGACCCTAACCCTGACCCTGACCCTAACCCTGACCCTGACCCTAACCCTGACCCTAACCCTAA    S=<====<<>=><?=?=?>==@??;?>@@@=??@@????@??@?>?@@<@>@'@=?=??=<=>?>?=Q    ZA:Z:<&;0;0;;308;68M;68><@;0;0;;27;;>MD:Z:5A11A5A11A5A11A13 RG:Z:SRR081708  NM:i:6  OQ:Z:GEGFFFEGGGDGDGGGDGA?DCDD:GGGDGDCFGFDDFFFCCCBEBFDABDD-D:EEEE=D=DDDDC:

Column	Official Name	Brief
1	QNAME	Sequence ID
2	FLAG	Sequence quality expressed as a bitwise flag
3	RNAME	Chromosome
4	POS	Start Position
5	MAPQ	Mapping Quality
6	CIGAR	Describes positions of matches, insertions, deletions w.r.t reference
7	RNEXT	Ref. name of mate / next read
8	PNEXT	Postion of mate / next read
9	TLEN	Observed Template length
10	SEQ	Sequence
11	QUAL	Base Qualities

There can also be all manner of optional tags as extra columns introduce by an aligner or downstream analysis tool. A common use is the RG tag which refers back to the read groups in the header.

Unlike the .fastq files, where we had a separate file for forward and reverse reads, the .sam file contains all reads. Reads that are paired with each other should appear in consecutive lines in the .sam file generated by an aligner. Otherwise it is more common for the file to be sorted according to genomic coordinates. The paired reads should share the same sequence ID in the first column (sometimes with a /1 or /2 to indicate which is which).

Dr Mark Dunning presents….Fun with flags!

The “flags” in the sam file can represent useful QC information

Read is unmapped
Read is paired / unpaired
Read failed QC
Read is a PCR duplicate (see later)

The combination of any of these properties is used to derive a numeric value

For instance, a particular read has a flag of 163

Derivation

There is a set of properties that a read can possess. If a particular property is observed, a corresponding power of 2 is added multiplied by 1. The final value is derived by summing all the powers of 2.

	ReadHasProperty	Binary	MultiplyBy
isPaired	TRUE	1	1
isProperPair	TRUE	1	2
isUnmappedQuery	FALSE	0	4
hasUnmappedMate	FALSE	0	8
isMinusStrand	FALSE	0	16
isMateMinusStrand	TRUE	1	32
isFirstMateRead	TRUE	1	64
isSecondMateRead	FALSE	0	128
isSecondaryAlignment	FALSE	0	256
isNotPassingQualityControls	FALSE	0	512
isDuplicate	FALSE	0	1024

Value of flag is given by 1x1 + 1x2 + 0x4 + 0x8 + 0x16 + 1x32 + 1x64 + 0x128 + 0x256 + 0x512 + 0x1024 = 99

Have a CIGAR!

The CIGAR (Compact Idiosyncratic Gapped Alignment Report) string is a way of encoding the match between a given sequence and the position it has been assigned in the genome. It is comprised by a series of letters and numbers to indicate how many consecutive bases have that mapping.

Code	Description
M	alignment match
I	insertion
D	deletion
N	skipped
S	soft-clipping
H	hard-clipping

e.g.

68M
- 68 bases matching the reference
1S67M
- 1 soft-clipped read followed by 67 matches
15M87N70M90N16M
- 15 matches following by 87 bases skipped followed by 70 matches etc.

Rather than dealing with .sam files, we usually analyse a .bam file.

samtools

If you want to try any of these commands, click the link to the CRUK Docker on the Desktop

samtools is one of the most-popular ngs-related software suites and has a wealth of tools for dealing with files in .bam and .sam format. If you are going to start processing your own data, the chances are you’ll be using samtools a lot. Typing samtools in a terminal will give a quick overview of the functions available, and importantly, the version number of the software.

samtools

More information about a particular command within samtools can be displayed by printing samtools followed by the name of the command. For example, the samtools view command can convert a .sam file into a compressed version called a .bam file. This is a very-common step in an NGS workflow.

Lets assume we have a file sample.sam which is the result of aligning reads using a tool such as bwa. Then the command to convert would be:-

# We don't actually have a file called sample.sam. This is just an example of what the command would look like

samtools view -bS sample.sam > sample.bam

-b: output a bam file
-S: input is a sam file
> : redirect to a file

So a .bam file is

Exactly the same information as a sam file
..except that it is binary version of sam
compressed around x4
However, attempting to read with standard unix tools cat, head etc will print garbage to the screen

When viewing a .sam or .bam, we can choose to just view the header information

cd /data/test
samtools view -H paired.bam

The samtools view command needs to be used with a bit of care if not selecting the -H option. Unless directed otherwise, samtools will print the entire contents of the file to the screen (“the standard output”). We usually “pipe” the output to another unix command, such as head

samtools view paired.bam | head

Alternative, we can print the reads within a specific region. There are also options to print reads with particular flags.

samtools view 1:1-100000 paired.bam

Other useful samtools commands

samtools sort

Sorting
- The reads in a newly-aligned .sam file will probably be sorted according to the order they were generated by the sequencer
- Reads can be sorted according to genomic position
- Which allows us to access the file more easily

samtools index

Indexing
- Allow efficient access
- Producing a file .bai in the same directory
flagstat
Collates quality control information from the “flags”

samtools flagstat paired.bam

Typically, you will be dealing with .bam files that are

indexed
sorted
have had PCR duplicates marked

About PCR duplicates…

Marking of PCR duplicates
- PCR amplification errors can cause some sequences to be over-represented
- Chances of any two sequences aligning to the same position are unlikely
- Caveat: obviously this depends on amount of the genome you are capturing

Such reads are marked but not usually removed from the data
Most downstream methods will ignore such reads
Typically, picard is used

Picard is another very-common tool in NGS analysis with lots of conversion, manipulation tools. If you are seriously considering getting into NGS analysis, it is worth getting to know.

man picard-tools

picard-tools MarkDuplicates
## If you have time, see how you would remove mark duplicates for the file paired.bam

Summary

.sam and .bam files are a consistent way of representing alignment to the genome
They look the same regardless of the type of experiment
For each read, we have lots of useful information about how reliable the mapping might be
- the flag; https://broadinstitute.github.io/picard/explain-flags.html
- the CIGAR
We can utilise this information in our analyses
The easiest way to actually “see” the reads is using a browser such as IGV, as we will see in the next section.

(Optional / Advanced)

Locating where each generated sequence came from in the genome
Outside the scope of this course
Usually perfomed automatically by a sequencing service
For most of what follows in the course, we will assume alignment has been performed and we are dealing with aligned data
- Popular aligners
- bwa http://bio-bwa.sourceforge.net/
- bowtie http://bowtie-bio.sourceforge.net/index.shtml
- novoalign http://www.novocraft.com/products/novoalign/
- stampy http://www.well.ox.ac.uk/project-stampy
- many, many more…..

For completeness, we will show the commands used to aligned these two example fastq files.

We have previously downloaded a reference genome from the 1000 genomes into a directory ref_data which is a sub-directory of /home/participant/Course_Materials/. To check what files are present in this directory we can use the ls command.

cd  /reference_data

If the output of this command did not list Homo_sapiens_assembly19.fasta, we will have to retrieve it from the Broad Institue and un-compress

wget http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta

bwa can index the unzipped .fasta file. The option -p allows us how we want to name all the files that the indexing procedure creates. [N.B this step will take a long time]. The result of which should be files hg19.amb, hg19.ann, hg19.bwt, hg19.pac and hg19,sa.

bwa index -p hg19 -a bwtsw Homo_sapiens_assembly19.fasta

The command to perform the alignment is bwa aln which has many options. We are going to stick with the default.

bwa aln

Each .fastq file needs to be aligned separately. The structure of the command is as follows

bwa aln
the particular command in the bwa suite that we want to run; alignment in this case
../ref_data/hg19 relative path to the reference files that we just created with the prefix that we defined
sample.fq1 the .fastq file that we want to align
> sample_1.sai “redirecting” the output to a file. Without the >, the output would be printed directly to the screen

cd /data/test

bwa aln /reference_data/hg19 sample.fq1 > sample_1.sai

bwa aln /reference_data/hg19 sample.fq2 > sample_2.sai

The output from the bwa aln step is in a binary format that we cannot interpret directly. We need to make use of the bwa sampe command to create a human-readable format. We give it the files from the previous step and the fastq files.

bwa sampe /reference_data/hg19 sample_1.sai sample_2.sai sample.fq1 sample.fq2 > sample.sam

The output is a .sam file that was the starting point for this section.

Understanding Alignments

Mark Dunning

Last modified: 22 Jul 2017

Understanding file formats for aligned reads

The `.sam` file

Dr Mark Dunning presents….Fun with flags!

Derivation

Have a CIGAR!

samtools

Other useful samtools commands

About PCR duplicates…

Summary

(Optional / Advanced)

References:-

Understanding Alignments

Mark Dunning

Last modified: 22 Jul 2017

Understanding file formats for aligned reads

The .sam file

Dr Mark Dunning presents….Fun with flags!

Derivation

Have a CIGAR!

samtools

Other useful samtools commands

About PCR duplicates…

Summary

(Optional / Advanced)

References:-

The `.sam` file