Unlike most of Bioinfomatics, a single standard file format has emerged for aligned reads. Moreoever, this file format is consistent regardless of whether you have DNA-seq, RNA-seq, ChIP-seq… data.
.sam
filebwa
.bam
file. See later.@
character, followed by tab-delimited lines
The first part of the header lists the names (SN
) of the sequences (chromosomes) used in alignment, their length (LN
) and a md5sum “digital fingerprint” of the .fasta
file used for alignment (M5
).
@HD VN:1.5 SO:coordinate GO:none
@SQ SN:1 LN:249250621 M5:1b22b98cdeb4a9304cb5d48026a85128
@SQ SN:2 LN:243199373 M5:a0d9851da00400dec1098a9255ac712e
@SQ SN:3 LN:198022430 M5:fdfd811849cc2fadebc929bb925902e5
@SQ SN:4 LN:191154276 M5:23dccd106897542ad87d2765d28a19a1
.....
.....
Next we can define the read groups present in the file which we can use to identify which sequencing library, sequencing centre, Lane, sample name etc.
@RG ID:SRR077850 CN:bi LB:Solexa-42057 PL:illumina PU:ILLUMINA SM:NA06984
@RG ID:SRR081675 CN:bi LB:Solexa-42316 PL:illumina PU:ILLUMINA SM:NA06984
@RG ID:SRR080818 CN:bi LB:Solexa-44770 PL:illumina PU:ILLUMINA SM:NA06984
@RG ID:SRR084838 CN:bi LB:Solexa-42316 PL:illumina PU:ILLUMINA SM:NA06984
@RG ID:SRR081730 CN:bi LB:Solexa-42316 PL:illumina PU:ILLUMINA SM:NA06984
.....
.....
Finally, we have a section where we can record the processing steps used to derive the file
@PG ID:MosaikAligner CL:/share/home/wardag/programs/MOSAIK/bin/MosaikAligner -in /scratch/wardag/NA06984.SRR077850.mapped.illumina.mosaik.CEU.SINGLE.20111114/NA06984.SRR077850.mapped.illumina.mosaik.CEU.SINGLE.20111114.mkb -out
....
....
Next is a tab-delimited section that describes the alignment of each sequence in detail.
SRR081708.237649 163 1 10003 6 1S67M = 10041 105 GACCCTGACCCTAACCCTGACCCTGACCCTAACCCTGACCCTGACCCTAACCCTGACCCTAACCCTAA S=<====<<>=><?=?=?>==@??;?>@@@=??@@????@??@?>?@@<@>@'@=?=??=<=>?>?=Q ZA:Z:<&;0;0;;308;68M;68><@;0;0;;27;;>MD:Z:5A11A5A11A5A11A13 RG:Z:SRR081708 NM:i:6 OQ:Z:GEGFFFEGGGDGDGGGDGA?DCDD:GGGDGDCFGFDDFFFCCCBEBFDABDD-D:EEEE=D=DDDDC:
Column | Official Name | Brief |
---|---|---|
1 | QNAME | Sequence ID |
2 | FLAG | Sequence quality expressed as a bitwise flag |
3 | RNAME | Chromosome |
4 | POS | Start Position |
5 | MAPQ | Mapping Quality |
6 | CIGAR | Describes positions of matches, insertions, deletions w.r.t reference |
7 | RNEXT | Ref. name of mate / next read |
8 | PNEXT | Postion of mate / next read |
9 | TLEN | Observed Template length |
10 | SEQ | Sequence |
11 | QUAL | Base Qualities |
There can also be all manner of optional tags as extra columns introduce by an aligner or downstream analysis tool. A common use is the RG
tag which refers back to the read groups in the header.
Unlike the .fastq
files, where we had a separate file for forward and reverse reads, the .sam
file contains all reads. Reads that are paired with each other should appear in consecutive lines in the .sam
file generated by an aligner. Otherwise it is more common for the file to be sorted according to genomic coordinates. The paired reads should share the same sequence ID in the first column (sometimes with a /1
or /2
to indicate which is which).
The “flags” in the sam file can represent useful QC information
The combination of any of these properties is used to derive a numeric value
For instance, a particular read has a flag of 163
There is a set of properties that a read can possess. If a particular property is observed, a corresponding power of 2 is added multiplied by 1. The final value is derived by summing all the powers of 2.
ReadHasProperty | Binary | MultiplyBy | |
---|---|---|---|
isPaired | TRUE | 1 | 1 |
isProperPair | TRUE | 1 | 2 |
isUnmappedQuery | FALSE | 0 | 4 |
hasUnmappedMate | FALSE | 0 | 8 |
isMinusStrand | FALSE | 0 | 16 |
isMateMinusStrand | TRUE | 1 | 32 |
isFirstMateRead | TRUE | 1 | 64 |
isSecondMateRead | FALSE | 0 | 128 |
isSecondaryAlignment | FALSE | 0 | 256 |
isNotPassingQualityControls | FALSE | 0 | 512 |
isDuplicate | FALSE | 0 | 1024 |
Value of flag is given by 1x1 + 1x2 + 0x4 + 0x8 + 0x16 + 1x32 + 1x64 + 0x128 + 0x256 + 0x512 + 0x1024 = 99
See also
The CIGAR (Compact Idiosyncratic Gapped Alignment Report) string is a way of encoding the match between a given sequence and the position it has been assigned in the genome. It is comprised by a series of letters and numbers to indicate how many consecutive bases have that mapping.
Code | Description |
---|---|
M | alignment match |
I | insertion |
D | deletion |
N | skipped |
S | soft-clipping |
H | hard-clipping |
e.g.
68M
1S67M
15M87N70M90N16M
Rather than dealing with .sam
files, we usually analyse a .bam
file.
If you want to try any of these commands, click the link to the CRUK Docker on the Desktop
samtools
is one of the most-popular ngs-related software suites and has a wealth of tools for dealing with files in .bam
and .sam
format. If you are going to start processing your own data, the chances are you’ll be using samtools
a lot. Typing samtools
in a terminal will give a quick overview of the functions available, and importantly, the version number of the software.
samtools
More information about a particular command within samtools can be displayed by printing samtools
followed by the name of the command. For example, the samtools view
command can convert a .sam
file into a compressed version called a .bam
file. This is a very-common step in an NGS workflow.
Lets assume we have a file sample.sam
which is the result of aligning reads using a tool such as bwa
. Then the command to convert would be:-
# We don't actually have a file called sample.sam. This is just an example of what the command would look like
samtools view -bS sample.sam > sample.bam
-b: output a bam file
-S: input is a sam file
> : redirect to a file
So a .bam
file is
cat
, head
etc will print garbage to the screenWhen viewing a .sam
or .bam
, we can choose to just view the header information
cd /data/test
samtools view -H paired.bam
The samtools view
command needs to be used with a bit of care if not selecting the -H
option. Unless directed otherwise, samtools
will print the entire contents of the file to the screen (“the standard output”). We usually “pipe” the output to another unix command, such as head
samtools view paired.bam | head
Alternative, we can print the reads within a specific region. There are also options to print reads with particular flags.
samtools view 1:1-100000 paired.bam
samtools sort
.sam
file will probably be sorted according to the order they were generated by the sequencersamtools index
.bai
in the same directorysamtools flagstat paired.bam
Typically, you will be dealing with .bam
files that are
Picard is another very-common tool in NGS analysis with lots of conversion, manipulation tools. If you are seriously considering getting into NGS analysis, it is worth getting to know.
man picard-tools
picard-tools MarkDuplicates
## If you have time, see how you would remove mark duplicates for the file paired.bam
.sam
and .bam
files are a consistent way of representing alignment to the genomeFor completeness, we will show the commands used to aligned these two example fastq files.
We have previously downloaded a reference genome from the 1000 genomes into a directory ref_data
which is a sub-directory of /home/participant/Course_Materials/
. To check what files are present in this directory we can use the ls
command.
cd /reference_data
If the output of this command did not list Homo_sapiens_assembly19.fasta
, we will have to retrieve it from the Broad Institue and un-compress
wget http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta
bwa
can index the unzipped .fasta
file. The option -p
allows us how we want to name all the files that the indexing procedure creates. [N.B this step will take a long time]. The result of which should be files hg19.amb
, hg19.ann
, hg19.bwt
, hg19.pac
and hg19,sa
.
bwa index -p hg19 -a bwtsw Homo_sapiens_assembly19.fasta
The command to perform the alignment is bwa aln
which has many options. We are going to stick with the default.
bwa aln
Each .fastq
file needs to be aligned separately. The structure of the command is as follows
bwa aln
bwa
suite that we want to run; alignment in this case../ref_data/hg19
relative path to the reference files that we just created with the prefix that we definedsample.fq1
the .fastq
file that we want to align> sample_1.sai
“redirecting” the output to a file. Without the >
, the output would be printed directly to the screencd /data/test
bwa aln /reference_data/hg19 sample.fq1 > sample_1.sai
bwa aln /reference_data/hg19 sample.fq2 > sample_2.sai
The output from the bwa aln
step is in a binary format that we cannot interpret directly. We need to make use of the bwa sampe
command to create a human-readable format. We give it the files from the previous step and the fastq files.
bwa sampe /reference_data/hg19 sample_1.sai sample_2.sai sample.fq1 sample.fq2 > sample.sam
The output is a .sam
file that was the starting point for this section.