The multiple alignment format (MAF) stores a series of
multiple alignments in an ASCII text format that is easy to
parse and read. MAF was created to
store multiple alignments at the DNA level between entire
genomes. It circumvents some of the problems inherent in other
multiple alignment formats, which are suitable for
alignments of single proteins or regions of DNA
without rearrangements, but are unable to handle genomic
issues such as forward and reverse strand directions or
multi-part alignments.
General Structure
MAF files (*.maf) are line-oriented. A file consists
of header and parameter lines, followed by a series
of alignment blocks (or paragraphs),
each of which contains one set of multiple alignments. Each
block is separated from the next by a blank line.
Blocks consist of two parts: an alignment line
showing the alignment parameters for the block, followed by
a set of sequence lines showing each sequence in the multiple
alignment at that location. In the current
implementation, parsers should ignore other types of
blocks and other types of lines within an alignment
block, as well as extra blank lines between blocks.
Lines starting with "#" are considered to
be comments. Lines starting with "##"
contain meta-data of one form or another and can be ignored
by most programs.
Example:
Here is a simple example
of a file containing two alignment blocks of three sequences each,
from human, horse, and fugu:
##maf version=1 scoring=probability
#mblastz 8.91 02-Jan-2005

a score=0.128
s human_hoxa 100 8 + 100257 ACA-TTACT
s horse_hoxa 120 9 - 98892 ACAATTGCT
s fugu_hoxa 88 7 + 90788 ACA--TGCT

a score=0.071
s human_unc 9077 8 + 10998 ACAGTATT
s horse_unc 4555 6 - 5099 ACA--ATT
s fugu_unc 4000 4 + 4038 AC----TT
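The block layout described above can be read with a short loop. The following is a minimal sketch rather than a reference parser: the function name iter_alignment_blocks is hypothetical, and it assumes the input is any iterable of text lines. It skips lines starting with "#", treats blank lines as block separators, tolerates extra blank lines, and drops blocks that do not begin with an "a" line.

```python
def iter_alignment_blocks(lines):
    """Yield each alignment block as a list of its non-comment lines."""
    block = []
    for raw in lines:
        line = raw.strip()
        if line.startswith("#"):
            # Comment ("#") or meta-data ("##") line: not part of any block.
            continue
        if line:
            block.append(line)
            continue
        # Blank line: close the current block, if there is one.
        if block and block[0].split()[0] == "a":
            yield block
        block = []  # blocks of other types are dropped
    if block and block[0].split()[0] == "a":
        yield block


sample = """\
##maf version=1 scoring=probability
#mblastz 8.91 02-Jan-2005

a score=0.128
s human_hoxa 100 8 + 100257 ACA-TTACT
s horse_hoxa 120 9 - 98892 ACAATTGCT
s fugu_hoxa 88 7 + 90788 ACA--TGCT


a score=0.071
s human_unc 9077 8 + 10998 ACAGTATT
s horse_unc 4555 6 - 5099 ACA--ATT
s fugu_unc 4000 4 + 4038 AC----TT
"""

blocks = list(iter_alignment_blocks(sample.splitlines()))
print(len(blocks))   # 2
print(blocks[1][0])  # a score=0.071
```

Because the function is a generator, it can be fed an open file object directly, so a genome-scale file never has to be held in memory at once.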
1. Header line
##maf version=1 scoring=probability
The first line of a .maf file begins with ##maf,
followed by one or more variable=value pairs
separated by white space. There must be no white
space on either side of the "=". Defined
variables include:
- version -- Required. Currently set to 1.
- scoring -- (Optional) The scoring
scheme used for the alignments. The current scoring
schemes are:
  - bit -- roughly corresponds to blast bit values:
    approximately 2 points per aligning base minus penalties for
    mismatches and inserts.
  - blastz -- blastz scoring scheme: roughly 100 points per
    aligning base.
  - probability -- some score normalized between 0 and 1.
- program -- (Optional) Name of the program
generating the alignment.
Parsers ignore variables they do not understand.
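As an illustration of these rules, the sketch below splits a header line into its variable=value pairs. The function name parse_maf_header and the choice to return a plain dictionary of strings are assumptions made for this example. It keeps only the three variables defined above, ignores any others, and rejects tokens that are not a well-formed pair, which is what results when white space surrounds the "=".

```python
KNOWN_HEADER_VARIABLES = {"version", "scoring", "program"}


def parse_maf_header(line):
    """Return the defined variables of a "##maf" header line as a dict."""
    fields = line.split()
    if not fields or fields[0] != "##maf":
        raise ValueError("header line must begin with ##maf")
    values = {}
    for field in fields[1:]:
        name, sep, value = field.partition("=")
        if not (name and sep and value):
            # Covers "version = 1", which splits into three separate tokens.
            raise ValueError(f"malformed variable=value pair: {field!r}")
        if name in KNOWN_HEADER_VARIABLES:
            values[name] = value  # unknown variables are silently ignored
    if "version" not in values:
        raise ValueError("version is required")
    return values


print(parse_maf_header("##maf version=1 scoring=probability"))
# {'version': '1', 'scoring': 'probability'}
```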
2. Parameter line
#mblastz 8.91 02-Jan-2005
One or more parameter lines follow the header. These
lines begin with "#" and display the parameters
used to run the alignment program.
3. Alignment lines
a score=0.128
Alignment lines contain the parameters used for that
particular alignment block. Each alignment block must begin with an
alignment line. Each alignment line begins with
"a",
followed by optional name=value
pairs. The currently defined name variables
are:
- score -- (Optional) Floating point score. If
included, the scoring attribute in the header line
should also be defined.
- pass -- (Optional) Positive integer value.
For programs that do multiple pass alignments, such as blastz,
pass indicates which pass the alignment came from.
Typically, pass 1 will find the strongest alignments
genome-wide, and pass 2 will find weaker alignments
between two first-pass alignments.
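The two variables differ in type, which a parser may want to enforce. The sketch below is one way to do that; the function name parse_alignment_line and the use of None for an absent variable are assumptions of this example, not part of the format.

```python
def parse_alignment_line(line):
    """Return the score and pass values of an "a" line (None when absent)."""
    fields = line.split()
    if not fields or fields[0] != "a":
        raise ValueError("alignment line must begin with 'a'")
    attrs = {"score": None, "pass": None}
    for field in fields[1:]:
        name, sep, value = field.partition("=")
        if not sep:
            continue  # not a name=value pair, so skip it
        if name == "score":
            attrs["score"] = float(value)
        elif name == "pass":
            number = int(value)
            if number < 1:
                raise ValueError("pass must be a positive integer")
            attrs["pass"] = number
        # Any other name is ignored.
    return attrs


print(parse_alignment_line("a score=0.128"))
# {'score': 0.128, 'pass': None}
```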
4. Sequence lines
s human_hoxa 100 8 + 100257 ACA-TTACT
s horse_hoxa 120 9 - 98892 ACAATTGCT
s fugu_hoxa 88 7 + 90788 ACA--TGCT
Sequence lines define the sequence of each species
within an alignment block. Not all species
will have a sequence line in every alignment block.
Each sequence in the alignment is entered on
a single line, one per species; these lines can get quite long.
All sequences within an alignment block must be of the same
length. Alignment gaps are padded with "-".
Words within a line are delimited by white space.
Each sequence line begins with "s" and contains the
following required fields:
- src -- Name of one of the source sequences
included in the alignment. For sequences that are resident
in a browser assembly, the form database.chromosome allows
automatic creation of links to other assemblies. Non-browser
sequences are typically referenced by the species name alone.
Species names must not contain spaces: concatenate multi-word
names or replace the space with an underscore.
- start -- Start of the aligning region in
the source sequence, using zero-based position coordinates.
If the strand value is "-", this field defines the
start relative to the reverse-complemented source sequence.
- size -- Size of the aligning region in
the source sequence. This is equal to the number of
non-dash characters in the text field
(see below).
- strand -- "+" or "-".
If the value is "-", the sequence aligns to the
reverse-complemented source.
- srcSize -- Size of the entire source
sequence, not just the portions involved in the alignment.
- text -- Nucleotides (or amino acids) in
the alignment, with "-" marking gaps where other sequences
in the block have insertions.
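The field definitions above translate into a small record type. The following sketch is illustrative only: the names SeqLine, parse_sequence_line, and forward_interval are hypothetical. The forward-strand conversion is derived from the definition of start on the "-" strand, and reporting it as a zero-based, end-exclusive interval is a convention chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class SeqLine:
    src: str
    start: int
    size: int
    strand: str
    src_size: int
    text: str

    def forward_interval(self):
        """Return (start, end) on the forward strand, zero-based, end-exclusive."""
        if self.strand == "+":
            return self.start, self.start + self.size
        # On "-", start counts from the end of the forward-strand sequence.
        end = self.src_size - self.start
        return end - self.size, end


def parse_sequence_line(line):
    """Parse one "s" line and check it against the field definitions."""
    fields = line.split()
    if len(fields) != 7 or fields[0] != "s":
        raise ValueError("expected 's' followed by six fields")
    _, src, start, size, strand, src_size, text = fields
    rec = SeqLine(src, int(start), int(size), strand, int(src_size), text)
    if rec.strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    if rec.size != len(text) - text.count("-"):
        raise ValueError("size must equal the number of non-dash characters")
    if rec.start < 0 or rec.start + rec.size > rec.src_size:
        raise ValueError("aligning region falls outside the source sequence")
    return rec


rec = parse_sequence_line("s horse_unc 4555 6 - 5099 ACA--ATT")
print(rec.forward_interval())  # (538, 544)
```

For the horse_unc line, the aligning region starts 4555 bases into the reverse-complemented sequence, so on the forward strand it ends at 5099 - 4555 = 544 and begins 6 bases earlier, at 538.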
Example:
Here is an example of three alignment
blocks derived from five starting sequences.
Repeats are shown in lower case, and each block may
have a subset of the input sequences. Each
column and row in the block of sequence lines must contain
at least one nucleotide; no columns or rows may consist
entirely of "-"s.
##maf version=1 scoring=tba.v8
# tba.v8 (((human chimp) baboon) (mouse rat))
# multiz.v7
# maf_project.v5 _tba_right.maf3 mouse _tba_C
# single_cov2.v4 single_cov2 /dev/stdin

a score=23262.0
s hg16.chr7 27578828 38 + 158545518 AAA-GGGAATGTTAACCAAATGA---ATTGTCTCTTACGGTG
s panTro1.chr6 28741140 38 + 161576975 AAA-GGGAATGTTAACCAAATGA---ATTGTCTCTTACGGTG
s baboon 116834 38 + 4622798 AAA-GGGAATGTTAACCAAATGA---GTTGTCTCTTATGGTG
s mm4.chr6 53215344 38 + 151104725 -AATGGGAATGTTAAGCAAACGA---ATTGTCTCTCAGTGTG
s rn3.chr4 81344243 40 + 187371129 -AA-GGGGATGCTAAGCCAATGAGTTGTTGTCTCTCAATGTG

a score=5062.0
s hg16.chr7 27699739 6 + 158545518 TAAAGA
s panTro1.chr6 28862317 6 + 161576975 TAAAGA
s baboon 241163 6 + 4622798 TAAAGA
s mm4.chr6 53303881 6 + 151104725 TAAAGA
s rn3.chr4 81444246 6 + 187371129 taagga

a score=6636.0
s hg16.chr7 27707221 13 + 158545518 gcagctgaaaaca
s panTro1.chr6 28869787 13 + 161576975 gcagctgaaaaca
s baboon 249182 13 + 4622798 gcagctgaaaaca
s mm4.chr6 53310102 13 + 151104725 ACAGCTGAAAATA
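The constraints on a block's sequence text (equal row lengths, and no row or column made up entirely of "-") can be checked mechanically. The sketch below is one possible check; the function name find_block_problems and the wording of its messages are assumptions of this example. It takes the text fields of one block and returns a list of problems, which is empty for a well-formed block.

```python
def find_block_problems(texts):
    """Return a list of rule violations for the text fields of one block."""
    if not texts:
        return ["block has no sequence lines"]
    if len({len(text) for text in texts}) != 1:
        # Columns are undefined when row lengths differ, so stop here.
        return ["sequences differ in length"]
    problems = []
    for row, text in enumerate(texts):
        if not text.strip("-"):
            problems.append(f"row {row} contains no residues")
    for col, column in enumerate(zip(*texts)):
        if all(char == "-" for char in column):
            problems.append(f"column {col} consists entirely of '-'")
    return problems


good = ["ACAGTATT", "ACA--ATT", "AC----TT"]
bad = ["AC-T", "AC-T"]
print(find_block_problems(good))       # []
print(len(find_block_problems(bad)))   # 1
```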