UCSC Genome Browser: Multiple Alignment Format (MAF)

Multiple Alignment Format (MAF)

The multiple alignment format (MAF) stores a series of multiple alignments in an ASCII text format that is easy to parse and read. MAF was created to store multiple alignments at the DNA level between entire genomes. It circumvents some of the problems inherent in other multiple alignment formats, which are suitable for alignments of single proteins or regions of DNA without rearrangements, but are unable to handle genomic issues such as forward and reverse strand directions or multi-part alignments.

General Structure

MAF files (*.maf) are line-oriented. A file consists of header and parameter lines, followed by a series of alignment blocks (or paragraphs), each of which contains one multiple alignment. Each block is separated from the next by a blank line.

Blocks consist of two parts: an alignment line showing the alignment parameters for the block, followed by a set of sequence lines showing each sequence in the multiple alignment at that location. In the current implementation, parsers should ignore other types of blocks and other types of lines within an alignment block, as well as extra blank lines between blocks.

Lines starting with "#" are considered to be comments. Lines starting with "##" contain meta-data of one form or another and can be ignored by most programs.

Example:
Here is a simple example of a file containing 2 alignments of 3 sequences each, from human, horse, and fugu:

 ##maf version=1 scoring=probability
 #mblastz 8.91 02-Jan-2005
 
 a score=0.128
 s human_hoxa 100  8 + 100257 ACA-TTACT
 s horse_hoxa 120  9 -  98892 ACAATTGCT
 s fugu_hoxa   88  7  + 90788 ACA--TGCT
 
 a score=0.071
 s human_unc 9077 8 + 10998 ACAGTATT
 s horse_unc 4555 6 -  5099 ACA--ATT
 s fugu_unc  4000 4 +  4038 AC----TT
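The block structure described above can be sketched as a small reader. This is a minimal illustration, not a reference parser: the function name iter_maf_blocks is hypothetical, and starting a new block at an "a" line even when the separating blank line is missing is a leniency assumed here, not something the format requires.

```python
def iter_maf_blocks(lines):
    """Yield (alignment_line, sequence_lines) for each alignment block."""
    a_line, s_lines = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current block; extra blanks are harmless.
            if a_line is not None:
                yield a_line, s_lines
            a_line, s_lines = None, []
            continue
        if line.startswith("#"):
            continue  # header ("##maf"), parameter, and comment lines
        kind = line.split(None, 1)[0]
        if kind == "a":
            # Assumption: tolerate a missing blank line before a new block.
            if a_line is not None:
                yield a_line, s_lines
            a_line, s_lines = line, []
        elif kind == "s" and a_line is not None:
            s_lines.append(line)
        # Any other line type is ignored.
    if a_line is not None:
        yield a_line, s_lines
```

Passing an open file object (or any iterable of lines) yields one tuple per block, with the "a" line and its "s" lines left as raw strings for later field-level parsing.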

1. Header line

 ##maf version=1 scoring=probability

The first line of a .maf file begins with ##maf, followed by one or more variable=value pairs separated by white space. There must be no white space on either side of the "=". Defined variables include:

version -- Required. Currently set to 1.
scoring -- (Optional) The scoring scheme used for the alignments. The current scoring schemes are:
- bit -- roughly corresponds to blast bit values: approximately 2 points per aligning base minus penalties for mismatches and inserts.
- blastz -- blastz scoring scheme: roughly 100 points per aligning base.
- probability -- some score normalized between 0 and 1.
program -- (Optional) Name of the program generating the alignment.

Parsers should ignore variables they do not recognize.
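As a rough sketch of the header rules above, the following splits the variable=value pairs into a dictionary. The function name parse_maf_header is hypothetical, and rejecting a token that lacks "=" is one possible way to enforce the no-white-space rule; a more lenient parser could skip such tokens instead.

```python
def parse_maf_header(line):
    """Parse a '##maf' header line into a dict of variable -> value strings."""
    if not line.startswith("##maf"):
        raise ValueError("header line must begin with ##maf")
    fields = {}
    for token in line[len("##maf"):].split():
        name, sep, value = token.partition("=")
        if not sep or not name:
            # White space around '=' would leave a token without a name=value shape.
            raise ValueError(f"malformed header token: {token!r}")
        fields[name] = value
    if "version" not in fields:
        raise ValueError("header is missing the required 'version' variable")
    return fields
```

Unrecognized variables are kept in the returned dictionary, so a caller can simply look up the names it knows and leave the rest alone.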

2. Parameter line

  #mblastz 8.91 02-Jan-2005

One or more parameter lines follow the header. These lines begin with "#" and display the parameters used to run the alignment program.

3. Alignment lines

  a score=0.128

Alignment lines contain the parameters used for that particular alignment block. Each alignment block must begin with an alignment line. Each alignment line begins with "a", followed by optional name=value pairs. The currently defined name variables are:

score -- (Optional) Floating point score. If included, the scoring attribute in the header line should also be defined.
pass -- (Optional) Positive integer value. For programs that do multiple pass alignments, such as blastz, pass indicates which pass the alignment came from. Typically, pass 1 will find the strongest alignments genome-wide, and pass 2 will find weaker alignments between two first-pass alignments.
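A minimal sketch of reading an alignment line follows. The function name parse_alignment_line is hypothetical; it converts only the two variables defined above and silently skips anything else, which is an assumption about how unknown variables should be handled.

```python
def parse_alignment_line(line):
    """Return the defined name=value pairs of an 'a' line with typed values."""
    tokens = line.split()
    if not tokens or tokens[0] != "a":
        raise ValueError("alignment line must begin with 'a'")
    # Keep only tokens that look like name=value.
    params = dict(tok.split("=", 1) for tok in tokens[1:] if "=" in tok)
    result = {}
    if "score" in params:
        result["score"] = float(params["score"])  # floating point score
    if "pass" in params:
        result["pass"] = int(params["pass"])      # positive integer pass number
    return result
```

Both variables are optional, so a bare "a" line yields an empty dictionary.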

4. Sequence lines

  s human_hoxa 100  8 + 100257 ACA-TTACT
  s horse_hoxa 120  9 -  98892 ACAATTGCT
  s fugu_hoxa   88  7  + 90788 ACA--TGCT

Sequence lines define the sequence of each species within an alignment block. Not all species will have a sequence line in every alignment block. Each sequence in the alignment is entered on a single line, one per species, so lines can get quite long. All sequence texts within an alignment block must be of the same length, with alignment gaps padded with "-". Words within a line are delimited by white space.

Each sequence line begins with "s" and contains the following required fields:

src -- Name of one of the source sequences included in the alignment. For sequences that are resident in a browser assembly, the form database.chromosome allows automatic creation of links to other assemblies. Non-browser sequences are typically referenced by the species name alone. Species names must not contain spaces: concatenate multi-word names or replace the space with an underscore.
start -- Start of the aligning region in the source sequence, using zero-based position coordinates. If the strand value is "-", this field defines the start relative to the reverse-complemented source sequence.
size -- Size of the aligning region in the source sequence. This is equal to the number of non-dash characters in the text field (see below).
strand -- "+" or "-". If the value is "-", the sequence aligns to the reverse-complemented source.
srcSize -- Size of the entire source sequence, not just the portions involved in the alignment.
text -- Nucleotides (or amino acids) in the alignment, with "-" marking the gaps introduced where other sequences have insertions.
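The field definitions above can be sketched as a small record type plus two helpers. The names SeqLine, parse_sequence_line, and forward_interval are hypothetical. The size check follows the rule that size equals the number of non-dash characters in text, and the strand conversion follows from start being measured on the reverse-complemented source when strand is "-".

```python
from typing import NamedTuple


class SeqLine(NamedTuple):
    src: str
    start: int
    size: int
    strand: str
    src_size: int
    text: str


def parse_sequence_line(line):
    """Split an 's' line into its six fields and check the size rule."""
    tokens = line.split()
    if len(tokens) != 7 or tokens[0] != "s":
        raise ValueError("expected 's' followed by six fields")
    _, src, start, size, strand, src_size, text = tokens
    rec = SeqLine(src, int(start), int(size), strand, int(src_size), text)
    if rec.strand not in ("+", "-"):
        raise ValueError(f"bad strand: {rec.strand!r}")
    if rec.size != len(rec.text) - rec.text.count("-"):
        raise ValueError("size must equal the number of non-dash characters")
    return rec


def forward_interval(rec):
    """Return the zero-based, half-open (start, end) on the forward strand."""
    if rec.strand == "+":
        return rec.start, rec.start + rec.size
    # On "-", start counts from the end of the forward-strand sequence.
    end = rec.src_size - rec.start
    return end - rec.size, end


rec = parse_sequence_line("s horse_unc 4555 6 -  5099 ACA--ATT")
print(forward_interval(rec))  # (538, 544)
```

For the horse_unc line, the region starts 4555 bases into the reverse-complemented sequence of length 5099, so on the forward strand it ends at 5099 - 4555 = 544 and begins 6 bases earlier, at 538.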

Example:
Here is an example of three alignment blocks derived from five starting sequences. Repeats are shown in lower case, and each block may have a subset of the input sequences. Each column and row in the block of sequence lines must contain at least one nucleotide; no columns or rows may consist entirely of "-"s.

 ##maf version=1 scoring=tba.v8
 # tba.v8 (((human chimp) baboon) (mouse rat))
 # multiz.v7
 # maf_project.v5 _tba_right.maf3 mouse _tba_C
 # single_cov2.v4 single_cov2 /dev/stdin

 a score=23262.0
 s hg16.chr7    27578828 38 + 158545518 AAA-GGGAATGTTAACCAAATGA---ATTGTCTCTTACGGTG
 s panTro1.chr6 28741140 38 + 161576975 AAA-GGGAATGTTAACCAAATGA---ATTGTCTCTTACGGTG
 s baboon         116834 38 +   4622798 AAA-GGGAATGTTAACCAAATGA---GTTGTCTCTTATGGTG
 s mm4.chr6     53215344 38 + 151104725 -AATGGGAATGTTAAGCAAACGA---ATTGTCTCTCAGTGTG
 s rn3.chr4     81344243 40 + 187371129 -AA-GGGGATGCTAAGCCAATGAGTTGTTGTCTCTCAATGTG

 a score=5062.0
 s hg16.chr7    27699739 6 + 158545518 TAAAGA
 s panTro1.chr6 28862317 6 + 161576975 TAAAGA
 s baboon         241163 6 +   4622798 TAAAGA
 s mm4.chr6     53303881 6 + 151104725 TAAAGA
 s rn3.chr4     81444246 6 + 187371129 taagga

 a score=6636.0
 s hg16.chr7    27707221 13 + 158545518 gcagctgaaaaca
 s panTro1.chr6 28869787 13 + 161576975 gcagctgaaaaca
 s baboon         249182 13 +   4622798 gcagctgaaaaca
 s mm4.chr6     53310102 13 + 151104725 ACAGCTGAAAATA
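
The block-level constraints stated above (equal text lengths, and no row or column made up entirely of "-") can be checked with a short routine. This is a sketch only: the function name check_block_texts is hypothetical, and it takes just the text fields of a block's sequence lines.

```python
def check_block_texts(texts):
    """Raise ValueError if the text fields of one block break the layout rules."""
    if not texts:
        raise ValueError("block has no sequence lines")
    width = len(texts[0])
    if any(len(t) != width for t in texts):
        raise ValueError("sequence texts differ in length")
    for row, text in enumerate(texts):
        if text and set(text) == {"-"}:
            raise ValueError(f"row {row} consists entirely of '-'")
    # zip(*texts) walks the alignment column by column.
    for col, chars in enumerate(zip(*texts)):
        if all(c == "-" for c in chars):
            raise ValueError(f"column {col} consists entirely of '-'")


# Texts of the second block in the example above; passes without raising.
check_block_texts(["TAAAGA", "TAAAGA", "TAAAGA", "TAAAGA", "taagga"])
```

Lower-case repeat characters count as ordinary residues here, since only "-" is treated as a gap.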