FASTA Documentation



                        COPYRIGHT NOTICE

Copyright 1988, 1991, 1992, 1994, 1995, 1996, 1999 by William R.
Pearson and the University of Virginia.  All rights reserved. The
FASTA program and documentation may not be sold or incorporated
into a commercial product, in whole or in part, without written
consent of William R. Pearson and the University of Virginia.
For further information regarding permission for use or
reproduction, please contact: David Hudson, Assistant Provost for
Research, University of Virginia, P.O. Box 9025, Charlottesville,
VA 22906-9025, (804) 924-6853

The FASTA program package

Introduction

     This documentation describes version 3 of the FASTA
program package (see W. R. Pearson and D. J. Lipman (1988),
"Improved Tools for Biological Sequence Comparison", PNAS
85:2444-2448 (Pearson and Lipman, 1988); W. R. Pearson (1996)
"Effective protein sequence comparison" Meth. Enzymol.
266:227-258 (Pearson, 1996); Pearson et al. (1997) Genomics
46:24-36 (Zhang et al., 1997); Pearson (1999) Meth. in
Molecular Biology 132:185-219 (Pearson, 1999)).  Version 3 of the
FASTA package contains many programs for searching DNA and
protein databases and one program (prss3) for evaluating
statistical significance from randomly shuffled sequences.
Several additional analysis programs, including programs that
produce local alignments, are available as part of version 2 of
the FASTA package, which is still available.

     This document is divided into three sections: (1) A summary
overview of the programs in the FASTA3 package; (2) A guide to
installing the programs and databases; (3) A guide to using the
FASTA programs. The revision history of the programs can be found
in the readme.v30..v33 files. The programs are very easy to use,
so if you are using them on a machine that is administered by
someone else, you can skip section (2) and focus on (1) and (3)
to learn how to use the programs.  If you are installing the
programs on your own machine, you will need to read section (2)
carefully.

1.  An overview of the FASTA programs

     Although there are a large number of programs in this
package, they belong to three groups: (1) "Conventional" Library
search programs: FASTA3, FASTX3, FASTY3, TFASTA3, TFASTX3,
TFASTY3, SSEARCH3; (2) Programs for searching with short
fragments: FASTS3, FASTF3, TFASTS3, TFASTF3; (3) Statistical
significance: PRSS3.  Programs whose names start with "fast"
search protein databases (fasta3 can also search a DNA database
with a DNA query), while the "tfast" programs search translated
DNA databases.  Table I gives a brief description of the programs.

2.  Installing FASTA and the sequence databases

2.1.  Obtaining the libraries

     The FASTA program package does not include any protein or
DNA sequence libraries.  Protein databases are available on CD-ROM
from the PIR and EMBL (see below), or via anonymous FTP from
many different sources.  As this document is updated in the fall
of 1999, no DNA databases are available on CD-ROM from the major
sequence databases: Genbank at the National Center for Biotechnology
Information (www.ncbi.nlm.nih.gov and ftp://ncbi.nlm.nih.gov) and
EMBL at the European Bioinformatics Institute (www.ebi.ac.uk).
However, the databases are available via anonymous FTP from both
sites.

2.1.1.  The GENBANK DNA sequence library

     Because of the large size of DNA databases, you will
probably want to keep DNA databases in only one, or possibly two,
formats.  The FASTA3 programs that search DNA databases (fasta3,
tfastx/y3, and tfasta3) can read DNA databases in Genbank flatfile
(not ASN.1), FASTA, GCG/compressed-binary, BLAST1.4 (pressdb),
and BLAST2.0 (formatdb) formats, as well as EMBL format.  If you
are also running the GCG suite of sequence analysis programs, you
should use GCG/compressed-binary format or BLAST2.0 format for
your fasta3 searches.  If not, BLAST2.0 is a good choice.  These
files are considerably more compact than Genbank flat files, and
are preferred.  The NCBI does not provide software for converting
from Genbank flat files to Blast2.0 DNA databases, but you can
use the Blast formatdb program to convert ASN.1 formatted Genbank
files, which are available from the NCBI ftp site.

     The NCBI also provides the nr, swissprot, and several EST
databases that are used by BLAST in FASTA format from:
ftp://ncbi.nlm.nih.gov/blast/db.  These databases are updated
nightly.

2.1.2.  The NBRF protein sequence library

     You can obtain the PIR protein sequence database (Barker et
al., 1998) from:

    National  Biomedical Research Foundation
    Georgetown  University  Medical  Center
    3900 Reservoir Rd, N.W.
    Washington, D.C. 20007

or via ftp from nbrf.georgetown.edu or from the NCBI
(ncbi.nlm.nih.gov/repository/PIR). The data in the ascii
directory is in PIR Codata format, which is not widely used.  I
recommend the PIR/VMS format data (libtype=5) in the vms
directory.

              Table I. Comparison programs in the FASTA3 package

-------------------------------------------------------------------------------
fasta3             Compare a protein sequence to a protein sequence database or
                   a  DNA  sequence  to a DNA sequence database using the FASTA
                   algorithm (Pearson and Lipman, 1988, Pearson, 1996).  Search
                   speed and selectivity are controlled with the ktup(wordsize)
                   parameter.  For protein comparisons, ktup =  2  by  default;
                   ktup  =1 is more sensitive but slower.  For DNA comparisons,
                   ktup=6 by default; ktup=3 or ktup=4 provides  higher  sensi-
                   tivity;  ktup=1  should  be  used  for oligonucleotides (DNA
                   query lengths < 20).

ssearch3           Compare a protein sequence to a protein sequence database or
                   a  DNA  sequence to a DNA sequence database using the Smith-
                   Waterman algorithm (Smith and Waterman, 1981).  ssearch3  is
                   about 10-times slower than FASTA3, but is more sensitive for
                   full-length protein sequence comparison.

fastx3/ fasty3     Compare a DNA sequence to a protein  sequence  database,  by
                   comparing  the  translated  DNA sequence in three frames and
                   allowing gaps  and  frameshifts.   fastx3  uses  a  simpler,
                   faster algorithm for alignments that allows frameshifts only
                   between codons; fasty3 is slower but produces better  align-
                   ments  with  poor  quality sequences because frameshifts are
                   allowed within codons.

tfastx3/ tfasty3   Compare a protein sequence to a DNA sequence database,  cal-
                   culating  similarities  with  frameshifts to the forward and
                   reverse orientations.

tfasta3            Compare a protein sequence to a DNA sequence database,
                   calculating similarities (without frameshifts) to the three
                   forward and three reverse reading frames.  tfastx3 and
                   tfasty3 are preferred because they calculate similarity
                   over frameshifts.

fastf3/tfastf3     Compares an ordered peptide mixture, as would be obtained by
                   Edman degradation of a CNBr cleavage of a protein, against a
                   protein (fastf) or DNA (tfastf) database.

fasts3/tfasts3     Compares a set of short peptide fragments, as would be
                   obtained from mass-spec. analysis of a protein, against a
                   protein (fasts) or DNA (tfasts) database.
-------------------------------------------------------------------------------
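
     The choice of program in Table I depends only on the type of
the query and the type of the library.  The sketch below records
that mapping as a small Python lookup.  The function name and the
category labels ("peptide mixture", "peptide set") are hypothetical
names chosen for this illustration; only the program names come from
Table I.

```python
# Hypothetical helper: the mapping is transcribed from Table I.
# Keys are (query type, library type); values list the candidate
# programs in the order the table presents them.
PROGRAMS = {
    ("protein", "protein"): ["fasta3", "ssearch3"],
    ("dna", "dna"): ["fasta3", "ssearch3"],
    ("dna", "protein"): ["fastx3", "fasty3"],
    # tfasta3 is listed last because Table I prefers tfastx3/tfasty3.
    ("protein", "dna"): ["tfastx3", "tfasty3", "tfasta3"],
    ("peptide mixture", "protein"): ["fastf3"],
    ("peptide mixture", "dna"): ["tfastf3"],
    ("peptide set", "protein"): ["fasts3"],
    ("peptide set", "dna"): ["tfasts3"],
}


def candidate_programs(query_type, library_type):
    """Return the Table I programs for a query/library combination."""
    key = (query_type.lower(), library_type.lower())
    if key not in PROGRAMS:
        raise ValueError("no program in Table I for %s vs %s" % key)
    return list(PROGRAMS[key])
```

For example, candidate_programs("DNA", "protein") returns the two
translated-query programs, fastx3 and fasty3.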

2.1.3.  The EBI/EMBL CD-ROM libraries

     The European Bioinformatics Institute (EBI) distributes both
the EMBL DNA database and the SwissProt database on CD-ROM
(Bairoch and Apweiler, 1996), and they are available from:

    EMBL-Outstation  European Bioinformatics Institute
    Wellcome Trust Genome Campus,
    Hinxton Hall
    Hinxton,
    Cambridge CB10 1SD
    United Kingdom
    Tel: +44 (0)1223 494444
    Fax: +44 (0)1223 494468
    Email: DATALIB@ebi.ac.uk

In addition, the SWISS-PROT protein sequence database is
available via anonymous FTP from
ftp://ftp.expasy.ch/databases/swiss-prot/ (also see
www.expasy.ch).

2.2.  Finding the libraries: FASTLIBS

     The major problem that most new users of the FASTA package
have is in setting up the program to find the databases and their
library type.  In general, if you cannot get fasta3 to read a
sequence database, it is likely that something is wrong with the
FASTLIBS file.  A common problem is that the database file is
found, but either no sequences are read, or an incorrect number
of entries is read.  This is almost always because the library
format (libtype) is incorrect.  Note that a type 5 file (PIR/VMS
format) can be read as a type 0 (default FASTA) format file, and
the number of entries will be correct, but the sequence lengths
will not.

     All the search programs in the FASTA3 package use the
environment variable FASTLIBS to find the protein and DNA
sequence libraries.  The FASTLIBS variable contains the name of a
file that has the actual filenames of the libraries.  The
fastlibs file included with the distribution is an example of
a file that can be referred to by FASTLIBS. To use the fastlibs
file, type:

    setenv FASTLIBS /usr/lib/fasta/fastgbs (BSD UNIX/csh)
    or
    export FASTLIBS=/usr/lib/fasta/fastgbs (SysV UNIX/ksh)

Then edit the fastlibs file to indicate where the protein and DNA
sequence libraries can be found.  If you have a hard disk and
your protein sequence library is kept in the file
/usr/lib/seq/aabank.lib and your Genbank DNA sequence library is
kept in the directory /usr/lib/genbank, then fastgbs might contain:

    NBRF Protein$0P/usr/lib/seq/aabank.lib 0
    SWISS PROT 10$0S/usr/lib/vmspir/swiss.seq 5
    GB Primate$1P@/usr/lib/genbank/gpri.nam
    GB Rodent$1R@/usr/lib/genbank/grod.nam
    GB Mammal$1M@/usr/lib/genbank/gmammal.nam
    ^   1    ^^^^       4                   ^     ^
              23                             (5)

The first line of this file says that a copy of the NBRF protein
sequence database (which is a protein database) is kept in the
file /usr/lib/seq/aabank.lib and can be selected by typing "P" on
the command line or when the database menu is presented.

     Note that there are 4 or 5 fields in the lines in fastgbs.
The first field is the description of the library which will be
displayed by FASTA; it ends with a '$'.  The second field (1
character), is a 0 if the library is a protein library and 1 if
it is a DNA library.  The third field (1 character) is the
character to be typed to select the library.

     The fourth field is the name of the library file.  In the
example above, the /usr/lib/seq/aabank.lib file contains the
entire protein sequence library.  However the DNA library file
names are preceded by a '@', because these files (gpri.nam,
grod.nam, gmammal.nam) do not contain the sequences; instead they
contain the names of the files which contain the sequences.  This
is done because the GENBANK DNA database is broken down in to a
large number of smaller files.  In order to search the entire
primate database, you must search more than a dozen files.

     In addition, an optional fifth field can be used to specify
the format of the library file.  Alternatively, you can specify
the library format in a file of file names (a file preceded by an
'@').  This field must be separated from the file name by a space
character (' ').  In the example above, the
aabank.lib file is in Pearson/FASTA format, while the swiss.seq
file is in PIR/VMS format (from the EMBL CD-ROM). Currently,
FASTA can read the following formats:

    0 Pearson/FASTA (>SEQID - comment/sequence)
    1 Uncompressed Genbank (LOCUS/DEFINITION/ORIGIN)
    2 NBRF CODATA (ENTRY/SEQUENCE)
    3 EMBL/SWISS-PROT (ID/DE/SQ)
    4 Intelligenetics (;comment/SEQID/sequence)
    5 NBRF/PIR VMS (>P1;SEQID/comment/sequence)
    6 GCG (version 8.0) Unix Protein and DNA (compressed)
    11 NCBI Blast1.3.2 format  (unix only)
    12 NCBI Blast2.0 format  (unix only, fasta32t08 or later)

In particular, this version will work with the EMBL and PIR VMS
formats that are distributed on the EMBL CD-ROM. The latter
format (PIR VMS) is much faster to search than EMBL format.  This
release also works with the protein and DNA database formats
created for the BLASTP and BLASTN programs by SETDB and PRESSDB
and with the new NCBI search format.  If a library format is not
specified, for example, because you are just comparing two
sequences, Pearson/FASTA (format 0) is used by default. To
specify a library type on the command line, add it to the library
filename and surround the filename and library type in quotes:

    fasta3 query.file "/seqdb/genbank/gbpri1.seq 1"
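
     The field layout of a FASTLIBS line described above can be
checked mechanically.  The following Python sketch splits one line
into its four or five fields.  It is an illustration only, not the
parser used by the FASTA programs; the names LibraryEntry and
parse_fastlibs_line are hypothetical, and the sketch assumes that
file names contain no spaces.

```python
from typing import NamedTuple, Optional


class LibraryEntry(NamedTuple):
    description: str         # field 1: text shown in the menu
    is_dna: bool             # field 2: '0' protein, '1' DNA
    select_char: str         # field 3: character typed to select
    path: str                # field 4: library file (without '@')
    is_file_of_names: bool   # True if field 4 began with '@'
    libtype: Optional[int]   # field 5: library format, if given


def parse_fastlibs_line(line: str) -> LibraryEntry:
    """Split one FASTLIBS line into its fields."""
    description, dollar, rest = line.strip().partition("$")
    if not dollar or len(rest) < 3:
        raise ValueError("expected 'description$TCfilename [libtype]'")
    kind, select_char, location = rest[0], rest[1], rest[2:]
    if kind not in ("0", "1"):
        raise ValueError("second field must be 0 (protein) or 1 (DNA)")
    # The optional library type is separated from the name by a space.
    path, _, fmt = location.partition(" ")
    is_file_of_names = path.startswith("@")
    if is_file_of_names:
        path = path[1:]
    libtype = int(fmt) if fmt.strip() else None
    return LibraryEntry(description, kind == "1", select_char,
                        path, is_file_of_names, libtype)
```

Applied to the line "SWISS PROT 10$0S/usr/lib/vmspir/swiss.seq 5",
the sketch reports a protein library selected with "S" and stored in
library format 5.  A line that parses here but still yields no
sequences in a search most likely has the wrong library type.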


     You can specify a group of library files by putting a '@'
symbol before a file that contains a list of file names to be
searched.  For example, if @gmam.nam is in the fastgbs file, the
file "gmam.nam" might contain the lines:

    ' in the first column. (3) distributed sequence libraries
(this is a broad class that includes the NBRF/PIR VMS and blocked
ascii formats, Genbank flat-file format, EMBL flat-file format,
and Intelligenetics format).  All of the files that you create
should be of type (1) or (2).  FASTA format files (ones with a
'>' and comment before the sequence) are preferred, because they
can be used as query or library sequence files by all of the
programs.

     I have included several sample test files, *.aa and *.seq as
well as two small sequence libraries, prot_test.lib and gst.nlib.
The first line may begin with a '>' followed by a comment.  Spaces
and tabs (and anything else that is not an amino-acid code) are
ignored.

     Library files should have the form:

    >Sequence name and identifier
    A F A S Y T .... actual sequence.
    F S S       .... second line of sequence.
    >Next sequence name and identifier

This is often referred to as "FASTA" or "Pearson" format.  You can build
your own library by concatenating several sequence files.  Just
be sure that each sequence is preceded by a line beginning with a
'>' with a sequence name.
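
     As an illustration of the library format just described, the
following Python sketch reads such a file into (name, sequence)
pairs.  It is not the reader used by the FASTA programs.  The
function name read_fasta is hypothetical, and keeping only letters
and '*' is a simplifying assumption standing in for "anything that
is not an amino-acid code is ignored".

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of text lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            # A new '>' line closes the previous entry, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # Drop spaces, tabs, digits and punctuation
            # (assumption: residue codes are letters or '*').
            chunks.append("".join(ch for ch in line
                                  if ch.isalpha() or ch == "*"))
    if header is not None:
        yield header, "".join(chunks)
```

Because each entry starts at its own '>' line, concatenating several
sequence files gives a file that this reader (and the library format)
treats as one library.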

     A sequence file should not have lines longer than 120
characters, and sequences entered with word processors should use
a document mode, with normal carriage returns at the end of
lines.

     A different format is required to specify the ordered
peptide mixture for fastf3/tfastf3. For example:

    >mgstm1
    MGCEN,
    MIDYP,
    MLLAY,
    MLLGY

indicates M in the first position of all four peptides (as from
CNBr), G, I, L (twice) in the second position (first cycle),
C, D, L (twice) in the third position, etc.  The commas (,) are
required to indicate the number of fragments in the mixture, but
there should be no comma after the last residue.
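
     The relationship between the comma-separated fragments and the
residues seen at each position can be made concrete with a short
Python sketch.  The function names parse_mixture and
residues_by_position are hypothetical; this is only a way to check a
mixture file by eye, not the code fastf3 uses.

```python
from collections import Counter


def parse_mixture(text):
    """Return (name, fragments) from a fastf3-style mixture entry."""
    lines = [ln.strip() for ln in text.strip().splitlines()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("mixture must start with a '>' line")
    name = lines[0][1:].strip()
    fragments = [f.strip().upper() for f in "".join(lines[1:]).split(",")]
    if any(not f for f in fragments):
        # A comma after the last residue produces an empty fragment.
        raise ValueError("empty fragment: check for a trailing comma")
    return name, fragments


def residues_by_position(fragments):
    """Count the residues observed at each position of the mixture."""
    if len({len(f) for f in fragments}) != 1:
        raise ValueError("fastf3 fragments must all be the same length")
    return [Counter(column) for column in zip(*fragments)]
```

For the mgstm1 example above, the first position contains only M and
the second position contains G, I, and L, with L counted twice.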

     For the fasts3/tfasts3 program, the format is the same,
except that there is no requirement for the peptides to be the
same length.

4.  Statistical Significance

     All the programs in the FASTA3 package attempt to calculate
accurate estimates of the statistical significance of a match.
For fasta3, ssearch3, and fastx3/y3, these estimates are very
accurate (Pearson, 1998, Zhang et al., 1997).  Altschul et al.
(Altschul et al., 1994) provide an excellent review of the
statistics of local similarity scores.  Local sequence similarity
scores follow the extreme value distribution, so that P(s > x) =
1 - exp(-exp(-lambda(x-u))), where u = ln(Kmn)/lambda and m,n are
the lengths of the query and library sequence. This formula can
be rewritten as: 1 - exp(-Kmn exp(-lambda x)), which shows that
the average score for an unrelated library sequence increases
with the logarithm of the length of the library sequence.  The
fasta3 programs use simple linear regression against the log
of the library sequence length to calculate a normalized
"z-score" with mean 50, regardless of library sequence length, and
variance 10. (Several other estimation methods are available with
the -z option.) These z-scores can then be used with the extreme
value distribution and the Poisson distribution (to account for
the fact that each library sequence comparison is an independent
test) to calculate the number of library sequences expected to
obtain a score greater than or equal to the score obtained in the
search.  The original idea and routines to do the linear regression
on library sequence length were provided by Phil Green, U.
Washington.
This version uses a slightly different strategy for fitting the
data than those originally provided by Dr. Green.
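
     The quantities in this section can be sketched in a few lines
of Python.  The code below evaluates the extreme value probability
in both forms given above, scales it by the number of library
sequences, and normalizes scores by an unweighted least-squares fit
against the log of the library sequence length.  It is a simplified
illustration, not the fitting procedure in the fasta3 programs.  All
function names are hypothetical, treating the spread of 10 as a
standard deviation is an assumption of this sketch, and multiplying
the probability by the library size is only the simplest way to get
an expected count.

```python
import math


def evd_pvalue(score, lam, K, m, n):
    """P(s > score) = 1 - exp(-K*m*n*exp(-lam*score))."""
    return -math.expm1(-K * m * n * math.exp(-lam * score))


def evd_pvalue_location_form(score, lam, K, m, n):
    """Same probability written with u = ln(K*m*n)/lam."""
    u = math.log(K * m * n) / lam
    return -math.expm1(-math.exp(-lam * (score - u)))


def expected_hits(pvalue, n_library_sequences):
    """Expected number of library sequences reaching the score."""
    return pvalue * n_library_sequences


def fit_log_length(scores, lengths):
    """Least-squares fit of score = intercept + slope*ln(length)."""
    xs = [math.log(length) for length in lengths]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(scores) / len(scores)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def z_scores(scores, lengths):
    """Length-corrected scores rescaled to mean 50, spread 10."""
    slope, intercept = fit_log_length(scores, lengths)
    residuals = [s - (intercept + slope * math.log(n))
                 for s, n in zip(scores, lengths)]
    sd = math.sqrt(sum(r * r for r in residuals) / len(residuals))
    if sd == 0.0:
        raise ValueError("scores lie exactly on the regression line")
    return [50.0 + 10.0 * r / sd for r in residuals]
```

The two probability functions agree for any choice of lambda and K,
which is the rewriting step described in the text.  The regression
removes the dependence of the average score on ln(length), so the
resulting z-scores can be compared across library sequences of
different lengths.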

     The expected number of sequences is plotted in the histogram
using an "*". Since the parameters for the extreme value
distribution are not calculated directly from the distribution of
similarity scores, the pattern of "*'s" in the histogram gives a
qualitative view of how well the statistical theory fits the
similarity scores calculated by the programs.  For fasta3, if
optimized scores are calculated for each sequence in the database
(the default), the agreement between the actual distribution of
"z-scores" and the expected distribution based on the length
dependence of the score and the extreme value distribution is
usually very good.  Likewise, the distribution of ssearch3 Smith-
Waterman scores typically agrees closely with the  database.lc_seg

     (seg can also be used with some post processing, see
     readme.v33tx.)

-w # Line length (width) = number (<200)

-x # Specify the penalty for a match to an 'X', independently of the
     PAM matrix.  Particularly useful for fastx3/fasty3, where
     termination codons are encoded as 'X'.

-X   Specifies offsets for the beginning of the query and library
     sequence.  For example, if you are comparing upstream
     regions for two genes, and the first sequence contains 500
     nt of upstream sequence while the second contains 300 nt of
     upstream sequence, you might try:

         fasta -X "-500 -300" seq1.nt seq2.nt

     If the -X option is not used, FASTA assumes numbering starts with
     1.  (You should double check to be certain the negative numbering
     works properly.)

-y   Set the width of the band used for calculating "optimized"
     scores.  For proteins and ktup=2, the width is 16.  For
     proteins with ktup=1, the width is 32 by default.  For DNA
     the width is 16.

-z -1,0,1,2,3,4,5
     -z -1 turns off statistical calculations. -z 0 estimates the
     significance of the match from the mean and standard
     deviation of the library scores, without correcting for
     library sequence length.  -z 1 (the default) uses a weighted
     regression of average score vs library sequence length; -z 2
     uses maximum likelihood estimates of Lambda and K; -z 3 uses
     Altschul-Gish parameters (Altschul and Gish, 1996); -z 4 and
     -z 5 use two variations on the -z 1 strategy. -z 1 and -z 2
     are the best methods, in general.

-z 11,12,14,15
     estimate the statistical parameters from shuffled copies of
     each library sequence.  This doubles the time required for a
     search, but allows accurate statistics to be estimated for
     libraries comprised of a single protein family.

-Z db_size 
     set the apparent size of the database to be used when calculating
     expectation E() values.  If you searched a database with 1,000
     sequences, but would like to have the E()-values calculated in
     the context of a 100,000 sequence database, use '-Z 100000'.

-1   sort output by init1 score (for compatibility with FASTP -
     do not use).

-3   translate only three forward frames

For example:

    fasta -w 80 -a seq1.aa seq2.aa

would compare the sequence in seq1.aa to that in seq2.aa and
display the results with 80 residues on an output line, showing
all of the residues in both sequences.  Be sure to enter the
options before entering the file names, or just enter the options
on the command line, and the program will prompt for the file
names.


     (November, 1997) In addition, it is now possible to provide
the fasta programs with the query sequence (fasta, fasty,
ssearch, tfastx), or two sequences (prss, lalign, plalign) from
the unix "stdin" stream.  This makes it much easier to set up
FASTA or PRSS WWW pages.  To specify that stdin be used, rather
than a file, the file name should be specified as '-' or '@' (the
latter file name makes it possible to specify a subset of the
sequence).  Thus:

    cat query.aa | fasta -q @:25-75 s

would take residues 25-75 from query.aa and search the 's'
library (see the discussion of FASTLIBS).  If DNA sequences are
to be read from stdin, the '-n' option must be used, as fasta
cannot check for DNA queries when stdin is used.
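
     Reading the query from stdin is what makes it practical to
drive a search from another program.  The Python sketch below builds
the command line described above and passes the query text on stdin.
It is an illustration under stated assumptions: the function names
are hypothetical, the executable name ("fasta3" here) depends on how
the programs were installed, and the library letter must exist in
your FASTLIBS file.  Only the -q and -n options, the '-' and '@'
file names, and the position of ktup as the third argument are taken
from this document.

```python
import subprocess


def build_search_command(program, library, subset=None, dna=False,
                         quiet=True, ktup=None):
    """Return the argument list for a search that reads stdin."""
    command = [program]
    if quiet:
        command.append("-q")
    if dna:
        # Required for DNA queries, which cannot be detected on stdin.
        command.append("-n")
    if subset is None:
        command.append("-")
    else:
        start, end = subset
        command.append("@:%d-%d" % (start, end))
    command.append(library)
    if ktup is not None:
        # ktup is the third command line argument.
        command.append(str(ktup))
    return command


def run_search(query_text, program="fasta3", library="s", **options):
    """Run the search, sending query_text on stdin; return stdout."""
    command = build_search_command(program, library, **options)
    completed = subprocess.run(command, input=query_text,
                               capture_output=True, text=True,
                               check=True)
    return completed.stdout
```

With program "fasta", library "s", and subset (25, 75), the argument
list is the same as in the cat example above: fasta -q @:25-75 s.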

5.2.  Environment variables

     Because the current version of the program allows the user
to set virtually every option on the command line (except the
ktup, which must be set as the third command line argument), only
the FASTLIBS environment variable is routinely used.

FASTLIBS
     specifies the location of the file which contains the list
     of library descriptions, locations, and library types (see
     section on finding library files).

6.  Frequently Asked Questions

 (1)   Which program should I use? See Table I.

 (2)   How do I search with both DNA strands with fasta3 and
       fastx3? With version 32 of the FASTA program package, all
       searches that use DNA queries (e.g. fasta3, fastx3/y3)
       examine both strands. To revert to earlier FASTA behavior
       - only looking at the forward or reverse strand - use -3
       to search only the forward strand and -i -3 to search only
       the reverse strand.

 (3)   When I search Genbank - the program reports: 0 residues in
       0 sequences.  This typically happens because the program
       does not know that you are searching a Genbank flatfile
       database and is looking for a FASTA format database.  Be
       certain to specify the library type ("1" for Genbank
       flatfile) with the database name.

 (4)   What is the difference between fastx3 and fasty3 (or
       tfastx3 and tfasty3)?  [t]fastx3 uses a simpler codon
       based model for alignments that does not allow frameshifts
       in some codon positions (see ref. (Zhang et al., 1997)).
       tfastx3 is about 30% faster, but tfasty3 can produce
       higher quality alignments in some cases.

 (5)   When I run fasta3 -q, I don't see any (or very little)
       output, but I get lots of scores when I run
       interactively.  With the -q option, the number of high
       scores displayed is limited by the -E # cutoff, which is
       10.0 for protein comparisons, 2.0 for DNA comparisons, and
       5.0 for translated DNA:protein comparisons.  In
       interactive mode (without -q), by default you see 20 high
       scores, regardless of E() value.

 (6)   What is ktup?  All of the programs with fast in their name
       use a computer science method called a lookup table to
       speed the search.  For proteins with ktup=2, this means
       that the program does not look at any sequence alignment
       that does not involve matching two identical residues in
       both sequences.  Likewise with DNA and ktup = 6, the
       initial alignment of the sequences looks for 6 identical
       adjacent nucleotides in both sequences.  Because it is
       less likely that two identical amino-acids will line up by
       chance in two unrelated proteins, this speeds up the
       comparison.  But very distantly related sequences may
       never have two identical residues in a row but will have
       single aligned identities.  In this case, ktup = 1 may
       find alignments that ktup=2 misses.

 (7)   Sometimes, in the list of best scores, the same sequence
       is shown twice with exactly the same score.  Sometimes,
       the sequence is there twice, but the scores are slightly
       different. When any of the fasta3 programs searches a long
       sequence, it breaks the sequence up into overlapping
       pieces.  The length of the piece depends on the length of
       the query and the particular program being used (it can
       also be controlled with the -N #### option).  Since the
       pieces overlap by the length of the query sequence (or
       3*query_length for fastx/y3 and tfasta/x/y3), if the
       highest scoring alignment is at the end of one piece, it
       will be scored again at the beginning of the next piece.
       If the alignment is not completely included in the
       overlap region, one of the pieces will give a higher score
       than the other.  These duplications can be detected by
       looking at the coordinates of the alignment.  If either
       the beginning or end coordinate is identical in two
       alignments, the alignments are at least partially
       duplicates.
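
     The lookup table idea in question (6) can be illustrated with a
short Python sketch.  It indexes every ktup-length word of the query
and then counts, for each diagonal, how many identical words a
library sequence shares with it.  This shows only the word-matching
step, not the full FASTA algorithm (there is no scoring matrix,
joining of regions, or optimization), and the function names are
hypothetical.

```python
from collections import defaultdict


def build_lookup(query, ktup):
    """Map each ktup-length word of the query to its start positions."""
    table = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        table[query[i:i + ktup]].append(i)
    return table


def diagonal_hits(query, library_seq, ktup):
    """Count identical ktup words on each diagonal.

    A diagonal is (library position - query position); many hits on
    one diagonal suggest a region worth aligning.
    """
    table = build_lookup(query, ktup)
    hits = defaultdict(int)
    for j in range(len(library_seq) - ktup + 1):
        for i in table.get(library_seq[j:j + ktup], ()):
            hits[j - i] += 1
    return dict(hits)
```

With the query AXCXE and the library sequence AYCYE, ktup=2 finds no
shared words at all, while ktup=1 finds three identities on the same
diagonal.  This is the situation in which ktup=1 can find alignments
that ktup=2 misses, at the cost of examining many more positions.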

As always, please inform me of bugs as soon as possible.

William R. Pearson
Department of Biochemistry
Box 440, Jordan Hall
U. of Virginia
Charlottesville, VA

wrp@virginia.EDU


7.  References

Altschul, S. F., Boguski, M. S., Gish, W., and Wootton, J. C.
(1994). Issues in searching molecular sequence databases. Nature
Genet. 6,119-129.

Altschul, S. F. and Gish, W. (1996). Local alignment statistics.
Methods Enzymol. 266,460-480.

Bairoch, A. and Apweiler, R. (1996). The Swiss-Prot protein
sequence data bank and its new supplement TrEMBL. Nucleic Acids
Res. 24,21-25.

Barker, W. C., Garavelli, J. S., Haft, D. H., Hunt, L. T.,
Marzec, C. R., Orcutt, B. C., Srinivasarao, G. Y., Yeh, L. S. L.,
Ledley, R. S., Mewes, H. W., Pfeiffer, F., and Tsugita, A.
(1998). The PIR-International Protein Sequence Database. Nucleic
Acids Res. 26,27-32.

Dayhoff, M., Schwartz, R. M., and Orcutt, B. C. (1978). A model
of evolutionary change in proteins. In Atlas of Protein Sequence
and Structure, vol. 5, supplement 3. M. Dayhoff, ed. (Silver
Spring, MD: National Biomedical Research Foundation), pp.
345-352.

Jones, D. T., Taylor, W. R., and Thornton, J. M. (1992). The
rapid generation of mutation data matrices from protein
sequences. Comp. Appl. Biosci. 8,275-282.

Pearson, W. R. and Lipman, D. J. (1988). Improved tools for
biological sequence comparison. Proc. Natl. Acad. Sci. USA
85,2444-2448.

Pearson, W. R. (1995). Comparison of methods for searching
protein sequence databases. Prot. Sci. 4,1145-1160.

Pearson, W. R. (1996). Effective protein sequence comparison.
Methods Enzymol. 266,227-258.

Pearson, W. R. (1998). Empirical statistical estimates for
sequence similarity searches. J. Mol. Biol. 276,71-84.

Pearson, W. R. (1999). Flexible similarity searching with the
FASTA3 program package. In Bioinformatics Methods and Protocols,
S. Misener and S. A. Krawetz, ed. (Totowa, NJ: Humana Press), pp.
185-219.

Smith, T. F. and Waterman, M. S. (1981). Identification of common
molecular subsequences. J. Mol. Biol. 147,195-197.

Wootton, J. C. and Federhen, S. (1993). Statistics of local
complexity in amino acid sequences and sequence databases.
Comput. Chem. 17,149-163.

Zhang, Z., Pearson, W. R., and Miller, W. (1997). Aligning a DNA
sequence with a protein sequence. J. Computational Biology
4,339-349.