Scientific Supercomputing at the NIH

Fasta on Helix

The Fasta program package contains many programs for searching DNA and protein databases and one program (prss) for evaluating statistical significance from randomly shuffled sequences.

The FASTA programs find regions of local or global (new) similarity between protein or DNA sequences, either by searching protein or DNA databases or by identifying local duplications within a sequence. Other programs provide information on the statistical significance of an alignment. Like BLAST, FASTA can be used to infer functional and evolutionary relationships between sequences and to help identify members of gene families.


Programs

Protein searches

* Protein-protein FASTA
* Protein-protein Smith-Waterman (ssearch)
* (New) Global Protein-protein (Needleman-Wunsch) (ggsearch)
* (New) Global/Local protein-protein (glsearch)
* Protein-protein with unordered peptides (fasts)
* Protein-protein with mixed peptide sequences (fastf)

 

Nucleotide searches

* Nucleotide-Nucleotide (DNA/RNA fasta)
* Ordered Nucleotides vs Nucleotide (fastm)
* Un-ordered Nucleotides vs Nucleotide (fasts)

 

Translated searches

* Translated DNA (with frameshifts, e.g. ESTs) vs Proteins (fastx/fasty)
* Protein vs Translated DNA (with frameshifts) (tfastx/tfasty)
* Peptides vs Translated DNA (tfasts)

Statistical Significance

* Protein vs Protein shuffle (prss)
* DNA vs DNA shuffle (prss)
* Translated DNA vs Protein shuffle (prfx)

Local Duplications

* Local Protein alignments (lalign)
* Plot Protein alignment "dot-plot" (plalign)
* Local DNA alignments (lalign)
* Plot DNA alignment "dot-plot" (plalign)

 

Version

At the Helix prompt, type 'fasta' to see the currently installed version, e.g.:

helix% fasta
# fasta
FASTA searches a protein or DNA sequence data bank
version 35.03 Feb. 18, 2008
Please cite:
W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448

How to Use

At the Helix prompt, type the program name. For example, to run an individual Fasta program directly, type 'fastx' or 'tfastx'.

If you need an occasional Fasta search on a few sequences, use the Parallel Fasta search web interface.
For large numbers of sequences (>100), Fasta on Biowulf may be the most efficient.

Sample 1 (interactive session; user input follows each prompt, and an empty reply accepts the default shown in brackets):

% fasta
FASTA searches a protein or DNA sequence data bank
version 35.03 Feb. 18, 2008
Please cite:
W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448
test sequence file name: /data/userID/testfasta.nt
library file name: @/fdb/fastadb/fastanam/ecoli.nt.nam # See database section below
ktup? (1 to 6) [6]
Query: /data/testfasta.nt
1>>>AF436075, 150 bp. 150 nt - 150 nt
Library: @/fdb/fastadb/fastanam/ecoli.nt.nam.. Done!
4662239 residues in 400 sequences
opt E()
< 20 0 0:
22 0 0: one = represents 1 library sequences
24 0 0:
26 0 0:
28 0 0:
30 0 1:*
32 12 2:=*==========
34 7 6:=====*=
36 15 12:===========*===
38 22 20:===================*==
40 18 27:================== *
42 28 33:============================ *
44 30 37:============================== *
46 32 38:================================ *
48 45 36:===================================*=========
50 28 33:============================ *
52 29 29:============================*
54 27 25:========================*==
56 10 21:========== *
58 14 17:============== *
60 15 14:=============*=
62 17 11:==========*======
64 19 9:========*==========
66 11 7:======*====
68 5 5:====*
70 8 4:===*====
72 2 3:==*
74 0 3: *
76 2 2:=*
78 0 2: *
80 0 1:*
82 2 1:*=
84 1 1:*
86 0 1:*
88 0 0:
90 0 0:
92 0 0:
94 0 0:
96 0 0:
98 0 0:
100 1 0:=
102 0 0:
104 0 0:
106 0 0:
108 0 0:
110 0 0:
112 0 0:
114 0 0:
116 0 0:
118 0 0:
>120 0 0:
4662239 residues in 400 sequences
Statistics: Expectation_n fit: rho(ln(x))= 0.7566+/- 0.059; mu= 61.3529+/- 5.524
mean_var=101.5743+/-51.265, 0's: 0 Z-trim: 1 B-trim: 0 in 0/8
Lambda= 0.127257
Kolmogorov-Smirnov statistic: 0.0453 (N=24) at 58
Algorithm: FASTA (3.5 Sept 2006) [optimized]
Parameters: +5/-4 matrix (5:-4) ktup: 6
join: 46, opt: 31, open/ext: -12/-4, width: 16
Scan time: 0.210
Enter filename for results []: re1
How many scores would you like to see? [20]
The best scores are: opt bits E(400)
gi|1790649|gb|AE000492.1|AE000492 Escherichia (10181) [f] 120 30.9 0.36
gi|1789499|gb|AE000393.1|AE000393 Escherichia (10516) [r] 104 28.0 2.7
gi|1787706|gb|AE000241.1|AE000241 Escherichia (10160) [r] 102 27.6 3.6
gi|2367369|gb|AE000500.1|AE000500 Escherichia (11383) [r] 100 27.3 4.1
gi|2367182|gb|AE000382.1|AE000382 Escherichia (11024) [f] 96 26.6 7
gi|2367366|gb|AE000496.1|AE000496 Escherichia (11929) [r] 95 26.5 7.4
gi|1786262|gb|AE000118.1|AE000118 Escherichia (21757) [f] 90 26.3 8.1
gi|1788883|gb|AE000340.1|AE000340 Escherichia (15169) [f] 90 25.9 11
gi|1787764|gb|AE000246.1|AE000246 Escherichia (16338) [r] 88 25.6 13
gi|1787248|gb|AE000203.1|AE000203 Escherichia (10751) [r] 91 25.6 13
gi|1787566|gb|AE000229.1|AE000229 Escherichia (12275) [r] 90 25.6 14
gi|1786580|gb|AE000145.1|AE000145 Escherichia (11448) [f] 90 25.5 14
gi|2367338|gb|AE000477.1|AE000477 Escherichia (11314) [r] 90 25.5 15
gi|2367137|gb|AE000329.1|AE000329 Escherichia (11313) [r] 90 25.5 15
gi|1787509|gb|AE000224.1|AE000224 Escherichia (12963) [r] 89 25.5 15
gi|1788425|gb|AE000300.1|AE000300 Escherichia (16939) [f] 87 25.5 15
gi|1787588|gb|AE000231.1|AE000231 Escherichia (12790) [r] 89 25.5 15
gi|1788338|gb|AE000294.1|AE000294 Escherichia (16032) [r] 86 25.2 17
gi|2367095|gb|AE000113.1|AE000113 Escherichia (13485) [r] 86 25.0 20
gi|1788129|gb|AE000277.1|AE000277 Escherichia (11653) [f] 87 25.0 21
More scores? [0]
Display alignments also? (y/n) [n] y
number of alignments [20]?

150 residues in 1 query sequences
4662239 residues in 400 library sequences
Scomplib [35.03]
start: Wed Jun 18 13:12:27 2008 done: Wed Jun 18 13:14:51 2008
Total Scan time: 0.210 Total Display time: 0.010
Function used was FASTA [version 35.03 Feb. 18, 2008]

Sample 2

% fasta /home/user/YourSeq.nt @/fdb/fastadb/fastanam/ecoli.nt.nam 6 -O /home/user/Out.seq -b 40 -d 40 -H -Q
The above command runs to completion without prompting the user.
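
For a handful of query files, the non-interactive form can be wrapped in a small script. The following is a minimal sketch, not a Helix-provided tool. It assumes a POSIX shell (sh or bash, rather than the csh-style prompt shown in the samples) and query files ending in .nt; the names run_fasta_batch and DRY_RUN and the directory arguments are hypothetical. The library, ktup value, and flags are exactly those used in Sample 2.

```shell
#!/bin/sh
# Sketch: run the Sample 2 command once per query file, without prompts.
# Assumptions (not from the Helix documentation): a POSIX shell, query files
# named *.nt, and the hypothetical names run_fasta_batch and DRY_RUN.

LIB='@/fdb/fastadb/fastanam/ecoli.nt.nam'   # library used in Sample 2
KTUP=6                                      # ktup value used in Sample 2

run_fasta_batch() {
    query_dir=$1
    out_dir=$2
    mkdir -p "$out_dir" || return 1
    for query in "$query_dir"/*.nt; do
        [ -e "$query" ] || continue          # no matching files: nothing to do
        out="$out_dir/$(basename "$query" .nt).out"
        if [ "${DRY_RUN:-0}" = 1 ]; then
            # Print the command instead of running it.
            echo "fasta $query $LIB $KTUP -O $out -b 40 -d 40 -H -Q"
        else
            fasta "$query" "$LIB" "$KTUP" -O "$out" -b 40 -d 40 -H -Q
        fi
    done
}

# Example: preview the commands for every *.nt file under ./queries
# DRY_RUN=1 run_fasta_batch queries results
```

Previewing with DRY_RUN=1 first makes it easy to confirm the paths before any search starts. For more than about 100 sequences, Fasta on Biowulf remains the better choice, as noted under How to Use.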

Databases available on Helix

A large collection of Fasta-format databases is maintained and updated on the Helix Systems. [Database status]. Any individual fasta-format file can be used as a database by entering its full pathname as the 'library file name'. For larger databases that consist of multiple files, such as NCBI nt, we maintain '.nam' files in the directory /fdb/fastadb/fastanam/; each '.nam' file lists all the sections of one database. To use these multi-section databases, prefix the .nam filename with '@' (e.g. @/fdb/fastadb/fastanam/ref.human.protein.nam).

Protein databases

* nr
* swissprot
* pdb
* yeast.aa
* drosoph.aa
* ecoli.aa
* mito.aa
* ref.human.protein
* ref.mouse.protein
* hs_genome.protein
* mouse_genome.protein

Nucleotide databases

* nt
* est_human
* est_mouse
* htgs
* mito.nt
* ecoli.nt
* pdb.nt
* yeast.nt
* drosoph.nt
* ref.human.rna
* ref.mouse.rna
* ref.human.genomic
* ref.other.genomic
* hs_genome
* hs_genome.rna
* mouse_genome
* mouse_genome.rna
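
As an illustration of the '@' convention, a database name from the lists above can be turned into the library argument with a one-line helper. This is a sketch under assumptions: nam_library is a hypothetical name, and the "<name>.nam" pattern is inferred from the examples on this page (ecoli.nt.nam, ref.human.protein.nam), so check that the file exists before relying on it for a given database.

```shell
#!/bin/sh
# Sketch: build the '@'-prefixed library argument for a multi-section database.
# nam_library is a hypothetical helper; the "<name>.nam" naming pattern is
# inferred from the examples on this page and is not guaranteed for every entry.

NAM_DIR=/fdb/fastadb/fastanam

nam_library() {
    printf '@%s/%s.nam\n' "$NAM_DIR" "$1"
}

nam_library ecoli.nt    # prints @/fdb/fastadb/fastanam/ecoli.nt.nam
```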

 

How to run against your own database

You may need to run Fasta against a very specific collection of sequences (e.g. all HIV-related proteins) or against a group of available databases (e.g. yeast + ecoli). The fasta script is set up for these options.

There are 2 types of personal databases:

1. The simplest database is a single file of sequences in fasta format. For example, to run fasta against HIV-2 proteins, you could search the NCBI databases for 'hiv-2' and download all the resulting proteins in fasta format. That file would then be your target database. A fasta-format database file looks like this:

Sample fasta-format database file


>gi|6226318|ref|NC_001234.1| Saccharomyces cerevisiae mitochondrion, complete genome
TTCATAATTAATTTTTTATATATATATTATATTATAATATTAATTTATATTATAAAAATA
>gi|6226515|ref|NC_001233.1| Saccharomyces cerevisiae mitochondrion, complete genome
TTCATAATTAATTTTTTATATATATATTATATTATAATATTAATTTATA
>gi|6227515|ref|NC_001256.1| Saccharomyces cerevisiae mitochondrion, complete genome
TTCATAATTAATTTTTTATATATATATTATATTATAATATTAATTTATA

When prompted for the library file name, enter the full path of this database file:

library file name: /home/username/hiv-2.fas
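
Before searching a downloaded file, it can help to confirm that it really is in fasta format and holds the expected number of sequences. A minimal sketch, assuming a POSIX shell; count_fasta_records is a hypothetical helper name:

```shell
#!/bin/sh
# Sketch: count the sequences in a personal fasta-format database.
# count_fasta_records is a hypothetical helper name.

count_fasta_records() {
    # Every record starts with a header line beginning with '>'.
    grep -c '^>' "$1"
}

# Example (substitute the full path of your own database file):
# count_fasta_records /path/to/your_database.fas
```

A count of 0 usually means the download was saved in some other format.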

2. A collection of files can be searched by setting up a 'library file' that lists all the target databases. Each line names one database file, followed by a format number that describes the format of that file, e.g.

Sample library file gpri.lib:

gpri1.seq 11
gpri2.seq 6
gpri3.seq 0
grod1.seq 5
[...]

When prompted for the library file name, enter the full path of this library file, prefixed with '@'. The '@' sign is needed whenever the file lists other files rather than containing sequences itself. For example:

library file name: @/home/username/gpri.lib
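
A library file of this kind can be written by hand or generated. The sketch below, assuming a POSIX shell, writes one "file format" pair per line and prints the '@'-prefixed value to give as the library file name; write_library is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: write a library file that lists several database files.
# write_library is a hypothetical helper; after the library path, its
# arguments are "file format-number" pairs.

write_library() {
    lib=$1
    shift
    : > "$lib"                                 # create or truncate the file
    while [ $# -ge 2 ]; do
        printf '%s %s\n' "$1" "$2" >> "$lib"   # one database per line
        shift 2
    done
    printf '@%s\n' "$lib"                      # value for 'library file name'
}

# Example reproducing the listed lines of gpri.lib:
# write_library /home/username/gpri.lib gpri1.seq 11 gpri2.seq 6 gpri3.seq 0 grod1.seq 5
```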

Format Number

0 Pearson/FASTA (>SEQID - comment/sequence)
1 Uncompressed Genbank (LOCUS/DEFINITION/ORIGIN)
2 NBRF CODATA (ENTRY/SEQUENCE)
3 EMBL/SWISS-PROT (ID/DE/SQ)
4 Intelligenetics (;comment/SEQID/sequence)
5 NBRF/PIR VMS (>P1;SEQID/comment/sequence)
6 GCG (version 8.0) Unix Protein and DNA (compressed)
11 NCBI Blast1.3.2 format (unix only)
12 NCBI Blast2.0 format (unix only, fasta32t08 or later)
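
A mistyped format number is easy to overlook in a long library file. The following sketch flags any line whose last field is not one of the format numbers listed above. It assumes a POSIX shell with awk and that every non-blank line ends with a format number, as described in this section; check_library_formats is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: report library-file lines whose format number is not in the table.
# check_library_formats is a hypothetical helper. It assumes each non-blank
# line has the form "<file> <format number>".

check_library_formats() {
    awk '
        NF == 0 { next }                        # ignore blank lines
        $NF !~ /^(0|1|2|3|4|5|6|11|12)$/ {
            printf "line %d: unknown format \"%s\"\n", NR, $NF
            bad = 1
        }
        END { exit bad }                        # non-zero status if any line failed
    ' "$1"
}

# Example:
# check_library_formats /home/username/gpri.lib && echo "library file looks OK"
```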

Sample Output

% fasta /home/userID/YourSeq.nt @/fdb/fastadb/fastanam/drosoph.nt.nam 6 -O /home/userID/YourOut.seq -b 40 -d 40 -H -Q
FASTA searches a protein or DNA sequence data bank
version 35.03 Feb. 18, 2008
Please cite:
W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448
AF436075,, 150 nt vs @/gcg/gcgnih/nihscripts/fastanam/month.nt.nam library
265698378 residues in 61815 sequences
statistics sampled from 60000 to 63578 sequences
Expectation_n fit: rho(ln(x))= 6.8625+/-0.000157; mu= 12.6574+/- 0.011
mean_var=163.8188+/-34.364, 0's: 27 Z-trim: 28 B-trim: 94 in 1/75
Lambda= 0.100206
Kolmogorov-Smirnov statistic: 0.0181 (N=29) at 40
FASTA (3.46 Dec 2003) function [optimized, +5/-4 matrix (5:-4)] ktup: 6
join: 46, opt: 31, open/ext: -12/-4, width: 16
The best scores are: opt bits E(61815)
gi|42491463|gb|AC138679.10| Mus musculus chromoso (230184) 268 50.1 3.4e-05
gi|41469033|gb|AC145725.2| Gasterosteus aculeatus (158904) 160 34.4 1.7
gi|41946802|gb|BC066008.1| Mus musculus cDNA clon (3198) 167 34.0 2.3
[...]
>>gi|42491463|gb|AC138679.10| Mus musculus chromosome 5, (230184 nt)
initn: 285 init1: 190 opt: 268 Z-score: 189.0 bits: 50.1 E(): 3.4e-05
banded Smith-Waterman score: 268; 81.928% identity (82.927% ungapped) in 83 nt overlap (9-91:25787-25868)

                                     10        20        30
AF4360                       AGCCAGCGAACCTCCCAGCAAAACCAGCAGAAGAAGCT
                                     :: :::::::: :: ::::: :::::::::
gi|424 CAGTGCAACCCTGTTTATTTCCTGTCAAGAAATCTCCCAGCCAAGCCAGCTGAAGAAGCT
      25760     25770     25780     25790     25800     25810

       40        50        60         70        80        90
AF4360 CAGAAACACAGACAGCAGTACGAAG-AAATGGTGGTTCAAGCAAAAAAAAGAGAACTTTT
       :: :: :::::::::::::: :::: : :::::: : :: :: :: ::: :::
gi|424 CAAAAGCACAGACAGCAGTATGAAGAAGATGGTGCTGCAGGCCAAGAAACGAGGTACTGT
      25820     25830     25840     25850     25860     25870
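
When results are written to a file with -O, the "best scores" table can be filtered by expectation value. The sketch below assumes a POSIX shell with awk, hit lines that begin with "gi|", and the E() value as the last field of each hit line, as in the sample output; significant_hits is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: list hits from the "best scores" table with E() at or below a cutoff.
# significant_hits is a hypothetical helper. It assumes hit lines begin with
# "gi|" and that the E() value is the last field on each hit line.

significant_hits() {
    cutoff=$1
    file=$2
    awk -v cutoff="$cutoff" '
        /^The best scores are:/ { in_table = 1; next }   # table header
        in_table && /^>>/       { in_table = 0 }         # alignments begin
        in_table && /^gi\|/ && ($NF + 0) <= (cutoff + 0) { print $1, $NF }
    ' "$file"
}

# Example: hits with E() <= 0.001 in the file written by -O
# significant_hits 0.001 /home/userID/YourOut.seq
```

Each output line holds the sequence identifier followed by its E() value. For the sample output shown here, a cutoff of 0.001 would keep only the first hit (E() of 3.4e-05).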

Documentation

fasta documentation

http://fasta.bioch.virginia.edu/