NCBI C Toolkit Cross Reference

C/doc/blast/formatdb.html


  1 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  2     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
  3 
  4 <html xmlns="http://www.w3.org/1999/xhtml">
  5   <head>
  6     <meta name="generator"
  7     content="HTML Tidy for Linux/x86 (vers 1st October 2002), see www.w3.org" />
  8 
  9     <title>Formatdb README</title>
 10   </head>
 11 
 12   <body>
 13 <pre>
 14 Formatdb README
 15 ------------------
 16 
 17 Table of Contents
 18 
 19     Introduction
 20  
 21     Command Line Options
 22 
 23     Configuration File
 24  
 25     Formatdb Notes/Troubleshooting
 26       
 27             A The -o option and identifiers
 28             B "SORTFiles failed" message
 29             C Formatting large FASTA files
 30             D Piping a database to formatdb without uncompressing
 31             E Creating custom databases.
 32             F General troubleshooting tips.
 33             G "SeqIdParse Failure" error
 34             H "FileOpen" error
 35 
 36     Appendix 1: The Files Produced by Formatdb      
 37 
 38 
 39 
 40 Introduction
 41 ------------
 42 Formatdb must be used in order to format protein or nucleotide source
 43 databases before these databases can be searched by blastall, blastpgp
 44 or MegaBLAST. The source database may be in either FASTA or ASN.1
 45 format.  Although the FASTA format is most often used as input to
 46 formatdb, the use of ASN.1 is  advantageous for those who are using
 47 ASN.1 as the common source for other formats such as the GenBank
 48 report. Once a source database file has been formatted by formatdb it
 49 is not needed by BLAST. Please note that formatdb does not create
 50 non-redundant blast databases.
 51 
 52 If you are having problems formatting a BLAST database, please scroll
 53 down to the "Formatdb Notes/Troubleshooting" section below, or contact
 54 the BLAST Desk at blast-help@ncbi.nlm.nih.gov
 55 
 56 
 57 Command Line Options
 58 --------------------
 59 A list of the command line options and the current version for formatdb may 
 60 be obtained by executing formatdb with "-" as its only argument, as in:
 61 
 62     formatdb -
 63 
 64 The formatdb options are summarized below:
 65 
 66 formatdb 2.2.11   arguments:
 67 
 68   -t  Title for database file [String]  Optional
 69 
 70   -i  Input file(s) for formatting (this parameter must be set) [File In]
 71 
 72   -l  Logfile name: [File Out]  Optional default = formatdb.log
 73 
 74   -p  Type of file
 75          T - protein   
 76          F - nucleotide [T/F]  Optional
 77     default = T
 78 
 79   -o  Parse options
 80          T - True: Parse SeqId and create indexes.
 81          F - False: Do not parse SeqId. Do not create indexes.
 82  [T/F]  Optional default = F
 83 
 84     If the "-o" option is TRUE (and the source database is in FASTA
 85     format), then the database identifiers in the FASTA definition
 86     line must follow the convention of the FASTA Defline Format.
 87     Please see section "E) Note on creating custom databases"
 88     and the "FASTA Defline Format" section below.
 89 
 90   -a  Input file is database in ASN.1 format (otherwise FASTA is expected)
 91          T - True, 
 92          F - False.
 93  [T/F]  Optional default = F
 94 
 95   -b  ASN.1 database in binary mode
 96          T - binary, 
 97          F - text mode.
 98  [T/F]  Optional default = F
 99 
100     A source ASN.1 database may be represented in two formats -
101     ascii text and binary. The "-b" option, if TRUE, specifies that
102     input ASN.1 database is in binary format. The option is ignored
103     in case of FASTA input database.
104 
105   -e  Input is a Seq-entry [T/F]  Optional default = F
106 
107     A source ASN.1 database (either text ascii or binary) may
108     contain a Bioseq-set or just one Bioseq. In the latter case the
109     "-e" switch should be set to TRUE.
110 
111   -n  Base name for BLAST files [String]  Optional
112 
113     This option allows one to produce BLAST databases with a
114     different name than that of the original FASTA file.  For
115     instance, one could have a file named 'ecoli.nuc.txt' and
116     format it as 'ecoli':
117 
118         formatdb -i ecoli.nuc.txt -p F -o T -n ecoli
119 
120         uncompress -c nr.z | formatdb -i stdin -o T -n nr
121 
122     This can be used in situations where the original FASTA file is
123     not required other than by formatdb.  This can help in a
124     situation where disk-space is tight.
125 
126   -v  Database volume size in millions of letters [Integer]  Optional
127     default = 0
128     range from 0 to &lt;NULL&gt;
129 
130     This option breaks up large FASTA files into 'volumes' (each
131     with a maximum  size of 2 billion letters).  As part of the
132     creation of a volume formatdb  writes a new type of BLAST
133     database file, called an  alias file, with the  extension 'nal'
134     or 'pal'.
135 
136   -s  Create indexes limited only to accessions - sparse [T/F]  Optional
137     default = F
138 
139     This option limits the indices for the string identifiers (used
140     by formatdb)  to accessions (i.e., no locus names).  This is
141     especially useful for sequences sets  like the EST's where the
142     accession and locus names are identical.  Formatdb runs  faster
143     and produces smaller temporary files if this option is used.
144     It is strongly  recommended for EST's, STS's, GSS's, and
145     HTGS's.
146 
147   -V  Verbose: check for non-unique string ids in the database [T/F]  Optional
148     default = F
149 
150   -L  Create an alias file with this name
151         use the gifile arg (below) if set to calculate db size
152         use the BLAST db specified with -i (above) [File Out]  Optional
153 
154     This option produces a BLAST database alias file using a specified
155     database, but limiting the sequences searched to those in the GI list
156     given by the -F argument.  See the section "Note on creating an alias file 
157     for a GI list" for more information.
158 
159   -F  Gifile (file containing list of gi's) [File In]  Optional
160 
161     This option can be used to specify the GI list for the alias file
162     construction (-L option above) or to produce a binary GI list if
163     the -B option (below) is set.
164 
165   -B  Binary Gifile produced from the Gifile specified above [File Out]  Optional
166     This option specifies the name of a binary GI list file.  This option should
167     be used with the -F option.  A text GI list may be specified with the -F
168     option and the -B option will produce that GI list in binary format.  The
169     binary file is smaller and BLAST does not need to convert it, so it can
170     be read faster.  
171 
172   -T  Taxid file to set the taxonomy ids in ASN.1 deflines [File In]  Optional
173 
174     This option specifies a text file containing Seq-id string/numeric taxonomy
175     id pairs, separated by a single white space character (or tab), one per 
176     line. Gi numbers can also be used in place of Seq-id strings. Examples:
177 
178     % cat seqid-taxid.txt
179     lcl|hmm271 4                                                               
180     lcl|hmm273 6                                                               
181     lcl|hmm276 9                                                               
182     % cat gi-taxid.txt
183     129295 9031                                                         
184     129296 9031
185     68738 9031 
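
The two-column layout of the taxid file can be checked before formatdb is
run. The following is a minimal Python sketch, not part of the toolkit; the
function name parse_taxid_lines is hypothetical, and the sketch accepts any
run of whitespace between the two columns, which may be more lenient than
formatdb itself.

```python
def parse_taxid_lines(lines):
    """Return (identifier, taxid) pairs from the lines of a -T taxid file.

    Assumes the layout described above: an identifier (Seq-id string or
    gi number), whitespace, and a numeric taxonomy id on each line.
    """
    pairs = []
    for lineno, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines
        fields = line.split()
        if len(fields) != 2 or not fields[1].isdigit():
            raise ValueError(f"line {lineno}: expected 'id taxid', got {raw!r}")
        pairs.append((fields[0], int(fields[1])))
    return pairs
```

For a file on disk, the open file object can be passed directly, since
iterating over it yields one line at a time.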
186 
187 Configuration File
188 ------------------
189 Starting from formatdb version 4, we have added a configuration file to allow
190 flexibility in specifying the membership and link bits to set in the ASN.1
191 defline structures. This feature is available by recompiling the formatdb
192 binary with the following compile time flag: -DSET_ASN1_DEFLINE_BITS.
193 The membership bit arrays are used to help distinguish sequences that belong 
194 to a subset database (e.g.: pdb, swissprot) in a non-redundant database 
195 (e.g.: nr). The link bit arrays are used to identify which sequences should 
196 have a user specified "link out" in the blast (html) report. These features 
197 are still under development and useful within NCBI only. A sample 
198 configuration file follows:
199 
200 ; .formatdbrc: formatdb configuration file
201 ;
202 ; This information is needed for the new database format, to set
203 ; the links and membership information of the structured deflines.
204 
205 ;;;;;;;;;;;;;;;;;;; Memberships section ;;;;;;;;;;;;;;;;;;;;;;
206 ; This section determines what the bits mean in the membership bit array.
207 ; These must be unique positive integers starting with 1.
208 ; When adding a new entry, remember to update the value of TotalNum  
209 ; This information is used to distinguish database subsets (e.g.: pdb,
210 ; swissprot) in non-redundant databases (e.g.: nr).
211 [MembershipBitNumbers]
212 TotalNum = 3
213 1  = swissprot
214 2  = pdb
215 3  = refseq_protein
216 
217 ;;;;;;;;;;;;;;;;;;;;;;;; Links section ;;;;;;;;;;;;;;;;;;;;;;;;;;
218 ; This section determines what the bits mean in the links bit array.
219 ; These must be unique positive integers starting with 1.
220 ; When adding a new entry, remember to update the value of TotalNum and 
221 ; add the appropriate file paths for each database in the LinkFiles 
222 ; section.
223 ; This is used for the link out feature of the blast result formatter.
224 [LinkBitNumbers]
225 TotalNum = 4   ; total number of bits used so far
226 1 = LocusLink
227 2 = UniGene
228 3 = Structure
229 4 = Geo
230 
231 ; These are the paths to the files containing the
232 ; gi's whose links will be modified. The format is
233 ; &lt;link_type&gt; = &lt;file_path&gt;
234 ; where link_type is one of the types of links defined in the LinkBitNumbers
235 ; section.
236 [LinkFiles]
237 LocusLink = /home/joe/locus_gis.txt
238 UniGene   = /dev/null
239 Structure = /dev/null
240 Geo       = /home/john/geo.txt
241 ; EOF
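
The TotalNum bookkeeping described in the comments above is easy to get
wrong when entries are added. A minimal Python sketch of a consistency check
follows. It assumes that Python's configparser reads this ini-style layout
the same way formatdb does; the names SAMPLE_FORMATDBRC and check_formatdbrc
are hypothetical.

```python
import configparser

# Sample taken from the configuration file above, with most comments removed.
SAMPLE_FORMATDBRC = """\
[MembershipBitNumbers]
TotalNum = 3
1  = swissprot
2  = pdb
3  = refseq_protein

[LinkBitNumbers]
TotalNum = 4   ; total number of bits used so far
1 = LocusLink
2 = UniGene
3 = Structure
4 = Geo

[LinkFiles]
LocusLink = /home/joe/locus_gis.txt
UniGene   = /dev/null
Structure = /dev/null
Geo       = /home/john/geo.txt
"""


def check_formatdbrc(text):
    """Return a list of inconsistencies found in a .formatdbrc-style text."""
    parser = configparser.ConfigParser(inline_comment_prefixes=(";",))
    parser.optionxform = str  # keep the case of keys such as TotalNum
    parser.read_string(text)
    problems = []
    link_names = []
    for section in ("MembershipBitNumbers", "LinkBitNumbers"):
        entries = dict(parser[section])
        total = int(entries.pop("TotalNum"))
        bits = sorted(int(key) for key in entries)
        # Bits must be exactly 1..TotalNum.
        if bits != list(range(1, total + 1)):
            problems.append(f"{section}: bits {bits} do not match TotalNum = {total}")
        if section == "LinkBitNumbers":
            link_names = list(entries.values())
    # Every link type needs a file path in the LinkFiles section.
    for name in link_names:
        if name not in parser["LinkFiles"]:
            problems.append(f"LinkFiles: no file path for {name}")
    return problems
```

An empty list means the three sections agree with each other; it says
nothing about whether the listed files exist.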
242 
243 Formatdb Notes/Troubleshooting:
244 -------------------------------
245 
246 A.) Note on -o option:
247 
248 It is always advantageous to use the '-o' option if the database
249 identifiers are in the format specified at
250 ftp://ftp.ncbi.nih.gov/blast/db/README.  If the database identifiers
251 are in this parseable format,  formatdb produces additional indices
252 allowing retrieval from the databases by identifier. The databases on
253 the NCBI FTP site contain parseable identifiers. It is sufficient if
254 the  first word on the FASTA definition line is a unique identifier
255 (e.g.,  "&gt;3091 Alcoho de..."). It is necessary to use parseable
256 identifiers for the following  cases:
257 
258 1.) ASN.1 is to be produced from blastall or blastpgp, then "-o" must
259 be TRUE.
260 
261 2.) query-anchored alignments are desired (i.e., the '-m' option with a
262 non-zero value is used).
263 
264 3.) The gi's are desired as part of the output (i.e., '-I' is used).
265 
266 4.) fastacmd will be used to fetch sequences from the database by
267 accession or gi.
268 
269 See Appendix 1: The Files Produced by Formatdb for more information 
270 on the -o T option.
271 
272 
273 B.) Note on "SORTFiles failed" message:
274 
275 Formatdb will use the 'standard' temporary directory to sort the string
276 indices on disk. Under UNIX this directory is often /var/tmp and if
277 there is not enough space there, then the error message: "ERROR:
278 [000.000] SORTFiles failed" will be issued.  This can be avoided by
279 setting the TMPDIR environment variable to a partition with more free
280 space.  This message may also often be avoided by using the sparse
281 option (-s) for formatdb described above.
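
Choosing a temporary directory with enough free space can be scripted. The
following Python sketch is an illustration only: the function names are
hypothetical, and the amount of space formatdb needs for sorting is not
stated here, so the caller has to supply an estimate.

```python
import os
import shutil


def pick_tmpdir(candidates, needed_bytes):
    """Return the first existing directory with enough free space, else None."""
    for path in candidates:
        if os.path.isdir(path) and shutil.disk_usage(path).free >= needed_bytes:
            return path
    return None


def env_with_tmpdir(path):
    """Return a copy of the current environment with TMPDIR set to path."""
    env = dict(os.environ)
    env["TMPDIR"] = path
    return env
```

The mapping returned by env_with_tmpdir can be passed as the env argument
of subprocess.run when formatdb is started from a script.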
282 
283 
284 C.) Note on formatting large (4 Gig and larger) FASTA files:
285 
286 A single BLAST database can contain up to 4 billion letters.  If one
287 wishes to format a FASTA file containing more letters than this, several
288 databases, each of a maximum size of 4 billion bases, will be produced.
289 This will be done automatically if the -v argument is not set.  One may 
290 also specify a smaller size for the volume databases by using the -v option:
291 
292 formatdb -i hugefasta -p F -v 2000000000
293 
294 This command line will format the "hugefasta" FASTA file as a number of
295 database "volumes," each containing a maximum of two billion base
296 pairs, as specified by the "-v" option. Two billion is the current
297 limitation on the NCBI toolkit command-line parser. The volumes will
298 have names consisting of the root database name, "hugefasta" followed
299 by a two-digit volume extension, followed by the usual BLAST database
300 extensions. These smaller databases can be searched as if they were a
301 single entity using:
302 
303 blastall -i infile -d hugefasta -p blastn -o out
304 
305 In this case, BLAST recognizes that the database "hugefasta" has been
306 partitioned into several volumes because it detects a file with the
307 name of the root database followed by an extension of "nal" (for
308 protein databases, the extension is "pal"). This file specifies a
309 database list to be searched when the root database name is specified
310 to BLAST. BLAST sequentially searches each database listed in this
311 "nal" file and generates output that is indistinguishable from that of
312 a single database search. A sample "nal" file, resulting from
313 formatting the datafile "hugefasta" into three volumes, is given below.
314 The "DBLIST" line can also be edited to specify additional databases to
315 be searched.
316 
317 #
318 # Alias file created Tue Jan 18 13:12:24 2000
319 #
320 #
321 TITLE hugefasta
322 #
323 DBLIST hugefasta.00 hugefasta.01 hugefasta.02
324 #
325 #GILIST
326 #
327 #OIDLIST
328 #
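
The volume naming described above (root name plus a two-digit extension)
can be sketched in Python. This is an estimate, not formatdb's own logic:
it assumes every volume is filled to the limit, whereas formatdb splits on
sequence boundaries, and it assumes that a database fitting in one volume
keeps its plain root name. The function name volume_names is hypothetical.

```python
def volume_names(root, total_letters, letters_per_volume):
    """Return the database names expected after splitting into volumes."""
    # Integer ceiling division: number of volumes if each one is filled.
    count = max(1, (total_letters + letters_per_volume - 1) // letters_per_volume)
    if count == 1:
        return [root]
    return [f"{root}.{index:02d}" for index in range(count)]
```

With a hypothetical input of 5 billion letters and 2 billion letters per
volume, the result matches the three names on the DBLIST line above.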
329 
330 The "nal" and "pal" files can also be used to simplify searches of
331 multiple databases created separately. For instance, a file called
332 "multi.nal" containing the following lines could be created from
333 scratch using a text editor.
334 
335 #
336 # Alias file created Tue Jan 18 13:12:24 2000
337 #
338 #
339 TITLE multi
340 #
341 DBLIST part1 part2 part3
342 #
343 #GILIST
344 #
345 #OIDLIST
346 #
347 
348 The "multi.nal" file would allow the three databases, "part1", "part2",
349 and "part3", to be searched by specifying a single database name,
350 "multi", on the blastall command line as follows:
351 
352 blastall -i infile -d multi -p blastn -o out
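
Instead of a text editor, a short script can generate such an alias file.
The Python sketch below reproduces the layout of the samples above; it
assumes that lines beginning with "#" are treated as comments, and the
function name alias_file_text is hypothetical.

```python
def alias_file_text(title, dblist, gilist=None):
    """Build the text of a .nal or .pal alias file in the layout shown above."""
    lines = [
        "#",
        "#",
        f"TITLE {title}",
        "#",
        "DBLIST " + " ".join(dblist),
        "#",
        # GILIST stays commented out unless a GI list file is given.
        f"GILIST {gilist}" if gilist else "#GILIST",
        "#",
        "#OIDLIST",
        "#",
    ]
    return "\n".join(lines) + "\n"
```

Writing the result of alias_file_text("multi", ["part1", "part2", "part3"])
to "multi.nal" gives a file equivalent to the hand-made one above, apart
from the creation date comment.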
353 
354 The reason for using database volumes, as opposed to simply making the
355 indices in the BLAST databases large enough to handle all conceivable
356 databases  with an eight-byte 'integer', is that this would have
357 doubled the size of the indices  for all searches no matter how small
358 the database.  Hence very large FASTA files  are broken down into a
359 couple of databases.
360 
361 Formatdb must be able to open files larger than 2 Gig in order to work
362 on very large files.  This is not a problem on a 64-bit OS or on
363 certain 32-bit OSes that allow binaries to be made large-file aware.
364 The 32-bit Solaris formatdb binary on the NCBI FTP site is now compiled
365 large file aware.
366 
367 
368 D.) Note on running formatdb on a database without uncompressing it:
369 
370 Under UNIX it is possible to uncompress a database on the fly and pipe
371 it to formatdb. This can reduce the disk-space needed for running
372 formatdb on a large database.  In addition, some operating systems
373 cannot write files larger than 2 Gig to disk.  To circumvent this on
374 Unix or Linux systems, use a "pipe" system such as:
375 
376 uncompress -c nt.Z|formatdb -i stdin -o T -p F -n "nt" -v 100000000
377 
378 In this case, no file is written which is larger than 1 Gig and an
379 arbitrarily large database is formatted as a set of 1 Gig volumes.
380 Note the use of the '-n' option that specifies the name of the
381 resulting BLAST database.  Note also that 'stdin' specifies that input
382 will be coming from 'standard input'.  The nt FASTA file is not needed
383 for running BLAST searches and nt.Z may be deleted after formatdb has
384 been run.
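
The same pipe can be set up from a script. The Python sketch below is an
illustration under stated assumptions: the input is gzip-compressed (the
Python standard library has no reader for the .Z format handled by
uncompress), formatdb is on the PATH, and both function names are
hypothetical. Only options shown in this document are used.

```python
import gzip
import shutil
import subprocess


def formatdb_stdin_command(name, protein, volume_size):
    """Build a formatdb argument list that reads FASTA from standard input."""
    return ["formatdb", "-i", "stdin", "-o", "T",
            "-p", "T" if protein else "F",
            "-n", name, "-v", str(volume_size)]


def stream_gzip_to(command, gz_path):
    """Decompress gz_path on the fly into command's stdin; return its exit code."""
    with gzip.open(gz_path, "rb") as source:
        process = subprocess.Popen(command, stdin=subprocess.PIPE)
        try:
            # Copy in chunks so the uncompressed data never touches the disk.
            shutil.copyfileobj(source, process.stdin)
        finally:
            process.stdin.close()
        return process.wait()
```

A call such as stream_gzip_to(formatdb_stdin_command("nt", False, 100000000),
"nt.gz") would then format a hypothetical nt.gz file as the nucleotide
database "nt".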
385 
386 
387 E) Note on creating custom databases
388 
389 With Standalone BLAST it is possible to take any custom file of FASTA
390 sequences and use these as a database source file for searching. All
391 BLAST database source files must be in FASTA format. In order to use
392 the formatdb option -o T, especially for use with NCBI toolkit retrieval
393 tools, the FASTA defline must follow a specific format (see "FASTA Defline Format").
394 
395 F) Note on creating an alias file for a GI list.
396 
397 Formatdb can now produce a BLAST database alias file that specifies a (real)
398 BLAST database to search as well as a GI list to limit the search.  This
399 is useful if one often searches a subset of a database (e.g., based
400 on organism or a curated list).  The alias file makes the search appear as
401 if one were searching a real database rather than the subset of one.
402 The procedure to produce an alias file for searching (protein) nr limiting it to a
403 list of zebrafish GI's would be:
404 
405 1.) obtain the list of zebrafish GI's from Entrez or some other source and
406 keep it in a file called "zebrafish.gi.in".
407 
408 2.) invoke formatdb to convert the text GI list to the more efficient binary format:
409 
410 formatdb -F zebrafish.gi.in -B zebrafish.gi
411 
412 3.) invoke formatdb with the following options:
413 
414 formatdb -i nr -p T -L zebrafish -F zebrafish.gi -t "My zebrafish database"
415 
416 This will produce the alias file zebrafish.pal listing the database title, the
417 real database to be searched, the GI file, and some statistics:
418 
419 #
420 # Alias file created Thu Jul  5 15:04:29 2001
421 #
422 #
423 TITLE My zebrafish database
424 #
425 DBLIST nr
426 #
427 GILIST zebrafish.gi
428 #
429 #OIDLIST
430 #
431 NSEQ 1836
432 LENGTH 640724
433 
434 
435 One can search this by invoking (for example):
436 
437 blastall -p blastp -d zebrafish -i MYQUERY -o MYOUTPUT
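
An alias file such as the one above can be inspected with a few lines of
Python. This is a sketch only: it assumes that every line not starting
with "#" is a keyword followed by its value, and the names SAMPLE_ALIAS
and parse_alias_file are hypothetical.

```python
SAMPLE_ALIAS = """\
#
# Alias file created Thu Jul  5 15:04:29 2001
#
TITLE My zebrafish database
#
DBLIST nr
#
GILIST zebrafish.gi
#
#OIDLIST
#
NSEQ 1836
LENGTH 640724
"""


def parse_alias_file(text):
    """Return the keyword/value pairs of an alias file, skipping '#' lines."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        keyword, _, value = line.partition(" ")
        fields[keyword] = value.strip()
    # DBLIST may name several databases separated by spaces.
    if "DBLIST" in fields:
        fields["DBLIST"] = fields["DBLIST"].split()
    return fields
```

Values other than DBLIST are returned as strings, so NSEQ and LENGTH need
an int() conversion before any arithmetic.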
438 
439 NOTE: One may wish to prepare the alias file in one directory, but move it
440 to a different production directory that does not contain the real database.
441 In that case you may use the '-n' option to specify a path to the real
442 database in the production environment.  In the example below the -n option is 
443 used to specify that the nr database can really be found at a relative path
444 of ../../newest_blast/blast
445 
446 formatdb -i nr -n ../../newest_blast/blast/nr -p T -L zebrafish -F zebrafish.gi -t "My zebrafish database"
447 
448 and the alias file will be:
449 
450 #
451 # Alias file created Wed Nov 28 13:55:41 2001
452 #
453 #
454 TITLE My zebrafish database
455 #
456 DBLIST ../../newest_blast/blast/nr
457 #
458 GILIST zebrafish.gi
459 #
460 #OIDLIST
461 #
462 NSEQ 1836
463 LENGTH 640724
464 
465 
466 Notes on Version 4 of the BLAST databases
467 -----------------------------------------
468 
469 Version 4 of the BLAST databases addresses some important
470 shortcomings of the previous (version 3) databases:
471 
472 1.) Version 3 does not handle ambiguity characters correctly if a 
473 database sequence is longer than about 16 million bases which may 
474 lead to incorrect results.  The new version does.
475 
476 2.) Version 3 only allows one volume of a BLAST database to
477 contain at most about 4 billion bases.  The new databases allow that to
478 be much larger.
479 
480 
481 PLEASE NOTE THAT VERSION 3 OF THE BLAST DATABASES WILL NO LONGER BE SUPPORTED.
482 
483 
484 The new databases keep the sequence descriptors in a structured format
485 (ASN.1) and some new information has been put into those fields.  The new
486 information is:
487 
488 1.) taxid.  This integer specifies the taxonomy of the sequence and will
489 allow greater flexibility in how taxonomic information is presented in a
490 future version of BLAST.
491 
492 2.) link bits.  These specify whether LinkOut information about the database
493 sequence is available and permit the addition of a gif with a link to the
494 relevant page in a future version of BLAST.
495 
496 3.) membership bits.  These specify that a given gi in a database also belongs to
497 a subset database.  An example of this relationship is the EST's database.  Est
498 contains all EST's, but also comprises est_human, est_mouse and est_others; with
499 the new membership bit it will be possible to search any of the subset est databases
500 with only the main est database and two other small files (an alias file and an "oidlist").
501 This can reduce the amount of disk-space and memory needed by half in this case.
502 It is expected that the proper alias file and oidlist for searching such subsets will
503 be made available on the NCBI FTP site in mid January.
504 
505 
506 
507 FASTA Defline Format
508 --------------------
509 The syntax of FASTA Deflines used by the NCBI BLAST server depends on
510 the database from which each sequence was obtained.  The table below lists
511 the identifiers for the databases from which the sequences were derived.
512  
513 
514   Database Name                         Identifier Syntax
515 
516   GenBank                               gb|accession|locus
517   EMBL Data Library                     emb|accession|locus
518   DDBJ, DNA Database of Japan           dbj|accession|locus
519   NBRF PIR                              pir||entry
520   Protein Research Foundation           prf||name
521   SWISS-PROT                            sp|accession|entry name
522   Brookhaven Protein Data Bank          pdb|entry|chain
523   Patents                               pat|country|number 
524   GenInfo Backbone Id                   bbs|number 
525   General database identifier           gnl|database|identifier
526   NCBI Reference Sequence               ref|accession|locus
527   Local Sequence identifier             lcl|identifier
528  
529 
530 For example, an identifier might be "gb|M73307|AGMA13GT", where the "gb" tag
531 indicates that the identifier refers to a GenBank sequence, "M73307" is its
532 GenBank ACCESSION, and "AGMA13GT" is the GenBank LOCUS.  
533 
534 "gi" identifiers are being assigned by NCBI for all sequences contained
535 within NCBI's sequence databases.  The 'gi' identifier provides a uniform
536 and stable naming convention whereby a specific sequence is assigned
537 its unique gi identifier.  If a nucleotide or protein sequence changes,
538 however, a new gi identifier is assigned, even if the accession number
539 of the record remains unchanged. Thus gi identifiers provide a mechanism
540 for identifying the exact sequence that was used or retrieved in a
541 given search.
542 
543 For searches of the nr protein database where the sequences are derived
544 from conceptual translations of sequences from the nucleotide databases
545 the following syntax is used:
546 
547                      gi|gi_identifier
548 
549 An example would be:
550 
551         gi|451623           (U04987) env [Simian immunodeficiency...
552 
553 where '451623' is the gi identifier and the 'U04987' is the accession
554 number of the nucleotide sequence from which it was derived.
555 
556 Users are encouraged to use the '-I' option for Blast output which will
557 produce a header line with the gi identifier concatenated with the database
558 identifier of the database from which it was derived, for example, from a
559 nucleotide database:
560 
561         gi|176485|gb|M73307|AGMA13GT
562 
563 And similarly for protein databases: 
564 
565         gi|129295|sp|P01013|OVAX_CHICK
566 
567 The gnl ('general') identifier allows databases not on the above list to
568 be identified with the same syntax.  An example here is the PID identifier:
569 
570         gnl|PID|e1632
571 
572 PID stands for Protein-ID; the "e" (in e1632) indicates that this ID
573 was issued by EMBL.  As mentioned above, use of the "-I" option
574 produces the NCBI gi (in addition to the PID) which users can also
575 retrieve on.
576 
577 The bar ("|") separates different fields as listed in the above table.
578 In some cases a field is left empty, even though the original
579 specification called for including this field.  To make
580 these identifiers backwards-compatible for older parsers
581 the empty field is denoted by an additional bar ("||").
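
The field layout in the table above is enough to split a bar-separated
identifier string mechanically. The Python sketch below is not the
toolkit's SeqId parser: the field counts are read off the table (plus the
gi form described in the text), and the names FIELD_COUNTS and
split_identifiers are hypothetical.

```python
from itertools import islice

# Number of bar-separated fields that follow each tag, taken from the table
# above; "gi" comes from the gi|gi_identifier syntax described in the text.
FIELD_COUNTS = {
    "gi": 1, "gb": 2, "emb": 2, "dbj": 2, "pir": 2, "prf": 2, "sp": 2,
    "pdb": 2, "pat": 2, "bbs": 1, "gnl": 2, "ref": 2, "lcl": 1,
}


def split_identifiers(id_string):
    """Split a bar-separated identifier string into (tag, fields) tuples."""
    tokens = iter(id_string.split("|"))
    result = []
    for tag in tokens:
        if tag not in FIELD_COUNTS:
            raise ValueError(f"unknown identifier tag {tag!r} in {id_string!r}")
        # Consume exactly as many fields as the tag calls for.
        fields = list(islice(tokens, FIELD_COUNTS[tag]))
        if len(fields) != FIELD_COUNTS[tag]:
            raise ValueError(f"too few fields after {tag!r} in {id_string!r}")
        result.append((tag, fields))
    return result
```

An empty field, as in "pir||entry", comes back as an empty string, so the
position of each field is preserved.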
582 
583 BLAST Databases without GI's or GenBank accessions
584 --------------------------------------------------
585 Some BLAST users wish to format a FASTA file of sequences that do not
586 contain NCBI ID's such as accessions or GI numbers.  This may be the
587 case if the sequences have not yet been submitted to GenBank or are
588 proprietary.  If the only goal is to perform BLAST searches and format
589 a standard BLAST report then the simplest solution is to not set the
590 "-o" option when running formatdb and indices for the identifiers will
591 not be constructed.  If one wishes to produce XML or ASN.1 output or
592 wants to fetch sequences by ID with fastacmd, it is necessary to
593 observe a few simple rules when constructing the ID's.  These rules are
594 necessary to ensure that the ID's can be reliably parsed to make the ID
595 indices.
596 
597 1.) ID's of type "local" or "general" should be used.  This means that
598 the ID's will have the syntax "lcl|IDENTIFIER" (for "local") or
599 "gnl|DATABASE|IDENTIFIER" (for "general").  The tokens DATABASE and
600 IDENTIFIER should be assigned by the user here.  The local ID has only
601 one user provided token, the general ID requires two.  The fields are
602 separated by vertical bars ("|").
603 
604 2.) Letters, numbers, underscores ("_"), dashes, and periods may be
605 used.  Uppercase and lowercase letters are treated as being distinct.
606 No spaces are allowed in the ID; a space marks the end of the ID.
607 
608 3.) All ID's should be unique, if the entire ID is examined.  As an
609 example consider the following four ID's:
610 
611 gnl|H.sapiens|seq1
612 gnl|H.sapiens|seq2
613 gnl|M.Mus|seq1
614 lcl|seq1
615 
616 All of these ID's are considered unique.  The first two might be
617 sequences one and two of a collection of Human sequences; the third
618 might be the first sequence in a collection of mouse sequences; the
619 fourth is simply identified as the first sequence.
620 
621 ID's must fit into the framework described above to ensure that they
622 can be reliably parsed and indices built for them.  This means that it
623 is not possible for users to invent new ID formats on the fly.
624 Examples of illegal ID's would be:
625 
626 H.sapiens|seq1
627 gnl|H.sapiens|seq1|A
628 
629 The first identifier is missing a database tag (e.g., no "lcl"), the second
630 identifier has an extra field since vertical bars separate fields.  Illegal
631 ID's will not be processed by formatdb if the "-o" option is used.
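
The three rules above can be checked before formatdb is run with "-o T".
The following Python sketch encodes them as a regular expression plus a
uniqueness test; it is an illustration of the rules as written here, not
the check formatdb performs, and the function names are hypothetical.

```python
import re

# Rule 2: letters, digits, underscores, dashes and periods only.
_TOKEN = r"[A-Za-z0-9_.\-]+"
# Rule 1: lcl|IDENTIFIER or gnl|DATABASE|IDENTIFIER.
_CUSTOM_ID = re.compile(rf"(?:lcl\|{_TOKEN}|gnl\|{_TOKEN}\|{_TOKEN})")


def is_valid_custom_id(identifier):
    """Return True if identifier has the local or general ID syntax."""
    return _CUSTOM_ID.fullmatch(identifier) is not None


def find_problem_ids(identifiers):
    """Return (identifier, reason) pairs for invalid or repeated IDs."""
    seen = set()
    problems = []
    for identifier in identifiers:
        if not is_valid_custom_id(identifier):
            problems.append((identifier, "invalid syntax"))
        elif identifier in seen:
            problems.append((identifier, "duplicate"))  # rule 3
        seen.add(identifier)
    return problems
```

The comparison is case-sensitive, in line with rule 2, so "lcl|Seq1" and
"lcl|seq1" count as different ID's.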
632 
633 
634 F. General troubleshooting tips. 
635 
636 The Latest Version: Make sure you are using the latest version of the
637 formatdb executable. Earlier versions of formatdb may not recognize
638 changes in the ASN.1 or FASTA definition line format of current BLAST
639 databases or other sources of NCBI sequences.
640 
641 
642 G. Troubleshooting "SeqIdParse Failure" errors
643 
644 The most frequent cause of SeqIdParse Failure errors is the
645 syntax of the FASTA definition lines in the source database file. Many
646 third parties do not follow the syntax given in the "FASTA Defline
647 Format" section. A SeqIdParse error can often be eliminated by formatting
648 the database with the -o option set to FALSE (e.g. -o F). The -o option
649 is really not important for BLAST searching unless you are going to use
650 the results to parse out the identifiers for searching Entrez and
651 downloading the sequences. If you need to use the -o T option then your
652 best option is to examine the definition lines of the database sequences
653 and attempt to make them conform to the FASTA syntax.
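
Examining the definition lines is easier with a small script that lists
the identifier at the start of each one. The Python sketch below pulls out
the first word of every definition line and reports repeats; it is an
illustration only, and the function names are hypothetical.

```python
from collections import Counter


def defline_ids(fasta_lines):
    """Return the first word of every FASTA definition line (without '>')."""
    ids = []
    for line in fasta_lines:
        if line.startswith(">"):
            words = line[1:].split()
            ids.append(words[0] if words else "")
    return ids


def repeated_ids(fasta_lines):
    """Return identifiers found on more than one definition line, sorted."""
    counts = Counter(defline_ids(fasta_lines))
    return sorted(identifier for identifier, n in counts.items() if n > 1)
```

An empty string in the output of defline_ids marks a definition line with
no identifier at all.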
654 
655 
656 H. Troubleshooting "FileOpen" errors.
657 
658 This error occurs when the formatdb program cannot find the /data
659 subdirectory. When you download and extract the Standalone BLAST
660 executables, the formatdb program is located in the same directory as
661 the /data subdirectory. If either of these moves, formatdb will not
662 function without a .ncbirc file telling it where the /data subdirectory
663 resides. Create a text file in the same directory as formatdb that
664 contains the following lines:
665 
666 [NCBI] 
667 Data="path/data/"
668 
669 Where "path/data/" is the path to the location of the Standalone BLAST
670 "data" subdirectory. For Example:
671 
672 Data=/root/blast/data
673 
674 For PC's this would be 
675 
676 [NCBI] 
677 Data="C:\path\data\"
678 
679 Where "C:\path\data\" is the path to the location of the Standalone
680 BLAST "data" subdirectory. For example:
681 
682 Data=C:\blast\data
683 
684 For Macintosh this would be a simpletext file called "ncbi.cnf", placed in 
685 the System folder, that contains:
686 
687 [BLAST]
688 BLASTDB=root:blast:data
689 
690 Where "root:blast:data" is the path to the location of the Standalone
691 BLAST "data" subdirectory. For example on the machine named "LabMac": 
692 
693 BLASTDB=LabMac:blast:data
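
For the UNIX and PC cases the two-line file can also be written by a
script. The Python sketch below follows the unquoted form of the examples
above (Data=/root/blast/data); the function names are hypothetical.

```python
import os


def ncbirc_text(data_dir):
    """Return the contents of a .ncbirc file pointing at data_dir."""
    return f"[NCBI]\nData={data_dir}\n"


def write_ncbirc(directory, data_dir):
    """Write .ncbirc into directory and return the path of the new file."""
    path = os.path.join(directory, ".ncbirc")
    with open(path, "w") as handle:
        handle.write(ncbirc_text(data_dir))
    return path
```

Here directory is the directory that holds the formatdb executable and
data_dir is the location of the Standalone BLAST "data" subdirectory.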
694 
695 
696 Appendix 1: The Files Produced by Formatdb
697 ------------------------------------------
698 Using formatdb without the "-o T" indexing option results in three
699 BLAST database files (.nhr, .nin, .nsq).  Using the "-o T" option will result 
700 in additional files. 
701 If gi's are present in the FASTA definition lines of the source file, there 
702 will be four additional files created (.nsd, .nsi, .nni, .nnd). These are 
703 ISAM indices for mapping a sequence identifier to a particular sequence in 
704 the BLAST database. If gi's are not used there will be only two additional 
705 files created (.nsd, .nsi).  
706 These files are listed below for both an example nucleotide and a protein 
707 sequence database. The actual sequence data is stored in the files with 
708 extension "nsq" or "psq".  The compression ratio for these files is 
709 about 4:1 for nucleotides and 1:1 for protein sequence source files.
710 
711 Extension    Content        Format        
712 ---------------------------------------------
713 Nucleotide database formatted without "-o T"
714                         
715 nhr          deflines       binary    
716                     
717 nin          indices        binary       
718 
719 nsq        sequence data    binary        
720 
721 
722 Nucleotide database formatted with "-o T" add these ISAM files:
723 
724 nnd          GI data        binary        
725 
726 nni          GI indices     binary        
727 
728 nsd          non-GI data    binary        
729         
730 nsi         non-GI indices  binary        
731 
732 The formatdb index files involving deflines are small relative to the
733 source database due to entries such as the one below in which the
734 defline is much shorter than the sequence.
735 
736 &gt;gi|5819095|ref|NC_001321.1| Balaenoptera physalus mitochondrion, complete genome
737 GTTAATTACTAATCAGCCCATGATCATAACATAACTGAGGTTTCATACATTTGGTATTTTTTTATTTTTTTTGGGGGGCT
738 TGCACGGACTCCCCTATGACCCTAAAGGGTCTCGTCGCAGTCAGATAAATTGTAGCTGGGCCTGGATGTATTTGTTATTT
739 GACTAGCACAACCAACATGTGCAGTTAAATTAATGGTTACAGGACATAGTACTCCACTATTCCCCCCGGGCTCAAAAAAC
740 TGTATGTCTTAGAGGACCAAACCCCCCTCCTTCCATACAATACTAACCCTCTGCTTAGATATTCACCACCCCCCTAGACA
741 GGCTCGTCCCTAGATTTAAAAGCCATTTTATTTATAAATCAATACTAAATCTGACACAAGCCCAATAATGAAAATACATG
742 AACGCCATCCCTATCCAATACGTTGATGTAGCTTAAACACTTACAAAGCAAGACACTGAAAATGTCTAGATGGGTCTAGC
743 CAACCCCATTGACATTAAAGGTTTGGTCCCAGCCTTTCTATTAGTTCTTAACAGACTTACACATGCAAGTATCCACATCC
744 CAGTGAGAACGCCCTCTAAATCATAAAGATTAAAAGGAGCGGGTATCAAGCACGCTAGCACTAGCAGCTCACAACGCCTC
745 GCTTAGCCACGCCCCCACGGGACACAGCAGTGATAAAAATTAAGCTATAAACGAAAGTTCGACTAAGTCATGTTAATTTA
746 ....16398 bp total
747 
748 
749 Extension    Content        Format            
750 ------------------------------------------
751 Protein database formatted without "-o T"
752 
753 phr        deflines         binary    
754                     
755 pin        indices          binary       
756 
757 psq        sequence data    binary        
758 
759 Protein database formatted with "-o T" add these ISAM files:
760 
761 pnd        GI data          binary        
762 
763 pni        GI indices       binary        
764 
765 psd        non-GI data      binary        
766         
767 psi        non-GI indices   binary        
768 
769 N.B.: The pre-formatted protein BLAST databases distributed by NCBI
770 (nr.*tar.gz and pataa.*tar.gz) contain a couple of extra files with the
771 extensions .ppi and .ppd. These are ISAM index and data files for looking
772 up entries in the database using PIG as the key (see fastacmd -P option).
773 These files cannot be generated by the stand alone formatdb binary included in
774 this distribution.
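
The file lists in this appendix can be condensed into a small lookup, for
example to verify that a formatdb run produced everything expected. The
Python sketch below covers only the extensions listed in this appendix
(not .nal, .pal, .ppi or .ppd), and the function name is hypothetical.

```python
def expected_extensions(protein, parse_ids=False, has_gis=False):
    """Return the file extensions formatdb is expected to produce.

    protein   -- True for a protein database, False for nucleotide
    parse_ids -- True if the "-o T" option was used
    has_gis   -- True if the FASTA definition lines contain gi's
    """
    prefix = "p" if protein else "n"
    suffixes = ["hr", "in", "sq"]          # deflines, indices, sequence data
    if parse_ids:
        suffixes += ["sd", "si"]           # non-GI ISAM data and indices
        if has_gis:
            suffixes += ["nd", "ni"]       # GI ISAM data and indices
    return [prefix + suffix for suffix in suffixes]
```

Each returned extension is appended to the database base name, as in
"ecoli.nhr".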
775 
776 The formatdb index files involving deflines are large relative to the
777 source database due to entries such as the one below in which the
778 defline is much longer than the sequence.
779 
780 &gt;gi|229659|pdb|1AAP|A Chain A, Protease Inhibitor Domain Of Alzheimer's
781 Amyloid Beta-Protein Precursor (APPI)gi|229660|pdb|1AAP|B Chain B,
782 Protease Inhibitor Domain Of Alzheimer's Amyloid Beta-Protein Precursor
783 (APPI)
784 VREVCSEQAETGPCRAMISRWYFDVTEGKCAPFFYGGCGGNRNNFDTEEYCMAVCGSA
785 
786 
787 DISCLAIMER: The internal structure of the BLAST databases is subject
788 to change with little or no notice.  The readdb API should be
789 used to extract data from the BLAST databases.  Readdb is part
790 of the NCBI toolkit (ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools/),
791 readdb.h contains a list of supported function calls.  
792 
793 Last updated August 10 2006
794 </pre>
795   </body>
796 </html>
797 
