<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <meta name="generator"
  content="HTML Tidy for Linux/x86 (vers 1st October 2002), see www.w3.org" />

  <title></title>
</head>

<body>
<pre>
Formatdb README
------------------

Table of Contents

   Introduction

   Command Line Options

   Configuration File

   Formatdb Notes/Troubleshooting

      A  The -o option and identifiers
      B  "SORTFiles failed" message
      C  Formatting large FASTA files
      D  Piping a database to formatdb without uncompressing
      E  Creating custom databases.
      F  General troubleshooting tips.
      G  "SeqIdParse Failure" error
      H  "FileOpen" error

   Appendix 1: The Files Produced by Formatdb



Introduction
------------
Formatdb must be used in order to format protein or nucleotide source
databases before these databases can be searched by blastall, blastpgp
or MegaBLAST. The source database may be in either FASTA or ASN.1
format. Although the FASTA format is most often used as input to
formatdb, the use of ASN.1 is advantageous for those who are using
ASN.1 as the common source for other formats such as the GenBank
report. Once a source database file has been formatted by formatdb it
is not needed by BLAST. Please note that formatdb does not create
non-redundant blast databases.

If you are having problems formatting a BLAST database please scroll
down to the "Formatdb Notes/Troubleshooting" section below.
Or contact the BLAST Desk at blast-help@ncbi.nlm.nih.gov


Command Line Options
--------------------
A list of the command line options and the current version for formatdb may
be obtained by executing formatdb without options, as in:

   formatdb -

The formatdb options are summarized below:

formatdb 2.2.11 arguments:

  -t  Title for database file [String]  Optional

  -i  Input file(s) for formatting (this parameter must be set) [File In]

  -l  Logfile name: [File Out]  Optional  default = formatdb.log

  -p  Type of file
         T - protein
         F - nucleotide [T/F]  Optional
    default = T

  -o  Parse options
         T - True: Parse SeqId and create indexes.
         F - False: Do not parse SeqId. Do not create indexes.
      [T/F]  Optional  default = F

      If the "-o" option is TRUE (and the source database is in FASTA
      format), then the database identifiers in the FASTA definition
      line must follow the convention of the FASTA Defline Format.
      Please see section "E) Note on creating custom databases"
      below.

  -a  Input file is database in ASN.1 format (otherwise FASTA is expected)
         T - True,
         F - False.
      [T/F]  Optional  default = F

  -b  ASN.1 database in binary mode
         T - binary,
         F - text mode.
      [T/F]  Optional  default = F

      A source ASN.1 database may be represented in two formats -
      ascii text and binary. The "-b" option, if TRUE, specifies that
      the input ASN.1 database is in binary format. The option is
      ignored in the case of a FASTA input database.

  -e  Input is a Seq-entry [T/F]  Optional  default = F

      A source ASN.1 database (either text ascii or binary) may
      contain a Bioseq-set or just one Bioseq. In the latter case the
      "-e" switch should be set to TRUE.

  -n  Base name for BLAST files [String]  Optional

      This option allows one to produce BLAST databases with a
      different name than that of the original FASTA file.
      For instance, one could have a file named 'ecoli.nuc.txt' and
      format it as 'ecoli':

        formatdb -i ecoli.nuc.txt -p F -o T -n ecoli

        uncompress -c nr.z | formatdb -i stdin -o T -n nr

      This can be used in situations where the original FASTA file is
      not required other than by formatdb. This can help in a
      situation where disk-space is tight.

  -v  Database volume size in millions of letters [Integer]  Optional
    default = 0
    range from 0 to <NULL>

      This option breaks up large FASTA files into 'volumes' (each
      with a maximum size of 2 billion letters). As part of the
      creation of a volume formatdb writes a new type of BLAST
      database file, called an alias file, with the extension 'nal'
      or 'pal'.

  -s  Create indexes limited only to accessions - sparse [T/F]  Optional
    default = F

      This option limits the indices for the string identifiers (used
      by formatdb) to accessions (i.e., no locus names). This is
      especially useful for sequence sets like the EST's where the
      accession and locus names are identical. Formatdb runs faster
      and produces smaller temporary files if this option is used.
      It is strongly recommended for EST's, STS's, GSS's, and
      HTGS's.

  -V  Verbose: check for non-unique string ids in the database [T/F]  Optional
    default = F

  -L  Create an alias file with this name
        use the gifile arg (below) if set to calculate db size
        use the BLAST db specified with -i (above) [File Out]  Optional

      This option produces a BLAST database alias file using a specified
      database, but limiting the sequences searched to those in the GI list
      given by the -F argument. See the section "Note on creating an alias file
      for a GI list" for more information.

  -F  Gifile (file containing list of gi's) [File In]  Optional

      This option can be used to specify the GI list for the alias file
      construction (-L option above) or to produce a binary GI list if
      the -B option (below) is set.

  -B  Binary Gifile produced from the Gifile specified above [File Out]  Optional
      This option specifies the name of a binary GI list file. This option should
      be used with the -F option. A text GI list may be specified with the -F
      option and the -B option will produce that GI list in binary format. The
      binary file is smaller and BLAST does not need to convert it, so it can
      be read faster.

  -T  Taxid file to set the taxonomy ids in ASN.1 deflines [File In]  Optional

      This option specifies a text file containing Seq-id string/numeric taxonomy
      id pairs, separated by a single white space character (or tab), one per
      line. Gi numbers can also be used in place of Seq-id strings. Examples:

      % cat seqid-taxid.txt
      lcl|hmm271 4
      lcl|hmm273 6
      lcl|hmm276 9
      % cat gi-taxid.txt
      129295 9031
      129296 9031
      68738 9031

Configuration File
------------------
Starting from formatdb version 4, we have added a configuration file to allow
flexibility in specifying the membership and link bits to set in the ASN.1
defline structures. This feature is available by recompiling the formatdb
binary with the following compile time flag: -DSET_ASN1_DEFLINE_BITS.
The membership bit arrays are used to help distinguish sequences that belong
to a subset database (e.g.: pdb, swissprot) in a non-redundant database
(e.g.: nr). The link bit arrays are used to identify which sequences should
have a user specified "link out" in the blast (html) report. These features
are still under development and useful within NCBI only.
A sample configuration file follows:

; .formatdbrc: formatdb configuration file
;
; This information is needed for the new database format, to set
; the links and membership information of the structured deflines.

;;;;;;;;;;;;;;;;;;; Memberships section ;;;;;;;;;;;;;;;;;;;;;;
; This section determines what the bits mean in the membership bit array.
; These must be unique positive integers starting with 1.
; When adding a new entry, remember to update the value of TotalNum
; This information is used to distinguish database subsets (e.g.: pdb,
; swissprot) in non-redundant databases (e.g.: nr).
[MembershipBitNumbers]
TotalNum = 3
1 = swissprot
2 = pdb
3 = refseq_protein

;;;;;;;;;;;;;;;;;;;;;;;; Links section ;;;;;;;;;;;;;;;;;;;;;;;;;;
; This section determines what the bits mean in the links bit array.
; These must be unique positive integers starting with 1.
; When adding a new entry, remember to update the value of TotalNum and
; adding the appropriate file paths for each database in the LinkFiles
; section.
; This is used for the link out feature of the blast result formatter.
[LinkBitNumbers]
TotalNum = 4 ; total number of bits used so far
1 = LocusLink
2 = UniGene
3 = Structure
4 = Geo

; These are the paths to the files containing the
; gi's whose links will be modified. The format is
; <link_type> = <file_path>
; where link_type is one of the types of links defined in the LinkBitNumbers
; section.
[LinkFiles]
LocusLink = /home/joe/locus_gis.txt
UniGene = /dev/null
Structure = /dev/null
Geo = /home/john/geo.txt
; EOF

Formatdb Notes/Troubleshooting:
-------------------------------

A.) Note on -o option:

      It is always advantageous to use the '-o' option if the database
      identifiers are in the format specified at
      ftp://ftp.ncbi.nih.gov/blast/db/README.
      If the database identifiers
      are in this parseable format, formatdb produces additional indices
      allowing retrieval from the databases by identifier. The databases on
      the NCBI FTP site contain parseable identifiers. It is sufficient if
      the first word on the FASTA definition line is a unique identifier
      (e.g., ">3091 Alcoho de..."). It is necessary to use parseable
      identifiers for the following cases:

      1.) ASN.1 is to be produced from blastall or blastpgp, then "-o" must
      be TRUE.

      2.) query-anchored alignments are desired (i.e., the '-m' option with a
      non-zero value is used).

      3.) The gi's are desired as part of the output (i.e., '-I' is used).

      4.) fastacmd will be used to fetch sequences from the database by
      accession or gi.

      See Appendix 1: The Files Produced by Formatdb for more information
      on the -o T option.


B.) Note on "SORTFiles failed" message:

      Formatdb will use the 'standard' temporary directory to sort the string
      indices on disk. Under UNIX this directory is often /var/tmp and if
      there is not enough space there, then the error message: "ERROR:
      [000.000] SORTFiles failed" will be issued. This can be avoided by
      setting the TMPDIR environment variable to a partition with more free
      space. This message may also often be avoided by using the sparse
      option (-s) for formatdb described above.


C.) Note on formatting large (4 Gig and larger) FASTA files:

      A single BLAST database can contain up to 4 billion letters. If one
      wishes to formatdb a FASTA file containing more letters than this, several
      databases, each of a maximum size of 4 billion bases, will be produced.
      This will be done automatically if the -v argument is not set.
      One may
      also specify a smaller size for the volume databases by using the -v option:

        formatdb -i hugefasta -p F -v 2000000000

      This command line will format the "hugefasta" FASTA file as a number of
      database "volumes," each containing a maximum of two billion base
      pairs, as specified by the "-v" option. Two billion is the current
      limitation on the NCBI toolkit command-line parser. The volumes will
      have names consisting of the root database name, "hugefasta" followed
      by a two-digit volume extension, followed by the usual BLAST database
      extensions. These smaller databases can be searched as if they were a
      single entity using:

        blastall -i infile -d hugefasta -p blastn -o out

      In this case, BLAST recognizes that the database "hugefasta" has been
      partitioned into several volumes because it detects a file with the
      name of the root database followed by an extension of "nal" (for
      protein databases, the extension is "pal"). This file specifies a
      database list to be searched when the root database name is specified
      to BLAST. BLAST sequentially searches each database listed in this
      "nal" file and generates output that is indistinguishable from that of
      a single database search. A sample "nal" file, resulting from
      formatting the datafile "hugefasta" into three volumes, is given below.
      The "DBLIST" line can also be edited to specify additional databases to
      be searched.

        #
        # Alias file created Tue Jan 18 13:12:24 2000
        #
        #
        TITLE hugefasta
        #
        DBLIST hugefasta.00 hugefasta.01 hugefasta.02
        #
        #GILIST
        #
        #OIDLIST
        #

      The "nal" and "pal" files can also be used to simplify searches of
      multiple databases created separately. For instance, a file called
      "multi.nal" containing the following lines could be created from
      scratch using a text editor.

        #
        # Alias file created Tue Jan 18 13:12:24 2000
        #
        #
        TITLE multi
        #
        DBLIST part1 part2 part3
        #
        #GILIST
        #
        #OIDLIST
        #

      The "multi.nal" file would allow the three databases, "part1", "part2",
      and "part3", to be searched by specifying a single database name,
      "multi", on the blastall command line as follows:

        blastall -i infile -d multi -p blastn -o out

      The reason for using database volumes, as opposed to simply making the
      indices in the BLAST databases large enough to handle all conceivable
      databases with an eight-byte 'integer', is that this would have
      doubled the size of the indices for all searches no matter how small
      the database. Hence very large FASTA files are broken down into a
      couple of databases.

      Formatdb must be able to open files larger than 2 Gig in order to work
      on very large files. This is not a problem on a 64-bit OS or on
      those 32-bit OSes that allow binaries to be made large-file aware.
      The 32-bit Solaris formatdb binary on the NCBI FTP site is now compiled
      large file aware.


D.) Note on running formatdb on a database without uncompressing it:

      Under UNIX it is possible to uncompress a database on the fly and pipe
      it to formatdb. This can reduce the disk-space needed for running
      formatdb on a large database. In addition, some operating systems
      cannot write files larger than 2 Gig to disk. To circumvent this on
      Unix or Linux systems, use a "pipe" system such as:

        uncompress -c nt.Z|formatdb -i stdin -o T -p F -n "nt" -v 100000000

      In this case, no file is written which is larger than 1 Gig and an
      arbitrarily large database is formatted as a set of 1 Gig volumes.
      Note the use of the '-n' option that specifies the name of the
      resulting BLAST database. Note also that 'stdin' specifies that input
      will be coming from 'standard input'.
      The nt FASTA file is not needed
      for running BLAST searches and nt.Z may be deleted after formatdb has
      been run.


E) Note on creating custom databases

      With Standalone BLAST it is possible to take any custom file of FASTA
      sequences and use these as a database source file for searching. All
      BLAST database source files must be in FASTA format. In order to use
      the formatdb option -o T, especially for use with NCBI toolkit retrieval
      tools, the FASTA defline must follow a specific format.

F) Note on creating an alias file for a GI list.

      Formatdb can now produce a BLAST database alias file that specifies a (real)
      BLAST database to search as well as a GI list to limit the search. This
      is useful if one often searches a subset of a database (e.g., based
      on organism or a curated list). The alias file makes the search appear as
      if one were searching a real database rather than the subset of one.
      The procedure to produce an alias file for searching (protein) nr limiting it to a
      list of zebrafish GI's would be:

      1.) obtain the list of zebrafish GI's from Entrez or some other source and
      keep it in a file called "zebrafish.gi.in".

      2.) invoke formatdb to convert the text GI list to the more efficient binary format:

        formatdb -F zebrafish.gi.in -B zebrafish.gi

      3.) invoke formatdb with the following options:

        formatdb -i nr -p T -L zebrafish -F zebrafish.gi -t "My zebrafish database"

      This will produce the alias file zebrafish.pal listing the database title, the
      real database to be searched, the GI file, and some statistics:

        #
        # Alias file created Thu Jul 5 15:04:29 2001
        #
        #
        TITLE My zebrafish database
        #
        DBLIST nr
        #
        GILIST zebrafish.gi
        #
        #OIDLIST
        #
        NSEQ 1836
        LENGTH 640724


      One can search this by invoking (for example):

        blastall -p blastp -d zebrafish -i MYQUERY -o MYOUTPUT

      NOTE: One may wish to prepare the alias file in one directory, but move it
      to a different production directory that does not contain the real database.
      In that case you may use the '-n' option to specify a path to the real
      database in the production environment. In the example below the -n option is
      used to specify that the nr database can really be found at a relative path
      of ../../newest_blast/blast

        formatdb -i nr -n ../../newest_blast/blast/nr -p T -L zebrafish -F zebrafish.gi -t "My zebrafish database"

      and the alias file will be:

        #
        # Alias file created Wed Nov 28 13:55:41 2001
        #
        #
        TITLE My zebrafish database
        #
        DBLIST ../../newest_blast/blast/nr
        #
        GILIST zebrafish.gi
        #
        #OIDLIST
        #
        NSEQ 1836
        LENGTH 640724


Notes on Version 4 of the BLAST databases
-----------------------------------------

Version 4 of the BLAST databases addresses some important
shortcomings of the previous (version 3) databases:

1.) Version 3 does not handle ambiguity characters correctly if a
database sequence is longer than about 16 million bases which may
lead to incorrect results. The new version does.

2.) Version 3 only allows one volume of a BLAST database to
contain at most about 4 billion bases.
The new databases allow that to
be much larger.


PLEASE NOTE THAT VERSION 3 OF THE BLAST DATABASES WILL NO LONGER BE SUPPORTED.


The new databases keep the sequence descriptors in a structured format
(ASN.1) and some new information has been put into those fields. The new
information is:

1.) taxid. This integer specifies the taxonomy of the sequence and will
allow greater flexibility in how taxonomic information is presented in a
future version of BLAST.

2.) link bits. These specify whether LinkOut information about the database
sequence is available and permit the addition of a gif with a link to the
relevant page in a future version of BLAST.

3.) membership bits. These specify that a given gi in a database also belongs to
a subset database. An example of this relationship is the EST's database. Est
contains all EST's, but also comprises est_human, est_mouse and est_others; with
the new membership bit it will be possible to search any of the subset est databases
with only the main est database and two other small files (an alias file and an "oidlist").
This can reduce the amount of disk-space and memory needed by half in this case.
It is expected that the proper alias file and oidlist for searching such subsets will
be made available on the NCBI FTP site in mid January.



FASTA Defline Format
--------------------
The syntax of FASTA Deflines used by the NCBI BLAST server depends on
the database from which each sequence was obtained. The table below lists
the identifiers for the databases from which the sequences were derived.


  Database Name                         Identifier Syntax

  GenBank                               gb|accession|locus
  EMBL Data Library                     emb|accession|locus
  DDBJ, DNA Database of Japan           dbj|accession|locus
  NBRF PIR                              pir||entry
  Protein Research Foundation           prf||name
  SWISS-PROT                            sp|accession|entry name
  Brookhaven Protein Data Bank          pdb|entry|chain
  Patents                               pat|country|number
  GenInfo Backbone Id                   bbs|number
  General database identifier           gnl|database|identifier
  NCBI Reference Sequence               ref|accession|locus
  Local Sequence identifier             lcl|identifier


For example, an identifier might be "gb|M73307|AGMA13GT", where the "gb" tag
indicates that the identifier refers to a GenBank sequence, "M73307" is its
GenBank ACCESSION, and "AGMA13GT" is the GenBank LOCUS.

"gi" identifiers are being assigned by NCBI for all sequences contained
within NCBI's sequence databases. The 'gi' identifier provides a uniform
and stable naming convention whereby a specific sequence is assigned
its unique gi identifier. If a nucleotide or protein sequence changes,
however, a new gi identifier is assigned, even if the accession number
of the record remains unchanged. Thus gi identifiers provide a mechanism
for identifying the exact sequence that was used or retrieved in a
given search.

For searches of the nr protein database where the sequences are derived
from conceptual translations of sequences from the nucleotide databases
the following syntax is used:

        gi|gi_identifier

An example would be:

        gi|451623 (U04987) env [Simian immunodeficiency...

where '451623' is the gi identifier and the 'U04987' is the accession
number of the nucleotide sequence from which it was derived.
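
As an illustration of the identifier syntax tabulated above, the following
Python sketch splits a bar-separated identifier string into its tagged parts.
It is not part of formatdb or the NCBI toolkit. The function name parse_seqid,
the layout of the FIELDS table, and the field name "unused" (for the empty
first field of pir and prf identifiers) are assumptions made for this example,
and the real parser may accept or reject strings differently.

```python
# Field names for each identifier tag, taken from the table above.
# "unused" is a hypothetical name for the empty field in pir||entry
# and prf||name; "gi" is the gi|gi_identifier form.
FIELDS = {
    "gb": ("accession", "locus"),
    "emb": ("accession", "locus"),
    "dbj": ("accession", "locus"),
    "pir": ("unused", "entry"),
    "prf": ("unused", "name"),
    "sp": ("accession", "entry_name"),
    "pdb": ("entry", "chain"),
    "pat": ("country", "number"),
    "bbs": ("number",),
    "gnl": ("database", "identifier"),
    "ref": ("accession", "locus"),
    "lcl": ("identifier",),
    "gi": ("gi_identifier",),
}


def parse_seqid(token):
    """Split one identifier string (the first word of a defline, without
    the leading '>') into a list of (tag, {field: value}) pairs."""
    parts = token.split("|")
    ids = []
    i = 0
    while i < len(parts):
        tag = parts[i]
        if tag not in FIELDS:
            raise ValueError("unknown identifier tag: %r" % tag)
        names = FIELDS[tag]
        values = parts[i + 1 : i + 1 + len(names)]
        if len(values) != len(names):
            raise ValueError("too few fields after tag %r" % tag)
        ids.append((tag, dict(zip(names, values))))
        # Skip the tag and its fields; a further tag may follow.
        i += 1 + len(names)
    return ids


if __name__ == "__main__":
    print(parse_seqid("gb|M73307|AGMA13GT"))
    print(parse_seqid("gi|451623"))
```

Each tag consumes a fixed number of fields, so an identifier with a missing
tag or an extra trailing field raises ValueError in this sketch rather than
being silently accepted.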

Users are encouraged to use the '-I' option for Blast output which will
produce a header line with the gi identifier concatenated with the database
identifier of the database from which it was derived, for example, from a
nucleotide database:

        gi|176485|gb|M73307|AGMA13GT

And similarly for protein databases:

        gi|129295|sp|P01013|OVAX_CHICK

The gnl ('general') identifier allows databases not on the above list to
be identified with the same syntax. An example here is the PID identifier:

        gnl|PID|e1632

PID stands for Protein-ID; the "e" (in e1632) indicates that this ID
was issued by EMBL. As mentioned above, use of the "-I" option
produces the NCBI gi (in addition to the PID) which users can also
retrieve on.

The bar ("|") separates different fields as listed in the above table.
In some cases a field is left empty, even though the original
specification called for including this field. To make
these identifiers backwards-compatible for older parsers
the empty field is denoted by an additional bar ("||").

BLAST Databases without GI's or GenBank accessions
--------------------------------------------------
Some BLAST users wish to format a FASTA file of sequences that do not
contain NCBI ID's such as accessions or GI numbers. This may be the
case if the sequences have not yet been submitted to GenBank or are
proprietary. If the only goal is to perform BLAST searches and format
a standard BLAST report then the simplest solution is to not set the
"-o" option when running formatdb and indices for the identifiers will
not be constructed. If one wishes to produce XML or ASN.1 output or
wants to fetch sequences by ID with fastacmd, it is necessary to
observe a few simple rules when constructing the ID's. These rules are
necessary to ensure that the ID's can be reliably parsed to make the ID
indices.

1.) ID's of type "local" or "general" should be used. This means that
the ID's will have the syntax "lcl|IDENTIFIER" (for "local") or
"gnl|DATABASE|IDENTIFIER" (for "general"). The tokens DATABASE and
IDENTIFIER should be assigned by the user here. The local ID has only
one user provided token, the general ID requires two. The fields are
separated by vertical bars ("|").

2.) Letters, numbers, underscores ("_"), dashes, and periods may be
used. Uppercase and lowercase letters are treated as being distinct.
No spaces are allowed in the ID; a space indicates the end of the ID.

3.) All ID's should be unique, if the entire ID is examined. As an
example consider the following four ID's:

gnl|H.sapiens|seq1
gnl|H.sapiens|seq2
gnl|M.Mus|seq1
lcl|seq1

All of these ID's are considered unique. The first two might be
sequences one and two of a collection of Human sequences; the third
might be the first sequence in a collection of mouse sequences; the
fourth is simply identified as the first sequence.

ID's must fit into the framework described above to ensure that they
can be reliably parsed and indices built for them. This means that it
is not possible for users to invent new ID formats on the fly.
Examples of illegal ID's would be:

H.sapiens|seq1
gnl|H.sapiens|seq1|A

The first identifier is missing a database tag (e.g., no "lcl"), the second
identifier has an extra field since vertical bars separate fields. Illegal
ID's will not be processed by formatdb if the "-o" option is used.


F. General troubleshooting tips.

      The Latest Version: Make sure you are using the latest version of the
      formatdb executable. Earlier versions of formatdb may not recognize
      changes in the ASN.1 or FASTA definition line format of current BLAST
      databases or other sources of NCBI sequences.


G. Troubleshooting "SeqIdParse Failure" errors

      The most frequent cause of SeqIdParse Failure errors is the
      syntax of the FASTA definition lines in the source database file. Many
      third parties do not follow the syntax described in the "FASTA Defline
      Format" section above. If you are
      getting a SeqIdParse error this can often be eliminated by formatting
      the database with the -o option set to FALSE (e.g. -o F). The -o option
      is really not important for BLAST searching unless you are going to use
      the results to parse out the identifiers for searching Entrez and
      downloading the sequences. If you need to use the -o T option then your
      best option is to examine the definition lines of the database sequences
      and attempt to make them conform to the FASTA syntax.


H. Troubleshooting "FileOpen" errors.

      This is caused when the formatdb program cannot find the /data
      subdirectory. When you download and extract the Standalone BLAST
      executables, the formatdb program is located in the same directory as
      the /data subdirectory. If either of these moves, formatdb will not
      function without a .ncbirc file telling it where the /data subdirectory
      resides. Create a text file in the same directory as formatdb that
      contains the following lines:

        [NCBI]
        Data="path/data/"

      Where "path/data/" is the path to the location of the Standalone BLAST
      "data" subdirectory. For Example:

        Data=/root/blast/data

      For PC's this would be

        [NCBI]
        Data="C:\path\data\"

      Where "C:\path\data\" is the path to the location of the Standalone
      BLAST "data" subdirectory. For example:

        Data=C:\blast\data

      For Macintosh this would be a simpletext file called "ncbi.cnf", placed in
      System folder, that contains:

        [BLAST]
        BLASTDB=root:blast:data

      Where "root:blast:data" is the path to the location of the Standalone
      BLAST "data" subdirectory.
      For example on the machine named "LabMac":

        BLASTDB=LabMac:blast:data


Appendix 1: The Files Produced by Formatdb
------------------------------------------
Using formatdb without the "-o T" indexing option results in three
BLAST database files (.nhr, .nin, .nsq). Using the "-o T" option will result
in additional files.
If gi's are present in the FASTA definition lines of the source file, there
will be four additional files created (.nsd, .nsi, .nni, .nnd). These are
ISAM indices for mapping a sequence identifier to a particular sequence in
the BLAST database. If gi's are not used there will be only two additional
files created (.nsd, .nsi).
These files are listed below for both an example nucleotide and a protein
sequence database. The actual sequence data is stored in the files with
extension "nsq" or "psq". The compression ratio for these files is
about 4:1 for nucleotides and 1:1 for protein sequence source files.

Extension      Content              Format
---------------------------------------------
Nucleotide database formatted without "-o T"

nhr            deflines             binary

nin            indices              binary

nsq            sequence data        binary


Nucleotide database formatted with "-o T" add these ISAM files:

nnd            GI data              binary

nni            GI indices           binary

nsd            non-GI data          binary

nsi            non-GI indices       binary

The formatdb index files involving deflines are small relative to the
source database due to entries such as the one below in which the
defline is much shorter than the sequence.

>gi|5819095|ref|NC_001321.1| Balaenoptera physalus mitochondrion, complete genome
GTTAATTACTAATCAGCCCATGATCATAACATAACTGAGGTTTCATACATTTGGTATTTTTTTATTTTTTTTGGGGGGCT
TGCACGGACTCCCCTATGACCCTAAAGGGTCTCGTCGCAGTCAGATAAATTGTAGCTGGGCCTGGATGTATTTGTTATTT
GACTAGCACAACCAACATGTGCAGTTAAATTAATGGTTACAGGACATAGTACTCCACTATTCCCCCCGGGCTCAAAAAAC
TGTATGTCTTAGAGGACCAAACCCCCCTCCTTCCATACAATACTAACCCTCTGCTTAGATATTCACCACCCCCCTAGACA
GGCTCGTCCCTAGATTTAAAAGCCATTTTATTTATAAATCAATACTAAATCTGACACAAGCCCAATAATGAAAATACATG
AACGCCATCCCTATCCAATACGTTGATGTAGCTTAAACACTTACAAAGCAAGACACTGAAAATGTCTAGATGGGTCTAGC
CAACCCCATTGACATTAAAGGTTTGGTCCCAGCCTTTCTATTAGTTCTTAACAGACTTACACATGCAAGTATCCACATCC
CAGTGAGAACGCCCTCTAAATCATAAAGATTAAAAGGAGCGGGTATCAAGCACGCTAGCACTAGCAGCTCACAACGCCTC
GCTTAGCCACGCCCCCACGGGACACAGCAGTGATAAAAATTAAGCTATAAACGAAAGTTCGACTAAGTCATGTTAATTTA
....16398 bp total


Extension      Content              Format
------------------------------------------
Protein database formatted without "-o T"

phr            deflines             binary

pin            indices              binary

psq            sequence data        binary

Protein database formatted with "-o T" add these ISAM files:

pnd            GI data              binary

pni            GI indices           binary

psd            non-GI data          binary

psi            non-GI indices       binary

N.B.: The pre-formatted protein BLAST databases distributed by NCBI
(nr.*tar.gz and pataa.*tar.gz) contain a couple of extra files with the
extensions .ppi and .ppd. These are ISAM index and data files for looking
up entries in the database using PIG as the key (see fastacmd -P option).
These files cannot be generated by the stand alone formatdb binary included in
this distribution.

The formatdb index files involving deflines are large relative to the
source database due to entries such as the one below in which the
defline is much longer than the sequence.

>gi|229659|pdb|1AAP|A Chain A, Protease Inhibitor Domain Of Alzheimer's
Amyloid Beta-Protein Precursor (APPI)gi|229660|pdb|1AAP|B Chain B,
Protease Inhibitor Domain Of Alzheimer's Amyloid Beta-Protein Precursor
(APPI)
VREVCSEQAETGPCRAMISRWYFDVTEGKCAPFFYGGCGGNRNNFDTEEYCMAVCGSA


DISCLAIMER: The internal structure of the BLAST databases is subject
to change with little or no notice. The readdb API should be
used to extract data from the BLAST databases. Readdb is part
of the NCBI toolkit (ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools/),
readdb.h contains a list of supported function calls.

Last updated August 10 2006
</pre>
</body>
</html>