Changes to the BLAST Databases
March 8, 1996

This announcement describes a reorganization of the databases available
for BLAST searches at the National Center for Biotechnology Information
(NCBI). The same sequence data will be available for searching, but it
will be organized for more efficient searching and will be better
synchronized with the Entrez databases. The major differences will be
the elimination of EST and STS sequences from 'nr' (the non-redundant
database) and the introduction of a database ('month') containing only
the sequences added over the past 30 days. Another change is a new
definition line for protein sequences.

WWW Blast and E-mail Blast users will switch to the new set of databases
beginning March 11, 1996. Most users search 'nr'; for them the change
should be minimal because the database name will stay the same, although
EST and STS sequences will no longer be searched.

For users of Network Blast, a new client (Blast2) is being introduced
that will not only search the new set of databases but also provide a
better interface for post-processing search results. Blast2 represents
the future direction of the Blast service, and users of the existing
Blast software, known as the 'Experimental' Blast service, are
encouraged to upgrade to Blast2. However, the existing 'Experimental'
Blast clients will be able to operate with the new databases (see
Appendix 3 for technical details).

Blast2 clients are available now by FTP, and users of the 'Experimental'
Blast clients are able to use the new databases now. Beginning March 11,
1996, the old databases will no longer be available and both the
Experimental and Blast2 clients will use the new databases.

These changes are described further below in the following topics:

  * New databases
  * Sequence identifiers
  * The Blast2 service
  * A new Entrez-based e-mail server
  * Databases on the FTP site

Comments about these changes are welcome; please send them to
blast-help@ncbi.nlm.nih.gov.
For information about other NCBI services, send e-mail to:
info@ncbi.nlm.nih.gov

===========================================================================

New Databases:

Presently both the old and the new databases are available. The old
databases will remain available until March 11, 1996, after which only
the new databases will be available. The new databases are now
searchable with the Network version of Blast (see Appendix 3).

New nucleotide databases:

  nr         Non-redundant GenBank+EMBL+DDBJ+PDB sequences
             (but no ESTs or STSs)
  est        Non-redundant Database of GenBank+EMBL+DDBJ EST Division
  sts        Non-redundant Database of GenBank+EMBL+DDBJ STS Division
  pdb        PDB nucleotide sequences
  vector     Vector subset of GenBank
  mito       Database of mitochondrial sequences, Rel. 1.0, July 1995
  kabat      Kabat Sequences of Nucleic Acid of Immunological Interest
  epd        Eukaryotic Promoter Database
  alu        Select Alu Repeats from REPBASE
  month      All new or revised GenBank+EMBL+DDBJ+PDB sequences
             released in the last 30 days

New protein databases:

  nr         Non-redundant GenBank CDS translations+PDB+SwissProt+PIR
  pdb        PDB protein sequences
  spdb       Non-redundant SwissProt+PDB sequences
  kabat      Kabat Sequences of Proteins of Immunological Interest
  alu        Translations of Select Alu Repeats from REPBASE
  month      All new or revised GenBank CDS translation+PDB+SwissProt+PIR
             sequences released in the last 30 days
  swissprot  SwissProt sequences

===========================================================================

Sequence identifiers for the new databases:

The one-line descriptions for GenBank conceptual translations will
change. The present descriptions describe the conceptual translation of
a CDS in terms of the GenBank flatfile, but do not reliably point to a
specific CDS if the order or number of CDS features changes. An example
is:

  "gp|U04987|SIU04987_4 env gene product [Simian immunodef..."

Here "SIU04987_4" indicates that this protein is the fourth CDS on the
entry with the accession U04987.
Changes to the GenBank entry can change the order and number of CDS
features. Therefore, in order to identify the specific protein sequence,
NCBI is now assigning a stable identifier, called a 'gi', to all
sequences. A 'gi' is a unique integer that changes when the sequence
changes. It does not change, however, if only the features or references
of an entry are updated. The new format for protein sequences will
contain the identifier 'gi' followed by the 'gi number':

  "gi|451623 (U04987) env [Simian immunodeficiency..."

Although the accession number of the translated nucleotide sequence will
appear in the header line (U04987 in the example above), retrieval by
'gi number' is the only reliable method to locate the correct translated
DNA sequence. The new e-mail retriever (see below) or Entrez may be used
to retrieve sequences identified by 'gi'.

An exhaustive list of sequence identifiers used in these new databases
is provided in Appendix 1. Additional examples of definition lines are
provided in Appendix 2.

===========================================================================

The Blast2 service:

Blast2 is the newest version of the BLAST client software and represents
the foundation for NCBI's future development of the BLAST service. The
Blast2 service permits BLAST searches with a number of different clients
for different platforms, available on the NCBI FTP site. These clients
can be obtained by FTP from ncbi.nlm.nih.gov (login as anonymous and cd
to blast/network/blast2).

In contrast to the present BLAST service (designated 'experimental'),
these clients communicate with the BLAST server through a structured
interface, allowing BLAST to interface better with other programs, e.g.,
post-processing programs. The Blast2 service already uses the new
databases.

Although Blast2 is expected to eventually replace the 'experimental'
Blast clients, NCBI will continue to support the 'experimental' Blast
client for the near future.
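
The anonymous FTP steps given above for the Blast2 clients (connect to
ncbi.nlm.nih.gov, log in as anonymous, cd to blast/network/blast2) can
also be scripted. The following is a minimal sketch in Python using the
standard ftplib module. The function name list_ftp_directory and its
ftp_factory parameter are illustrative assumptions and are not part of
any NCBI software; the host and directory are simply the ones quoted in
this announcement.

```python
from ftplib import FTP


def list_ftp_directory(host="ncbi.nlm.nih.gov",
                       directory="blast/network/blast2",
                       ftp_factory=FTP):
    """Return the entry names in `directory` on an anonymous FTP server.

    Hypothetical helper: host and directory default to the values quoted
    in this announcement. `ftp_factory` exists only so that a stand-in
    class can replace ftplib.FTP when no network is available.
    """
    ftp = ftp_factory(host)      # open the control connection
    try:
        ftp.login()              # no arguments means an anonymous login
        ftp.cwd(directory)       # same as "cd blast/network/blast2"
        return ftp.nlst()        # names of the entries in that directory
    finally:
        ftp.close()              # always release the connection
```

Calling list_ftp_directory() with no arguments would attempt the
connection described above; passing directory="blast/db" would list the
database directory mentioned later in this announcement instead.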
===========================================================================

New Entrez-based e-mail retrieval server ("QUERY"):

QUERY uses the Entrez Query Engine to obtain data. Entrez can retrieve
data by domain (i.e., nucleotide or protein) rather than by source
database. QUERY can retrieve entries by 'gi' (see above) and is
synchronized with the new BLAST databases.

To receive documentation about this service, send an e-mail message to
"query@ncbi.nlm.nih.gov". The body of the message should consist of the
word "help" (without quotes).

===========================================================================

Databases on the FTP site:

All the databases listed above are available as FASTA files from the
NCBI FTP site (ncbi.nlm.nih.gov). These FASTA files are not necessary to
perform BLAST searches using the BLAST clients discussed here. They are
needed only if one wishes to run the actual BLAST search engines
in-house, rather than sending BLAST queries to the NCBI. To obtain these
files, FTP to ncbi.nlm.nih.gov, login as anonymous and cd to "blast/db".
These files are compressed and should be transferred in binary mode.

A FASTA file ("genpept.fsa") containing all the proteins in the GenBank
release will also be available from the NCBI FTP site, in the directory
"genbank". The one-line headers in this FASTA file have the same format
as those presented in Appendix 1. Daily updates to this file are
provided as "gpcu.fsa" in the directory "genbank/daily". These files
serve as replacements for "genpept.fasta" and "gpcu.fasta", which will
be discontinued on March 25, 1996.

===========================================================================

Appendix 1: Sequence Identifier Syntax

The syntax of sequence header lines used by the NCBI BLAST server
depends on the database from which each sequence was obtained. The table
below lists the identifiers for the databases from which the sequences
were derived.
  Database Name                     Identifier Syntax
  ============================      ========================
  GenBank                           gb|accession|locus
  EMBL Data Library                 emb|accession|locus
  DDBJ, DNA Database of Japan       dbj|accession|locus
  NBRF PIR                          pir||entry
  Protein Research Foundation       prf||name
  SWISS-PROT                        sp|accession|entry name
  Brookhaven Protein Data Bank      pdb|entry|chain
  Kabat's Sequences of Immuno...    gnl|kabat|identifier
  Patents                           pat|country|number
  GenInfo Backbone Id               bbs|number

For example, an identifier might be "gb|M73307|AGMA13GT", where the "gb"
tag indicates that the identifier refers to a GenBank sequence, "M73307"
is its GenBank ACCESSION, and "AGMA13GT" is the GenBank LOCUS.

'gi' identifiers are being assigned by NCBI for all sequences contained
within NCBI's sequence databases. The 'gi' identifier provides a uniform
and stable naming convention whereby a specific sequence is assigned its
unique gi identifier. If a nucleotide or protein sequence changes,
however, a new gi identifier is assigned, even if the accession number
of the record remains unchanged. Thus gi identifiers provide a mechanism
for identifying the exact sequence that was used or retrieved in a given
search.

For searches of the nr protein database where the sequences are derived
from conceptual translations of sequences from the nucleotide databases,
the following syntax is used:

  gi|gi_identifier

An example would be:

  "gi|451623 (U04987) env [Simian immunodeficiency..."

where '451623' is the gi identifier and 'U04987' is the accession number
of the nucleotide sequence from which it was derived.
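
The identifier syntax in the table above lends itself to a small parser.
The following is a minimal sketch in Python, assuming that the fields of
an identifier never themselves contain a "|" character. The names
ID_FIELDS, parse_identifier and parse_translation_header are
hypothetical and chosen for illustration only; the field names are
transcribed from the table, and the "gi" entry reflects the
gi|gi_identifier syntax described above.

```python
import re

# Field names per tag, transcribed from the table in this appendix.
# None marks a position that the table shows as empty (e.g. "pir||entry").
ID_FIELDS = {
    "gb": ("accession", "locus"),
    "emb": ("accession", "locus"),
    "dbj": ("accession", "locus"),
    "pir": (None, "entry"),
    "prf": (None, "name"),
    "sp": ("accession", "entry_name"),
    "pdb": ("entry", "chain"),
    "gnl": ("database", "identifier"),   # the table shows gnl|kabat|identifier
    "pat": ("country", "number"),
    "bbs": ("number",),
    "gi": ("number",),
}


def parse_identifier(text):
    """Split one identifier such as 'gb|M73307|AGMA13GT' into named fields."""
    tag, *values = text.strip().split("|")
    if tag not in ID_FIELDS:
        raise ValueError("unknown identifier tag: %r" % tag)
    names = ID_FIELDS[tag]
    if len(values) != len(names):
        raise ValueError("wrong number of fields in %r" % text)
    fields = {"tag": tag}
    for name, value in zip(names, values):
        if name is not None:             # skip the always-empty slot
            fields[name] = value
    return fields


# Assumed layout of a translation header: gi|<number> (<accession>) <text>
_TRANSLATION = re.compile(r"^gi\|(\d+)\s+\((\w+)\)\s+(.*)$")


def parse_translation_header(line):
    """Return (gi number, nucleotide accession, description) for a header."""
    match = _TRANSLATION.match(line.strip())
    if match is None:
        raise ValueError("not a gi translation header: %r" % line)
    return int(match.group(1)), match.group(2), match.group(3)
```

With these definitions, parse_identifier("gb|M73307|AGMA13GT") yields
the tag "gb", the accession "M73307" and the locus "AGMA13GT", matching
the example given above.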
Users are encouraged to use the '-gi' option for Blast output. It
produces a header line in which the gi identifier is concatenated with
the identifier of the database from which the sequence was derived. For
example, from a nucleotide database:

  gi|176485|gb|M73307|AGMA13GT

And similarly for protein databases:

  gi|129295|sp|P01013|OVAX_CHICK

Appendix 2: Examples of sequence header lines in Blast output

Protein:

  gi|808969 (V00383) reading frame [Gallus gallus]               641  4.6e-99   2
  gi|763101 (V00387) seventh exon [Gallus gallus]                690  2.6e-90   1

(Note: gi numbers are used for GenBank translated sequences; other
protein sequences are designated according to database of origin, e.g.,
Swiss-Prot, PDB, PRF.)

  sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED). ...  1191  3.0e-159  1
  sp|P01014|OVAY_CHICK GENE Y PROTEIN (OVALBUMIN-RELATED). ...   949  2.7e-126  1
  pdb|1OVA|A Ovalbumin (Egg Albumin) >pdb|1OVA|B ...             645  1.3e-99   2
  prf||0705172A ovalbumin [Gallus gallus]                        645  1.3e-99   2

Nucleotide:

  gb|U37104|APU37104 Aethia pusilla cytochrome b gene, mi...    1672  1.2e-133  1
  gb|U37087|ACU37087 Aethia cristatella cytochrome b gene...    1627  5.7e-133  2
  emb|F19596|HSPD04201 H.sapiens mitochondrial EST sequence...   997  3.9e-77   1
  emb|F19081|HSPD03679 H.sapiens mitochondrial EST sequence...   939  2.8e-72   1
  gb|L44587|CALMTCYBF Callithrix emiliae (clones CEM 1, CE...    785  4.0e-59   1
  gb|L44588|CALMTCYBFA Callithrix jacchus (clones CJA1, CJA...   695  1.5e-51   1

Appendix 3: Technical details

The new databases may be searched with the existing ('experimental')
client by connecting to a port other than the default. The
'experimental' service normally uses port 5555 (service is "blast"). The
new databases are available by connecting to port 5559 (service is
"xblast"). 'Experimental' clients for UNIX, using "xblast", are
available from the NCBI FTP site under "blast/network/experimental/unix".

Blast2 clients search only the new databases and are available now on
the NCBI FTP site in blast/network/blast2.
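
Returning to the header lines of Appendices 1 and 2: post-processing
programs typically need to split such lines into their parts. The
following is a minimal sketch in Python. It assumes that every header
line ends in exactly three whitespace-separated numeric columns, and it
further assumes, since this announcement does not label them, that these
columns are a score, a probability and a segment count. The function
names split_gi_identifier and parse_hit_line are hypothetical.

```python
def split_gi_identifier(identifier):
    """Split a '-gi' style identifier into (gi number, source identifier).

    'gi|176485|gb|M73307|AGMA13GT' -> (176485, 'gb|M73307|AGMA13GT')
    A bare 'gi|808969' has no source part, so None is returned for it.
    """
    parts = identifier.split("|", 2)
    if len(parts) < 2 or parts[0] != "gi" or not parts[1].isdigit():
        raise ValueError("not a gi identifier: %r" % identifier)
    source = parts[2] if len(parts) == 3 else None
    return int(parts[1]), source


def parse_hit_line(line):
    """Split one header line from Blast output into its parts.

    Assumption: the last three columns are score, probability and
    segment count; the announcement itself does not name them.
    """
    # Peel the three numeric columns off the right-hand end of the line.
    head, score, probability, count = line.strip().rsplit(None, 3)
    # The identifier is the first token; the rest is the description.
    tokens = head.split(None, 1)
    return {
        "identifier": tokens[0],
        "description": tokens[1] if len(tokens) > 1 else "",
        "score": int(score),
        "probability": float(probability),
        "count": int(count),
    }
```

A line that does not end in three numeric columns raises ValueError, so
malformed input is reported rather than silently misparsed.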