Entrez Nucleotide and Protein FAQs

Last Updated: September 19, 2006

The FAQs are divided into four sections. Please note that links to web pages will open in a new browser window.

A. GenBank nucleotide records, GenPept protein records, and fields within records;

B. Searching tips;

C. Display of Records, format; and

D. Entrez data


Section A. GenBank nucleotide records, GenPept protein records, and fields within records:

1. Why are there records that duplicate mine with NM_*, XM_*, and XP_* accession numbers?

2. My record needs to be updated. How do I correct it? What should I do if I find an error in a GenBank or RefSeq sequence record?

3. What does the date in the upper right-hand corner of a GenBank record mean?

4. How do I find out when a sequence record was released to the GenBank public database?

5. What is LinkOut?

6. Where can I find a description of the various fields in a GenBank record?

7. If a sequence has been updated, is it possible to retrieve earlier versions of it?

8. What are the sources of the Protein database sequences?

9. What is the "calculated Molecular Weight" displayed in protein records?

10. What is the 'DBSOURCE' field within a Protein record?

11. What do these symbols '>' and '<' mean when used in the features section of a nucleotide or protein record?


Answers. Section A. GenBank nucleotide records, GenPept protein records, and fields within records:


1. Why are there records that duplicate mine with NM_*, XM_*, and XP_* accession numbers?

Records with NM_*, XM_*, or other accessions in the format of two letters, an underscore, and 6 or 12 digits are reference sequences, or RefSeqs. RefSeqs are curated from one or more sequence records that were previously submitted directly to GenBank. For a complete explanation, including all of the accession number prefixes, see the RefSeq documentation for context on RefSeqs and a key to the RefSeq accessions.
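
As a rough illustration of how the accession prefix distinguishes RefSeq records from GenBank submissions, the sketch below looks up the first three characters of an accession in a small table. The table is deliberately partial and the function name is hypothetical; the full key to RefSeq accessions is the authoritative list.

```python
# Partial, illustrative table of RefSeq accession prefixes (not the full key).
REFSEQ_PREFIXES = {
    "NM_": "mRNA",
    "NP_": "protein",
    "XM_": "model (predicted) mRNA",
    "XP_": "model (predicted) protein",
    "NW_": "genomic contig or scaffold",
}


def describe_accession(accession):
    """Return a short description for a known RefSeq prefix, else None."""
    prefix = accession.strip().upper()[:3]
    return REFSEQ_PREFIXES.get(prefix)


print(describe_accession("NW_001149201"))  # RefSeq-style accession
print(describe_accession("U12345"))        # GenBank-style accession, prints None
```

An accession without one of these prefixes, such as U12345, is simply reported as not recognized; that does not by itself prove the record is a direct GenBank submission.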


2. My record needs to be updated. How do I correct it? What should I do if I find an error in a GenBank or RefSeq sequence record?

To update your own NCBI direct submissions, email a note that includes your accession number to the NCBI Annotation staff at gb-admin@ncbi.nlm.nih.gov, use the BankIt Update form, or use Sequin. If you have an update for an EST, STS, or GSS record, please email the update request to batch-sub@ncbi.nlm.nih.gov. If you have comments on or updates to a record that does not belong to you, please email the general NCBI Service Desk at info@ncbi.nlm.nih.gov, and be sure to provide the accession number of the record on which you are commenting.

3. What does the date in the upper right-hand corner of a GenBank record mean?

The date in the upper right-hand corner of a GenBank record, to the far right on the LOCUS line, is the date of last modification. In some cases, it might correspond to the first release date into GenBank or when the record was last updated, but there is no way to tell simply from the data in the record. See corresponding FAQ 4. See the sample GenBank record for field descriptions.
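
For readers who process flat files, a minimal sketch of reading that date from the LOCUS line follows. It assumes the date is the last whitespace-separated field and has the form DD-MON-YYYY; the LOCUS line in the example is made up for illustration and is not a real record.

```python
from datetime import date

# English month abbreviations as they appear on the LOCUS line.
MONTHS = {
    "JAN": 1, "FEB": 2, "MAR": 3, "APR": 4, "MAY": 5, "JUN": 6,
    "JUL": 7, "AUG": 8, "SEP": 9, "OCT": 10, "NOV": 11, "DEC": 12,
}


def locus_modification_date(locus_line):
    """Return the last-modification date found at the far right of a LOCUS line."""
    fields = locus_line.split()
    if not fields or fields[0] != "LOCUS":
        raise ValueError("not a LOCUS line")
    day, month, year = fields[-1].split("-")
    return date(int(year), MONTHS[month.upper()], int(day))


# Hypothetical LOCUS line, for illustration only.
EXAMPLE_LOCUS = "LOCUS       EXAMPLE01     270 bp    DNA     linear   PRI 19-SEP-2006"
print(locus_modification_date(EXAMPLE_LOCUS))
```

The value returned is only the date of last modification; as noted above, it cannot be used to infer the original release date.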

4. How do I find out when a sequence record was released to the GenBank public database?

To find out the approximate date on which a GenBank record was first released, send an email message listing the accessions of interest to the general NCBI Service Desk at info@ncbi.nlm.nih.gov.

5. What is LinkOut?

LinkOut allows publishers, aggregators, libraries, biological databases, sequence centers, and other Web resources to display links to their sites on items from the Entrez databases. These links can take you to the provider's site to obtain the full-text of articles or related resources, e.g., consumer health information or genome centers. There may be a charge to access the text or information. The current list of LinkOut providers is available here:
www.ncbi.nlm.nih.gov/entrez/journals/active_providers.html

6. Where can I find a description of the various fields in a GenBank record?

For a description of the various fields in a GenBank record, see the sample record: www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html

7. If a sequence has been updated, is it possible to retrieve earlier versions of it?

Earlier versions of a GenBank record are available. If there was a change in the sequence, there will be a link within the GenBank record COMMENT field stating that the current sequence replaces or is replaced by GI number xxxxx. If the change was not to the actual bases of the sequence, the older versions of the GenBank record are accessed from the Sequence Revision History link. Example: U12345

See the Reports link that appears for each individual Entrez record result. Select Revision History from the Reports link menu.

8. What are the sources of the Protein database sequences?

The protein sequences in the NCBI Protein database come from several different sources. There are GenPept translations for each of the coding sequences within the GenBank Nucleotide database. That means that there can be more than one protein sequence associated with a corresponding Nucleotide sequence record. Example: DQ489526

Records from other databases, such as UniProt (which has subsumed PIR and Swiss-Prot records), are loaded periodically when builds become available. A simple search that limits records to a specific component database within the Entrez Protein database is:

srcdb_swiss-prot[prop]

9. What is the "calculated Molecular Weight" displayed in protein records?

The calculated molecular weight '/calculated_mol_wt= ' as seen in protein records is calculated as part of the indexing process for protein records in Entrez. Entrez's molecular weight is an average molecular weight, not monoisotopic. Masses are rounded to the nearest integer. The weights are present only in the Molecular Weight index and are not shown explicitly on the protein sequence records. If completely unknown amino acids (e.g., "X") are found, a molecular weight is not calculated. Ambiguous amino acids are calculated as one of their possible forms:

B means D or N -- molecular weight is calculated as D
Z means E or Q -- molecular weight is calculated as E

See the Entrez Help document.
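
The rules above can be sketched in a few lines. This is an approximation under stated assumptions, not the Entrez indexing code: the residue masses are standard average (not monoisotopic) values, one water molecule is added for the free termini, and the function name is hypothetical. The source does not say which mass table or terminal correction Entrez uses.

```python
# Standard average residue masses in daltons (assumed; the Entrez table is not published here).
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528  # assumed correction for the free N- and C-termini

# Ambiguous codes are resolved as described in the text: B as D, Z as E.
AMBIGUOUS = {"B": "D", "Z": "E"}


def calculated_mol_wt(sequence):
    """Return the average molecular weight rounded to an integer, or None."""
    if not sequence:
        return None
    total = WATER
    for residue in sequence.upper():
        residue = AMBIGUOUS.get(residue, residue)
        if residue not in AVERAGE_RESIDUE_MASS:
            return None  # completely unknown residue such as X: no weight calculated
        total += AVERAGE_RESIDUE_MASS[residue]
    return round(total)


print(calculated_mol_wt("B") == calculated_mol_wt("D"))  # True
print(calculated_mol_wt("GXA"))                          # None
```

Because the mass table and terminal correction are assumptions, values from this sketch may differ slightly from those in the Entrez Molecular Weight index.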

10. What is the 'DBSOURCE' field within a Protein record?

The 'DBSOURCE' field within a Protein record shows the source of protein records imported from other databases.

11. What do these symbols '>' and '<' mean when used in the features section of a nucleotide or protein record?

The '<' and '>' symbols used in the features section of a nucleotide record, as in DQ882243 for example, mean that the feature is partial on the 5' and 3' ends, respectively. In the case below, the start and stop codons are missing:

gene <1..>270
/gene="HLA-DRB1"
/allele="HLA-DRB1*1449 variant"
mRNA <1..>270

In a protein record such as ABI31835, which is the GenPept translation for the DQ882243 nucleotide record, the '<' and '>' symbols mean that the protein translation is partial:

Protein <1..>89
/product="MHC class II antigen"
CDS 1..89
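
For scripts that read feature tables, the partial markers can be picked out with a small parser. The sketch below handles only the simple start..end form shown in the examples above; it does not handle join(), complement(), or other compound locations, and the function name is hypothetical.

```python
import re

# Matches simple locations such as "<1..>270" or "1..89".
SIMPLE_LOCATION = re.compile(r"^(<?)(\d+)\.\.(>?)(\d+)$")


def parse_location(location):
    """Split a simple feature location into coordinates and partial flags."""
    match = SIMPLE_LOCATION.match(location.strip())
    if match is None:
        raise ValueError(f"unsupported location: {location!r}")
    return {
        "start": int(match.group(2)),
        "end": int(match.group(4)),
        "partial_start": match.group(1) == "<",  # '<' before the first coordinate
        "partial_end": match.group(3) == ">",    # '>' before the last coordinate
    }


print(parse_location("<1..>270"))
print(parse_location("1..89"))
```

For a feature on the forward strand of a nucleotide record, a partial start corresponds to the 5' end and a partial end to the 3' end.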


Section B. Searching tips:


1. Are there standard keywords in Entrez GenBank that should be used for searching? How do I limit my retrieval to a specific field name, organism (like Xenopus laevis), biomolecule (like genomic DNA), or GenBank division (like EST or expressed sequence tag)?

2. How do I search for a gene sequence?

3. Can I retrieve a large dataset for a particular organism, for example?

4. How can I download data from the Nucleotide and Protein databases?

5. Can I store a search, update the stored search, run the stored search multiple times, and then save those search results?

6. How do I make search URLs for retrieving accession numbers or GIs or other record identifiers?

7. My search keeps returning messages that a term is not found. What can I do?

8. How do I search for sequences annotated with a specific Enzyme Commission number?

9. How can I perform a search to see all records in a specific Entrez database?



Answers. Section B. Searching tips:

1. Are there standard keywords in Entrez GenBank that should be used for searching? How do I limit my retrieval to a specific field name, organism (like Xenopus laevis), biomolecule (like genomic DNA), or GenBank division (like EST or expressed sequence tag)?

Use the Entrez Preview/Index option to view the different terms that are indexed in the GenBank records. This is necessary when searching Entrez because submitters are not required to use standard keywords when submitting sequences. The Preview/Index option is available from the search toolbar on the Entrez database pages:
See the toolbar at www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Nucleotide. Select 'Preview/Index' and, in the 'Add Terms to Query or Preview Index' section, enter the phrase "heat shock protein" and select 'Index'. The resulting list contains the terms that are indexed in GenBank records. Also try HSP in the 'Index' section to see that records can be indexed with synonymous terms. Note that in PubMed (the MEDLINE abstracts database), literature can be searched using MeSH (Medical Subject Headings) terms.

2. How do I search for a gene sequence?

Search in Nucleotide using the [gene] and [organism] qualifiers, in the form gene symbol[gene] AND genus species[organism]
e.g., brca1[gene] AND mouse[orgn]

or search in the Entrez Gene database with the following query to find the RefSeq mRNA and protein: gene_symbol[sym] AND genus_species[orgn]
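
The same fielded query can be sent programmatically. The sketch below only builds the query string and an E-utilities ESearch URL; it does not contact NCBI. The base URL and the database name "nuccore" reflect the current E-utilities service and are assumptions relative to this FAQ, and the helper names are hypothetical.

```python
from urllib.parse import urlencode

# Assumed current E-utilities endpoint; not stated in the FAQ text.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def gene_query(symbol, organism):
    """Combine a gene symbol and an organism using the [gene] and [orgn] qualifiers."""
    return f"{symbol}[gene] AND {organism}[orgn]"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL for the given database and query term."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


term = gene_query("brca1", "mouse")
print(term)
print(esearch_url("nuccore", term))
```

urlencode percent-encodes the square brackets and spaces, so the query can be pasted into a browser or passed to any HTTP client.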

3. Can I retrieve a large dataset for a particular organism, for example?

For large datasets, you can formulate a search in Entrez, display all the records in your desired format, and then save using the Send to file option from the toolbar. Confirm the message that asks if you want to download xxx number of records. You can also use Batch Entrez to download a database-specific file of accessions or GIs.

4. How can I download data from the Nucleotide and Protein databases?

You can get the current GenBank release and daily updates from the GenBank FTP link at www.ncbi.nlm.nih.gov/Ftp. You can obtain the RefSeq build from the RefSeq FTP link at ftp.ncbi.nih.gov/refseq. See the BLAST link from www.ncbi.nlm.nih.gov/Ftp for access to datasets available for download.

5. Can I store a search, update the stored search, run the stored search multiple times, and then save those search results?

Use My NCBI:

Log into My NCBI, perform a search in the desired database, and click the 'Save Search' link to the right of the query box on the search toolbar.
This saves the search strategy. See My NCBI for the Help, FAQ, and Quick Tour.

6. How do I make search URLs for retrieving accession numbers or GIs or other record identifiers?

Go to the Entrez Tools page and select the E-Utils link.

To link to specific Entrez pages from your Web page or application, select 'Linking to Entrez'.
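
As one possible shape for such a retrieval URL, the sketch below builds an E-utilities EFetch request for one or more accessions. It constructs the URL only and makes no network call. The base URL, the database name "nuccore", and the default rettype and retmode values are assumptions based on the current E-utilities service; consult the E-Utils documentation for the options that apply to your database.

```python
from urllib.parse import urlencode

# Assumed current E-utilities endpoint; not stated in the FAQ text.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(db, ids, rettype="gb", retmode="text"):
    """Build an EFetch URL that retrieves the given accessions or GIs."""
    params = {
        "db": db,
        "id": ",".join(ids),  # several identifiers are sent as a comma-separated list
        "rettype": rettype,   # e.g. "gb" for GenBank flat file, "fasta" for FASTA
        "retmode": retmode,
    }
    return EFETCH + "?" + urlencode(params)


print(efetch_url("nuccore", ["U12345"]))
print(efetch_url("nuccore", ["U12345", "DQ882243"], rettype="fasta"))
```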

7. My search keeps returning messages that a term is not found. What can I do?

Select the Details tab from the search toolbar to see how the query is being translated from the search terms you entered. You can edit the search in the Details page or use Preview/Index to explore alternate search fields.

8. How do I search for sequences annotated with a specific Enzyme Commission number?

Start in either the Nucleotide or Protein database and enter the Enzyme Commission (EC) number followed by the field limiter [ecno].
Example: 1.1.1.53[ecno]
A more general search can be done by entering a truncated EC number: 1.1.1*[ecno]
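
When EC numbers come from a spreadsheet or another program, a small helper can format them as [ecno] terms. This is an illustrative sketch with a hypothetical function name; the validation pattern (one to four dot-separated integers) is an assumption about well-formed input, not an NCBI rule.

```python
import re

# One to four dot-separated integers, e.g. "1.1.1.53" or "1.1.1".
EC_NUMBER = re.compile(r"^\d+(\.\d+){0,3}$")


def ecno_term(ec_number, truncate=False):
    """Format an EC number as an Entrez [ecno] term, optionally truncated with '*'."""
    ec = ec_number.strip()
    if not EC_NUMBER.match(ec):
        raise ValueError(f"not a recognizable EC number: {ec_number!r}")
    suffix = "*" if truncate else ""
    return f"{ec}{suffix}[ecno]"


print(ecno_term("1.1.1.53"))              # 1.1.1.53[ecno]
print(ecno_term("1.1.1", truncate=True))  # 1.1.1*[ecno]
```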

9. How can I perform a search to see all records in a specific Entrez database?

Enter the following search in the search field for the specific database: all[filter]. This will provide the number of records for that database.

Section C. Displaying records and formats:


1. In what order are the resulting records displayed in Entrez and can I sort my results?

2. How do I display the sequence (bases) for some records that have only the join information instead of the whole sequence in the record?

3. Why are there N's in sequences in GenBank, example: NW_001149201?

4. What are the BLink links on the Document Summary or results page for a Protein database search?



Answers. Section C. Display of Records, format:

1. In what order are the resulting records displayed in Entrez and can I sort my results?

GenBank records are displayed generally in a 'last into the database first displayed' order. In Nucleotide and Protein databases one can sort by accession numbers by selecting the 'Sort by accession' pull down menu from the Protein results page or from the results page for each of the Nucleotide component databases: CoreNucleotide, EST and GSS.


2. How do I display the sequence (bases) for some records that have only the join information instead of the whole sequence in the record?

To display the sequence for a contig record, a record where accession number join information has been provided in place of the sequence, select and then display the FASTA format. This will provide the entire sequence without line numbers in a single web page.
An example is a Whole Genome Shotgun record: NW_001149201. Note the N's in the sequence, which represent gaps.
CONTIG
join(AANU01169770.1:1..10827,gap(29605),AANU01169771.1:1..7919,
gap(86),complement(AANU01169772.1:1..6773))
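
To see how much of such a record is real sequence and how much is gap, the CONTIG statement can be tallied directly. The sketch below is a simplified reading of the join syntax shown above: it recognizes only accession:start..end segments and gap(length) entries, and the function name is hypothetical.

```python
import re

# Either a gap of known length or an accession.version:start..end segment.
SEGMENT = re.compile(r"gap\((\d+)\)|[A-Z_0-9.]+:(\d+)\.\.(\d+)")

EXAMPLE_CONTIG = """join(AANU01169770.1:1..10827,gap(29605),AANU01169771.1:1..7919,
gap(86),complement(AANU01169772.1:1..6773))"""


def contig_lengths(contig):
    """Return (sequence_bases, gap_bases) described by a CONTIG join statement."""
    text = "".join(contig.split())  # remove line breaks and indentation
    sequence_bases = 0
    gap_bases = 0
    for match in SEGMENT.finditer(text):
        if match.group(1) is not None:
            gap_bases += int(match.group(1))
        else:
            sequence_bases += int(match.group(3)) - int(match.group(2)) + 1
    return sequence_bases, gap_bases


sequence_bases, gap_bases = contig_lengths(EXAMPLE_CONTIG)
print(sequence_bases, gap_bases, sequence_bases + gap_bases)  # 25519 29691 55210
```

For the join statement above, the gaps account for more than half of the 55,210 positions, which is why long runs of N's appear when the record is displayed as FASTA.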

3. Why are there N's in sequences in GenBank, example: NW_001149201?

The N's represent a gap in a contig sequence such as a Whole Genome Shotgun (WGS) sequence. Click the 'expand N's' link to uncompress the N's and see the entire sequence, including the gap N's.

4. What are the BLink links on the Document Summary or results page for a Protein database search?

BLink means "BLAST link" and shows pre-calculated BLAST hits for protein sequences in the Entrez Proteins data domain. BLink shows graphical output of pre-computed blastp results against the protein non-redundant (nr) database. See the BLink Help document for further details: www.ncbi.nlm.nih.gov/sutils/static/blinkhelp.html

Section D. Entrez Data:


1. How often are the Entrez Nucleotide and Protein databases updated?


Answers. Section D. Entrez data:

1. How often are the Entrez Nucleotide and Protein databases updated?

The Nucleotide database is updated every day. Records from the International Collaboration databases DDBJ and EMBL are added in a nightly build. The protein translations are added every night. For UniProt records, updates are processed when UniProt provides a new "cumulative update" at its FTP site, which is about twice per month.
