README

NCBI Prokaryotic Genomes Automatic Annotation Pipeline

This readme describes the directory and file structure for the automatic annotation pipeline.

For more information on the pipeline visit:
http://www.ncbi.nlm.nih.gov/genomes/static/Pipeline.html

The annotation procedures are on this page:
http://www.ncbi.nlm.nih.gov/genomes/static/Pipeline.html

Information on submission to GenBank can be found on this page:
http://www.ncbi.nlm.nih.gov/Genbank/genomesubmit.html

Information on submitting Genome Project information can be found here:
http://www.ncbi.nlm.nih.gov/genomes/mpfsubmission.cgi

SUBMISSIONS.

BEFORE YOU START:

Make sure that you have contacted us at:
genomes@ncbi.nlm.nih.gov

In order to run the automatic annotation pipeline we will need to ask you a few questions by email first in order to assess the amount of resources required.

The PGAAP is intended for the annotation of genomes in preparation for submission to GenBank. Genomes can either be complete or in WGS format (multiple contigs - at least 200 bases per contig) in any state of assembly.

If you are intending to submit a complete genome, then please submit the project and register a locus_tag first at this page:
http://www.ncbi.nlm.nih.gov/genomes/mpfsubmission.cgi

IMPORTANT! Make sure that the locus_tag that you submit matches the locus_tag you use in the files below.

Submission format.

Our annotation pipeline requires at least 3 files for either complete or WGS genomes. A fourth file containing assembly information can be added for WGS genomes if the assembly information is known.

IMPORTANT! All files should have the same prefix (locus_tag). This prefix will reflect the unique identifier used in the subsequent files and should contain an underscore. Additional information can be added to the file prefix after an underscore (such as date of submission). This prefix should match the locus_tag prefix you submitted in the project submission form above if you are submitting a complete genome.
NOTE: Everything before the underscore will be used as an identifier. The file suffix reflects the file contents. Therefore, each file will be named: _.

The prefix is used to add unique identifiers to every gene in the output, and therefore should follow the locus_tag format guidelines found here:
http://www.ncbi.nlm.nih.gov/genomes/locustag/Proposal.pdf

The prefix will also be used to produce directory structures for the output files (see below). In the description below, the prefix is referred to as "_".

IMPORTANT! Do not use zeros at the start of any identifiers.

The four files are:

*.email (required)
*.fasta (required)
*.template (required)
*.agp (optional for WGS submissions, and only if assembly information is known)

IMPORTANT! Use the complete organism name for your submissions. This should match the genus species strain name and should be the same in your project submission and in all the files you submit to us. If your organism does not have a taxid then NCBI will assign one.

Ex: "Escherichia coli K-12"
The full name in quotes should be used: NOT "Escherichia coli", NOT "E. coli", but "Escherichia coli K-12".

1. *.email

This file consists of the email address(es) to be used for communication between NCBI and the submitter.

2. *.fasta

This file consists of the FASTA sequence(s) for each contig to be annotated. All contigs from a single genome should be in the same file, including all chromosomes and plasmids.

FASTA headers are essential and should contain the following minimal amount of information:

A. A unique identifier
B. The organism
C. The strain for that particular organism
D. The genetic code

NOTE: There is no longer any requirement for the length of the contig in the FASTA header as it is now calculated from the submitted sequence.
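
EXAMPLE (illustrative only, not part of the pipeline): the file naming rules above (a shared prefix containing an underscore, three required suffixes, one optional suffix) could be checked before upload with a short script. This is a minimal sketch in Python. The function name check_submission and the sample file names are hypothetical, and the rule that "the identifier is everything before the first underscore" is our reading of the NOTE above.

```python
from pathlib import PurePath

# Suffixes named in this readme; .agp is optional (WGS only).
REQUIRED = {".email", ".fasta", ".template"}
OPTIONAL = {".agp"}


def check_submission(filenames):
    """Return a list of problems found in a set of submission file names."""
    problems = []
    identifiers = set()
    suffixes = set()
    for name in filenames:
        path = PurePath(name)
        stem, suffix = path.stem, path.suffix
        suffixes.add(suffix)
        if suffix not in REQUIRED | OPTIONAL:
            problems.append(f"{name}: unexpected suffix {suffix!r}")
        if "_" not in stem:
            problems.append(f"{name}: prefix has no underscore")
        # Assumption: the identifier is the text before the first underscore.
        identifiers.add(stem.split("_", 1)[0])
    for missing in sorted(REQUIRED - suffixes):
        problems.append(f"missing required file *{missing}")
    if len(identifiers) > 1:
        problems.append(
            "files do not share one identifier: " + ", ".join(sorted(identifiers))
        )
    return problems


if __name__ == "__main__":
    # Hypothetical file names, used only to show the call.
    files = ["ABC_20081006.email", "ABC_20081006.fasta", "ABC_20081006.template"]
    print(check_submission(files) or "file set looks consistent")
```

An empty list means the file names look consistent. It says nothing about the file contents, and it does not check the prefix against the locus_tag registered with NCBI.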
Information on FASTA headers can be found on this page:
http://www.ncbi.nlm.nih.gov/Sequin/modifiers.html

It should look like this:

>gnl|center| [organism=] [strain=] [gcode=11]

Note that there is a limit on the total length of the string in this part of the header:

gnl|center|

which should not contain more than 41 characters (see below).

FASTA sequences should not contain unnecessary line breaks or non-IUPAC bases.

3. Template file.

The template file contains information on the submitters and the submitting organization in ASN.1 format. The template file can be created using Sequin, which can be downloaded from here:
http://www.ncbi.nlm.nih.gov/Sequin/

For Genome Submissions, the template editor needs to be enabled. This requires an additional flag in the initialization file for Sequin, which is .sequinrc for the Unix version, Sequin.ini for the PC version, or Sequin.cnf for OSX. Under the settings field add GENOMECENTER=

EX:
[SETTINGS]
GENOMECENTER=NCBI

This will enable the template editor when Sequin is started. Start the template editor and enter the required information, including submitters (author names), submitting organization, and organism name, etc. Save the template file and send it along with the submission. The essential elements of the template file are the authors, the submitting organization (affiliation), the organism, and manuscript title (optional).

4. AGP file.

The AGP file is used for the submission of WGS genomes if the assembly information is known. Information on the AGP file format can be found here:
http://www.ncbi.nlm.nih.gov/genome/guide/Assembly/AGP_Specification.html

There is now an AGP validator available from this FTP archive for anonymous download:
ftp://ftp.ncbi.nlm.nih.gov/toolbox/ncbi_tools/cmdline/

For our genome annotation pipeline, it is preferable to have the ability to streamline submission to GenBank. Therefore, the file format is stricter and requires the following:

A. the identifiers should be of type "general" and contain the center information.
B. the identifiers for contigs need to match the identifiers in the FASTA headers exactly.

OUTPUT:

Note that the output formats are intended for manual curation. They can be used to submit to GenBank. However, GenBank will require some time to analyze the submission for potential problems. If you wish to submit to GenBank then you will need to inform us.

IMPORTANT! Do not edit the flatfiles if you intend to resubmit these files in the future after manual curation. These are merely produced for information purposes. If you wish to edit the files manually, you should edit the *.tbl files produced for every contig.

Output directory structure (will differ depending on WGS or complete genomes)

-> primary submit directory for submission
|
-->_ -> directory derived from each submission
   |
   ->Contigs -> directory containing list of directories for each contig
      |
      -> XXXX01 -> contig directory - ex. first contig
      ....
      -> XXXX99 -> contig directory - ex. last contig

File structure

directory
_.email -> email file for notification
_.fasta -> multi-fasta file for annotation
_.sequin -> template file including author names
_.agp -> agp file for submission

_.email -> single email address for notification
ex: rsmith@generic.com

_.fasta -> multi-fasta file
requires the following: fasta header
>gnl|center| [organism=] [strain=] [gcode=11]
>gnl|center| [organism=] [strain=] [gcode=11]
Note the length limitation on this part of the header:
gnl|center|
It should consist of 41 characters, maximum.
etc.

_.agp
1 2694 1 W gnl|center| 1 2694 +
etc.

.template -> asn format for submit block (authors, affiliation)

_
This directory will contain the same files as the directory above if there are multiple submissions.

In addition, for WGS genomes, there will be reconstructed *.gbf, *.gbf_g4, and *.sqn files dependent on the AGP assembly file.
These are provided solely for visualization and are not intended to be manually curated.

Supplementary file:
Frameshifts_intercontigs.list -> contains information on potential overlapping coding regions in adjacent contigs (see below)

Contig directories

Contig.fsa -> single fasta for contig
Contig.gbf -> single GenBank flatfile view of annotation for contig
Contig.sqn -> single asn file of above
Contig.tbl -> single feature table view of above (this should be used for annotation changes)
supplementary.3 -> supplementary file for the above contig
supplementary.rRNA -> supplementary file containing information on rRNAs that failed length validation
frameshifts -> information on frameshifts found by GeneMark
supplementary.fs -> information on frameshifts found by analyzing adjacent genes for hits to the same protein

FORMAT OF SUPPLEMENTARY FILES:

NOTE: THE SUPPLEMENTARY FILES ARE PROVIDED AS IS AND ARE SUBJECT TO CHANGE WITHOUT NOTICE. PLEASE USE AT YOUR OWN RISK.

supplementary.3 -> supplementary file

This supplementary file is provided for submitters who wish to examine the evidence used for annotation and for additional information that might be of interest during manual curation.
contains:

locus_tag -> for correlation between *.tbl, *.gbf, *.sqn files

Orig pos: -> +/- strand, position on the contig, partialness of the prediction (especially at the end of contigs) and position to the upstream/downstream CDSs (in codons) which shows overlap (if negative) with upstream/downstream CDSs

NB: overlap may be negative even though upstream/downstream CDS does not have negative overlap - this would mean the upstream/downstream prediction was removed from the final file - ex: prediction with BLAST hits significantly overlaps upstream prediction with no BLAST hits - the upstream prediction will be removed in the final file

COG name -> COG number and name for this CDS
CDD name -> CDD numbers and domains for this CDS
BLAST names -> top 5 BLAST hits for this CDS
    line1 -> name and organism from the hit
    line2 -> gi number, EC# (if present) and locus (gene name - if present) for the hit
    line3 -> evalue and position and length of the hit (coordinates)

Re: coordinate system
This is a complex system that reflects the length of the hit and the relative start and stop positions of the subject.
The middle number is the length of the hit between the query and subject.
The first number represents the distance from the start site of the query in relation to the start site of the subject.
The last number represents the distance from the stop site of the query in relation to the stop site of the subject.
A set of numbers like 0 <-- XXX --> 0 means that the start and stop match perfectly in both subject and query IRRESPECTIVE of the hit length. This means that the hit length might be very short, but if the query and subject are mapped in relation to each other, then the start and stops correlate.

supplementary.fs -> supplementary file

This file contains information on potential frameshifts by looking for adjacent genes that are hitting the same subject in a BLAST search.
It contains information on the two query sequences (in the genome being annotated), the single subject that is being detected in a BLAST search by each query, and information on the genomic location and the overlap with the subject, as well as information on the potential type of frameshift or internal stop codon that is occurring.

Frameshifts_intercontigs.list

This file is intended to help assembly of WGS genomes by detecting the same product on adjacent contigs that have similar BLAST hits.

Each entry lists products from 2 contigs that (a) are neighbors in the scaffold, (b) have fuzzy ends matching the ends of their contigs.

First two lines list 2 products. The first number is the computed distance (in nucleotides) between the two products, followed by the contig_id, and the direction of the product relative to the scaffold.

Third line lists the genomic location of the two products. The first number is the distance as noted above, the second number is the start of the left product relative to the scaffold, the number in parentheses is the start of the left product relative to the contig, the fourth is the end of the left product relative to the contig, and the fifth is the end of the left product relative to the scaffold (i.e. distance, scaffold start, contig start, contig end, scaffold end for the left product). Fuzziness is denoted by '>'. Then the distance is repeated and the same order is repeated for the right product (on the next contig in the scaffold).

Fourth line lists the structure of the frameshift candidate. Distance, total minimal length of product = length of left product plus distance between plus length of right product.

Additional files:

Discrepancy report is detailed here:
http://www.ncbi.nlm.nih.gov/Genbank/asndisc.html

Unicog and conserved PRK report.

Unicog shows conserved COG function, name, contig, and then nucleotide location.
Conserved PRK report shows taxonomically conserved function, PRK ID, conservation in which taxonomic branch (input sequence), and whether the function is completely missing from all protein and nucleotide sequences, or whether a potential hit was found in one of the nucleotide sequences with the location as in the above report.

notAGCT report.

This file lists nucleotides which do not correspond to AGCT and their position (does include IUPAC ambiguity symbols).

MANUAL CURATION.

NCBI recommends that submitters examine the output files prior to submission to GenBank. If any manual curation takes place, the *.tbl files should be modified as these are used to produce the submission to GenBank.

IMPORTANT! Do not modify the flatfiles or sequin files (*.gbf or *.sqn files) if you plan to make changes prior to submission to GenBank. Modify only the *.tbl files.

SUMMARY

1. Contact NCBI regarding your submission.
2. If it is a complete genome intended to be submitted to GenBank, register your project.
3. Taxids will be assigned if necessary.
4. Obtain an FTP account.
5. Submit the files for automatic annotation.

CONTACT:

If you have any questions, then please send an email to:
genomes@ncbi.nlm.nih.gov

LINKS:

All links are available in this section:
http://www.ncbi.nlm.nih.gov/genomes/static/Pipeline.html
http://www.ncbi.nlm.nih.gov/Genbank/genomesubmit.html
http://www.ncbi.nlm.nih.gov/genomes/mpfsubmission.cgi
http://www.ncbi.nlm.nih.gov/genomes/locustag/Proposal.pdf
http://www.ncbi.nlm.nih.gov/Sequin/modifiers.html
http://www.ncbi.nlm.nih.gov/Sequin/
http://www.ncbi.nlm.nih.gov/genome/guide/Assembly/AGP_Specification.html
ftp://ftp.ncbi.nlm.nih.gov/toolbox/ncbi_tools/cmdline/

Original: April 28/2005
Updated: Nov 26/2007
Updated: Oct 6/2008
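
EXAMPLE (illustrative only, not an NCBI tool): the FASTA requirements described in this readme can be checked locally before submission. These are a gnl|center| identifier of at most 41 characters, the organism, strain and gcode modifiers, contigs of at least 200 bases, and no non-IUPAC bases. The sketch below is in Python. The function names read_fasta and check_record are hypothetical, and it assumes that the 41-character limit excludes the leading ">" and that lower-case bases are acceptable.

```python
import re

# IUPAC nucleotide codes (assumption: case is ignored).
IUPAC = set("ACGTURYSWKMBDHVN")
# Assumption: the 41-character limit applies to "gnl|center|id" without ">".
GENERAL_ID = re.compile(r"^>(gnl\|[^|\s]+\|\S+)")


def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def check_record(header, seq, min_len=200, max_id_len=41):
    """Return a list of problems for one FASTA record."""
    problems = []
    match = GENERAL_ID.match(header)
    if match is None:
        problems.append("identifier is not of the form gnl|center|id")
    elif len(match.group(1)) > max_id_len:
        problems.append(f"identifier longer than {max_id_len} characters")
    for tag in ("organism", "strain", "gcode"):
        if f"[{tag}=" not in header:
            problems.append(f"missing [{tag}=...] modifier")
    if len(seq) < min_len:
        problems.append(f"contig shorter than {min_len} bases")
    # 1-based positions of symbols outside the IUPAC nucleotide alphabet.
    bad = [(i + 1, base) for i, base in enumerate(seq.upper()) if base not in IUPAC]
    if bad:
        problems.append(f"{len(bad)} non-IUPAC base(s), first at position {bad[0][0]}")
    return problems
```

A typical use would be to loop over read_fasta(open(path).read()) and print the problems reported for each header. The pipeline's own validation may apply different or additional rules.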