Using tbl2asn to Prepare an HTG Submission

This document assumes that you are already familiar with the Sequin program. If you are not, please visit the Sequin home page at http://www.ncbi.nlm.nih.gov/Sequin.


Setting Up tbl2asn

Basic information about the commandline program tbl2asn is available on the tbl2asn home page. The program itself can be downloaded by anonymous FTP from ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools/converters/by_program/tbl2asn/.

Download the version for your platform, uncompress the file, rename it to "tbl2asn", and set the file permissions as necessary for the platform.


Creating a Sequin Submission Template File

Before you create your first HTG submission, you need to make a Sequin submission template that contains contact and citation information. You can then use this template for subsequent submissions.

To create the template, follow the instructions on the tbl2asn page. In Sequin, click the "Start New Submission" button on the initial Welcome to Sequin form. All of the information from the subsequent Submitting Authors form will be used for the HTG record. Enter a manuscript title, if desired. Fill out the Contact, Authors, and Affiliation pages carefully; instructions are provided in the Sequin help documentation. Then return to the Submission tab, choose File -> Export Submitter info, and save the file as template.sbt.


Formatting Sequences in FASTA Format

Single, unsegmented sequences should be in standard FASTA format. Segmented sequences for phase 0, 1, or 2 HTGS should be in a modified FASTA format such as this:

>P74A8 [chromosome=2] [clone=ABC12345]
gatcagcccaaagcattgattaggggaacttacctgtagagggctgcagcaatggggaac
acctggctgggtcacagagtggtcaatgcactccatgacttttgggtcaggacacagaaa
gaaagagcggggaaccggggggccctacagtgatgaattatactaactgattttagaatg
>segment2
ttaaacaaacattgcatttccagaataaaccccatttagtaacgcatagtgtgcttgtat
ctcagcctcccaaagtgctgggattatagacatgagccagcgcacctggctttgttagcc
>segment3
ttttcaaataactttttgaactttgttaattttttaattgcacgttttctccttcattta
ctaattccattcaaaagtagcatcaatgagaataaattacttaggaatacatttaattaa
aaagtgctagacttgtacactgaaaattacaaagtactctggagatatattc

The first line contains the sequence identifier (SeqID, or sequence name), P74A8, followed by the source information. Each subsequent segment is introduced by a separator line of the form:

>segmentx

Each separator line must be unique among all the lines of FASTA-formatted sequence being processed (for example, ">segment2", ">segment3").
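The layout above can be generated programmatically. The following is a minimal Python sketch, not part of tbl2asn; the function name format_segmented_fasta, its parameters, and the 60-character line width are assumptions chosen to match the example shown earlier.

```python
def format_segmented_fasta(seq_id, qualifiers, segments, width=60):
    """Return modified-FASTA text for one segmented HTG sequence.

    seq_id     -- the SeqID placed on the first definition line
    qualifiers -- dict of source qualifiers, e.g. {"chromosome": "2"}
    segments   -- list of sequence strings, one per segment
    width      -- residues per output line (60 is an assumption)
    """
    if not segments:
        raise ValueError("at least one segment is required")
    # First definition line: SeqID followed by [key=value] qualifiers.
    quals = " ".join(f"[{key}={value}]" for key, value in qualifiers.items())
    lines = [f">{seq_id} {quals}".rstrip()]
    for number, sequence in enumerate(segments, start=1):
        if number > 1:
            # Each later segment gets its own unique separator line.
            lines.append(f">segment{number}")
        # Break the sequence into fixed-width lines.
        for start in range(0, len(sequence), width):
            lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"
```

Numbering the separator lines from the segment position keeps each of them unique within one record.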


Formatting Quality Scores in FASTA Format

Because the information is useful to database users, please include the quality scores of unfinished (phase 0, 1 or 2) HTG records in the submission. If you are using .fsa files to include the sequences, then tbl2asn will include the associated CONSED/PHRAP quality scores in the output file if they are in a file named and formatted as described on the tbl2asn page. The instructions are:

Put the scores in a .qvl file whose basename matches the fasta (.fsa) file's basename, and whose definition line has the same identifier (SeqID) as the corresponding .fsa file. Put this file in the same directory as the .fsa file.

To account for gaps of unknown size (the default), include 100 zeroes between the contigs' scores.
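As an illustration of the two rules above (matching basename, and 100 zeroes per gap), here is a minimal Python sketch. It is not part of tbl2asn. The function names are hypothetical, and the output layout (space-separated scores, 20 per line) is an assumption; the tbl2asn page remains the authority on the exact .qvl format.

```python
from pathlib import Path

GAP_SCORE_COUNT = 100  # zeroes inserted per gap of unknown size


def merge_contig_scores(contig_scores):
    """Flatten per-contig score lists, putting 100 zeroes between contigs."""
    merged = []
    for index, scores in enumerate(contig_scores):
        if index > 0:
            merged.extend([0] * GAP_SCORE_COUNT)
        merged.extend(scores)
    return merged


def qvl_path_for(fsa_path):
    """Return the .qvl path in the same directory, with the same basename."""
    return Path(fsa_path).with_suffix(".qvl")


def format_qvl(seq_id, contig_scores, per_line=20):
    """Render a definition line plus the merged scores (layout assumed)."""
    scores = merge_contig_scores(contig_scores)
    rows = [
        " ".join(str(score) for score in scores[start:start + per_line])
        for start in range(0, len(scores), per_line)
    ]
    # The definition line carries the same SeqID as the .fsa file.
    return "\n".join([f">{seq_id}"] + rows) + "\n"
```

With this sketch, the scores for gap.fsa would be written to the path returned by qvl_path_for("gap.fsa").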

Formatting Ace Sequences from Phrap

Alternatively, you can import sequence and quality scores in the .ace file format, which is an output of Phrap. This format is not described here.


Including Source Information

The source information must be included in the FASTA file definition line, the tbl2asn commandline, or both. All of the source information, including the HTG phase, can be included in the FASTA file definition line. Alternatively, information common to all records, such as the organism name, can be supplied on the commandline, with only the record-specific information included in the FASTA file definition line. If the same qualifier is present in both places, the information in the FASTA definition line will be used. The source qualifiers are described on the tbl2asn page.

The HTG phase is included as:

  • [tech=htgs 0]
  • [tech=htgs 1]
  • [tech=htgs 2]
  • [tech=htgs 3]

Keywords, such as HTGS_DRAFT, are included as:

  • [keyword=HTGS_DRAFT]
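The precedence rule described above (the FASTA definition line wins over the commandline) can be pictured with a short Python sketch. This only models the documented behavior for checking your own inputs; it is not tbl2asn's code, the function names are hypothetical, and treating qualifier names as case-sensitive is an assumption.

```python
import re

# Matches one [key=value] qualifier; values may contain spaces.
QUALIFIER_PATTERN = re.compile(r"\[([^=\[\]]+)=([^\[\]]*)\]")


def parse_qualifiers(text):
    """Extract [key=value] pairs from a definition line or a -j string."""
    return {
        key.strip(): value.strip()
        for key, value in QUALIFIER_PATTERN.findall(text)
    }


def effective_qualifiers(commandline_text, defline_text):
    """Combine both sources; definition-line values override the commandline."""
    merged = parse_qualifiers(commandline_text)
    merged.update(parse_qualifiers(defline_text))
    return merged
```

For example, if the commandline string contains [tech=htgs 2] and the definition line contains [tech=htgs 1], the sketch reports "htgs 1" for that record.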

Creating the HTG Record

A basic commandline for a new phase 2 submission is:

  • tbl2asn -a di -i gap.fsa -t template.sbt -C center_name -j "[tech=htgs 2] [organism=Solanum lycopersicum] [cultivar=Heinz 1706]" -Y comment_file -o gap.ss

A commandline for an update must include the accession number:

  • tbl2asn -a di -i gap.fsa -t template.sbt -C center_name -A accession_number -j "[tech=htgs 2] [organism=Solanum lycopersicum] [cultivar=Heinz 1706]" -Y comment_file -o gap.ss

In both cases the chromosome and clone names would be included in the definition line of the .fsa file.
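When submissions are scripted, the two commandlines above can be assembled from one helper. The following Python sketch is an illustration only; the function build_tbl2asn_args and its parameter names are hypothetical, and it uses only the arguments shown in the examples above.

```python
def build_tbl2asn_args(fsa_file, template, center, qualifiers,
                       comment_file=None, output=None, accession=None):
    """Build an argument list for a new submission or an update.

    Passing accession switches the command to the update form by
    adding the -A argument.
    """
    args = ["tbl2asn", "-a", "di", "-i", fsa_file,
            "-t", template, "-C", center]
    if accession is not None:
        # Updates must include the accession number.
        args += ["-A", accession]
    # The qualifier string stays one argument, so no shell quoting is needed.
    args += ["-j", qualifiers]
    if comment_file is not None:
        args += ["-Y", comment_file]
    if output is not None:
        args += ["-o", output]
    return args
```

The returned list can be passed to subprocess.run(args, check=True) on a system where tbl2asn is installed and on the PATH. Building a list instead of a single string avoids quoting problems with the spaces inside the -j value.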

Type "tbl2asn -" to see the program's command line arguments. Note that several command line arguments were changed in version 10.0 of tbl2asn, to make it more flexible and expandable. Below, we list some arguments along with additional comments.

tbl2asn 10.1 arguments:

-i Filename for .fsa FASTA input [File In]

-t Filename for Seq-submit template [File In]

-p Path to Files [String] Optional

Use this argument to run tbl2asn on all the files with a common suffix in a directory.

-x Suffix [String] Optional

default = .fsa

-o Filename for ASN.1 output [File Out] Optional

default = basename.sqn
The convention until now has been to name the output file clone_name.ss (e.g., P74A8.ss), but using tbl2asn's default name is fine. Our processing scripts report results using the same file name that was used in the submission. Because the files are processed on Unix, file names are case-sensitive. Also avoid shell metacharacters (such as ^, *, /, and \) in file names.

-C Genome Center tag [String]

This is the same as the login name on the NCBI FTP server.

-j Source Qualifiers [String] Optional

Include multiple qualifiers between double quotes. The source qualifiers can be included in the FASTA definition line instead of or in addition to using -j. If the same qualifier is in the FASTA definition line and the -j argument, tbl2asn will use the information in the FASTA file.

-n Organism name [String] Optional

This can be included in the FASTA definition line or with the -j argument instead.

-Y Filename for the comment: [File In] Optional

Will read the comment from a given file. Maximum 100 characters per line. New lines can be incorporated with "~"; if you actually want to include the "~" in your text, you need to escape it with "`". Please ensure that the correct format is obtained by viewing your comment in Sequin.
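The comment rules above can be checked before running tbl2asn. The Python sketch below is an illustration under stated assumptions: the function name encode_comment is hypothetical, the escape is assumed to be a backquote placed directly before the "~", and each logical line is written as one physical line of the file. Viewing the result in Sequin, as recommended above, is still the real test.

```python
MAX_LINE_LENGTH = 100  # maximum characters per line of the comment file


def encode_comment(logical_lines):
    """Return comment-file text for the given list of comment lines."""
    encoded = []
    last = len(logical_lines) - 1
    for index, line in enumerate(logical_lines):
        # Escape literal tildes (assumed form: backquote before the tilde).
        text = line.replace("~", "`~")
        if index < last:
            # An unescaped "~" marks a new line in the displayed comment.
            text += "~"
        if len(text) > MAX_LINE_LENGTH:
            raise ValueError(
                f"line {index + 1} is {len(text)} characters; "
                f"the limit is {MAX_LINE_LENGTH}"
            )
        encoded.append(text)
    return "\n".join(encoded) + "\n"
```

The length check is applied after escaping, because the escape characters and the trailing "~" count toward the 100-character limit in this sketch.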

-y Comment [String] Optional

Will read a short comment from the commandline.

-V Verification (combine any of the following letters) Optional

  • v Validate. Will generate a file basename.val for each record. This is especially important for phase 3 records where many annotations may be added, as described on the tbl2asn page, or for DRAFT entries that include quality scores. Taxonomy-related errors can generally be ignored.
  • b Will generate a GenBank flatfile.

-a Specifies the File type.

  • di Will read the FASTA file as delta sequences, with implicit gaps (gaps of unknown size, represented by 100 Ns) between contigs. Use this argument when the input is a FASTA file.
  • e Filename for Phrap/ACE input. Using this argument implies that you are NOT using the -i above.

-A Accession [String] Optional

Required in an update, to include the accession number.

After Using tbl2asn

When you are finished with the submission, deposit it on your FTP account under the "SEQSUBMIT" directory. Our software will look for it there every day, validate the center and sequence name ids, check whether the record is an update, and write a report that you can pick up the next day. Further information about how HTG records are processed is available from http://www.ncbi.nlm.nih.gov/HTGS/processing.html.

Revised: February 13, 2008.
