Finally, there are also format variants which have been added to account for limitations of the FASTA, NBRF and IG/Stanford formats as they are commonly used. For FASTA and IG/Stanford, the limitation is that only one header line (any line beginning with a '>' or ';') may appear in the entry. For NBRF, the limitation is that no lines like "C;Accession:" or "C;Comment:" may appear after the sequence. The formats below have a different output function, which outputs entries in these limited forms (at the cost of losing some information about the sequences). Thus, the package can output entries that are readable by other programs which require the limited format.
(NOTE: Why is having someplace to store comments so important? Well, one of the goals of this package is to try to unify all of the file formats and be able to capture and transfer as much information as possible from one format to another. The plan is to use these comment sections as the place to store any extra information for which there is no explicit spot in the entry. And that can't happen if the file format doesn't have a comment section. This is also the reason for the FASTA, NBRF and IG/Stanford variants mentioned above.)
GenBank - "LOCUS ", "GB???.SEQ Genetic Sequence Data Bank"
NBRF - ">??;"
FASTA - ">"
EMBL - "ID ", "CC ", "XX "
PIR - "\\\", "ENTRY", "P R O T E I N S E Q U E N C E D A T A B A S E"
IG/Stanford - ";"
ASN.1 - "Bioseq-set ::= {", "Seq-set ::= {"
FASTA-out - "FASTA", "TFASTA", "SSEARCH", "LFASTA", "LALIGN", "ALIGN"
PHYLIP - "0", "1", "2", "3", "4", "5", "6", "7", "8", "9"
Clustalw - "CLUSTAL"
MSF - "PileUp"
BLAST-out - "BLASTN", "BLASTP", "BLASTX"

The keyword matching occurs in the order specified here, and the first matching keyword specifies the file format. So, for NBRF and FASTA files, if the first entry's header line has a ';' as the third character after the initial '>', the file format is taken to be NBRF. Files without that semi-colon are taken to be in FASTA format.
If there's a match, then the file format has been determined. Otherwise, the file's format is considered to be `Plain' at this point.
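The first-match rule can be sketched as follows. This is only an illustration, not the package's code: the table is abbreviated, the function names are made up, and it assumes that keywords are compared against the start of a line and that '?' stands for any single character.

```python
# Abbreviated keyword table, in the order given in the text; first match wins.
# Assumption: '?' in a keyword matches any single character.
FORMAT_KEYWORDS = [
    ("GenBank", ["LOCUS "]),
    ("NBRF", [">??;"]),
    ("FASTA", [">"]),
    ("EMBL", ["ID "]),
    ("PIR", ["ENTRY"]),
    ("IG/Stanford", [";"]),
    ("ASN.1", ["Bioseq-set ::= {", "Seq-set ::= {"]),
    ("Clustalw", ["CLUSTAL"]),
]


def matches(line, pattern):
    """True if line starts with pattern ('?' matches any one character)."""
    if len(line) < len(pattern):
        return False
    return all(p == "?" or p == c for p, c in zip(pattern, line))


def detect_format(line):
    """Return the first format whose keyword matches, else 'Plain'."""
    for name, patterns in FORMAT_KEYWORDS:
        if any(matches(line, p) for p in patterns):
            return name
    return "Plain"
```

Because NBRF is tested before FASTA, a header such as ">DL;gb:A14666" is classified as NBRF, while ">gb:A14666|acc:A14666 ..." falls through to FASTA.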
In addition, if the string occurring before the last semi-colon on the "ID " line is "EPD", then the entry identifier is taken to be an EPD database identifier, but the entry itself is still considered to be an EMBL formatted entry.
This prettying operation can be turned off or turned on for all sequences using the function `seqfsetpretty'.
The read operation simply reads the whole file. The getseq and rawseq operations return that text. The getinfo operation merely stores the filename in the description field. The putseq operation just outputs the sequence characters. And there is no annotate operation.
The read operation reads in the whole file. The getseq operation extracts all of the alphabetic characters from the text. The rawseq operation extracts all of the non-whitespace and non-numeric characters from the text. The getinfo operation stores the filename in the description field.
The putseq operation outputs the sequence in one of two formats, depending on the sequence's alphabet. If the alphabet is DNA, RNA or Protein, or the alphabet is Unknown but does not contain newline characters, the sequence is output 60 sequence characters per line, with interspersed spaces to improve the look of the output. If the alphabet is Unknown and it contains newline characters, then it is output as is.
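A minimal sketch of the two output cases follows. The text only says that spaces are interspersed; the block size of 10 used here is an assumption, as are the function and parameter names.

```python
def format_sequence(seq, alphabet, per_line=60, block=10):
    """Wrap a sequence at per_line characters, in space-separated blocks.

    Assumption: blocks of 10 characters; the text does not give the spacing.
    """
    # An Unknown-alphabet sequence with newlines is output as is.
    if alphabet == "Unknown" and "\n" in seq:
        return seq
    lines = []
    for i in range(0, len(seq), per_line):
        chunk = seq[i:i + per_line]
        blocks = [chunk[j:j + block] for j in range(0, len(chunk), block)]
        lines.append(" ".join(blocks))
    return "\n".join(lines) + "\n"
```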
The getseq operation scans the sequence lines, from just after the "ORIGIN" line to the "//" line. All alphabetic characters there are assumed to be part of the sequence. No assumptions are made about the format of these lines.
The rawseq operation is the same as the getseq operation, except that all non-whitespace and non-numeric characters are considered part of the sequence.
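The difference between the two operations can be illustrated with a short sketch. The function names are made up and the entry fragment in the usage note is invented for the example; only the scanning rules come from the text.

```python
def sequence_lines(entry):
    """Return the lines after the ORIGIN line and before the // line."""
    lines = entry.splitlines()
    start = next(i for i, ln in enumerate(lines) if ln.startswith("ORIGIN")) + 1
    end = next(i for i in range(start, len(lines)) if lines[i].startswith("//"))
    return lines[start:end]


def getseq(entry):
    """Keep only alphabetic characters."""
    return "".join(c for ln in sequence_lines(entry) for c in ln if c.isalpha())


def rawseq(entry):
    """Keep everything except whitespace and digits."""
    return "".join(
        c
        for ln in sequence_lines(entry)
        for c in ln
        if not c.isspace() and not c.isdigit()
    )
```

For an entry whose sequence lines are "1 tgatcaccta tctc-tttac" and "21 aaca*", getseq drops the '-' and '*' while rawseq keeps them.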
The getinfo operation looks first at the "LOCUS" line. It takes the identifier from positions 13-22 (and assumes it's a GenBank id, unless marked by an identifier prefix), the alphabet determination from positions 37-40, whether it's circular from the existence of the keyword "circular" at positions 43-52, and the date from positions 63-73. Then, it looks for the "ACCESSION", "NID", "PID", "DEFINITION", "COMMENT" and "SOURCE" lines, where `lines' here mean one or more text lines corresponding to that part of the entry and where the lines can appear in any order. Accession numbers, NID numbers and PID numbers are extracted from the "ACCESSION", "NID" and "PID" lines, respectively. The description is taken from the "DEFINITION" line. Comments are retrieved from the "COMMENT" line. The organism name is taken from the "ORGANISM" sub-record of the "SOURCE" line. The getinfo operation cannot determine the value of the isfragment field (since that is not explicitly given anywhere in the entry).
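The fixed-column reads on the "LOCUS" line can be sketched with string slices. The positions in the text are 1-based and inclusive, so positions 13-22 become the slice [12:22]. The helper make_locus is hypothetical and exists only to build a test line; the columns it uses for the length and division fields are assumptions, not taken from the text.

```python
def parse_locus(line):
    """Read the LOCUS fields at the column positions given in the text."""
    line = line.ljust(73)  # pad short lines so the slices are safe
    return {
        "id": line[12:22].strip(),          # positions 13-22
        "alphabet": line[36:40].strip(),    # positions 37-40
        # Lower-casing before the comparison is an assumption.
        "iscircular": 1 if "circular" in line[42:52].lower() else 0,  # 43-52
        "date": line[62:73].strip(),        # positions 63-73
    }


def make_locus(name, length, mol, topology, division, date):
    """Hypothetical helper: lay out a LOCUS line for testing."""
    return (
        "LOCUS".ljust(12)        # columns 1-12
        + name.ljust(10)         # columns 13-22
        + str(length).rjust(7)   # columns 23-29 (assumed)
        + " bp "                 # columns 30-33 (assumed)
        + "   "                  # columns 34-36 (assumed)
        + mol.ljust(4)           # columns 37-40
        + "  "                   # columns 41-42
        + topology.ljust(10)     # columns 43-52
        + division.ljust(3)      # columns 53-55 (assumed)
        + " " * 7                # columns 56-62
        + date                   # columns 63-73
    )
```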
The putseq operation outputs an entry with the following lines (in order): LOCUS, DEFINITION, ACCESSION, NID, SOURCE/ORGANISM, COMMENT, BASE COUNT, ORIGIN, sequence lines, //. The form of these lines follows that described in the GenBank Release Notes, with the following exceptions:
Example GenBank entry:
LOCUS       A02201        664 bp    DNA             UNC       10-MAR-1993
DEFINITION  Phage phi-105 DNA for immF plypeptide.
ACCESSION   A02201
SOURCE      .
  ORGANISM  Bacteriophage phi-105
COMMENT     NCBI gi: 345121
            SEQIO retrieval from GenBank database entry. 07-Feb-1996
BASE COUNT      237 a    111 c    144 g    172 t
ORIGIN
        1 tgatcaccta tctcctttac aacacatagt gcctcactgt gccactgtgt cttgtggcat
       61 gacacaatta tagtatccga atgtcggaaa tacaatacta aaaaagacgg aaatacaagt
      121 attttttagt aaattgacgg aaatacaaga taaatactct ctgaatcttt aaaatgcttg
      181 aatttcgtca aatttcgact tttacaaaat gtcgtgaata ccatacaatt tagacatacc
      241 ttaacgggag gtgataatca tgctggatgg gaaaaagctt ggggctttaa ttaaggacaa
      301 aagaaaagaa aagcacttga aacagacaga aatggcgaag gcactgggta tgtccagaac
      361 ttatctctct gatatcgaaa acggcagata tctgccgagt acaaaaacac tttccagaat
      421 agcgatttta ataaatctgg atttaaatgt gttaaaaatg acggaaatac aagtagttga
      481 ggagggtgga tatgatagag ctgccggcac atgtagaaga caggctttat gagattttta
      541 tgaaactatc agttccaagg ttgcttgaga aagaagccct ggagaaagga gagaagccga
      601 atgcggaaag aaaaggcgct tgacctcgcg gccttcttcg ctgaatttga acaaatgatg
      661 atca
//
The getseq operation assumes that the sequence lines are in the format described in the previous paragraph, and all of the characters in the correct positions in that format are assumed to be characters of the sequence. So, if the line format is incorrect, you will get garbage as the sequence.
The rawseq operation here is exactly the same as the getseq operation, since the GenBank sequences don't contain other characters.
The getinfo, putseq and annotate functions are the same as in the GenBank format.
The getseq operation scans the sequence lines from just after the "SEQUENCE" line to the "///" line ending the entry. All alphabetic characters on those lines are assumed to be in the sequence. No format for those lines is assumed.
The rawseq operation is the same as the getseq operation, except that all non-whitespace and non-numeric characters are considered part of the sequence.
The getinfo operation first looks at the "ENTRY" line. The next word (i.e., non-whitespace string) after the "ENTRY" keyword is taken for an identifier, and then the rest of the line is searched for a "#type" option. If the word after "#type" is "fragment", the isfragment field is set to 1. Then, the entry is searched for the "ACCESSIONS", "COMMENT", "DATE", "ORGANISM" and "TITLE" lines, which can appear in any order. The "ACCESSIONS" line holds accession numbers (and the search for the "ACCESSIONS" line will also find lines beginning with just "ACCESSION", for backward compatibility). The "COMMENT" lines hold comments. The "DATE" line holds the date, and the date taken is the last given on the line, with the assumption being that the dates on the line are specified from oldest to newest (not absolutely accurate, but handling dates better is on my TODO list). The "TITLE" line holds the description, an optional organism name and possibly one of the keywords "(fragment)", "(fragments)" or "(tentative sequence)". The text before the string " - " is taken for the description, and the rest of the text, except for a trailing keyword, is taken for the organism name. If the keywords "(fragment)" or "(fragments)" appear at the end of the string, isfragment is set to 1. If "(tentative sequence)" appears, it is considered part of the description. The "ORGANISM" line holds an organism name which is taken if the "TITLE" line does not specify an organism.
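The "TITLE" line rules can be sketched as below. This is an illustration under stated assumptions: the function name is made up, the split is made at the first " - ", and the "(tentative sequence)" case is left out because the text does not say where that keyword sits on the line.

```python
FRAGMENT_KEYWORDS = ("(fragment)", "(fragments)")


def parse_title(title):
    """Split a TITLE string into (description, organism, isfragment)."""
    text = title.strip()
    isfragment = 0
    # A trailing fragment keyword sets isfragment and is removed.
    for keyword in FRAGMENT_KEYWORDS:
        if text.endswith(keyword):
            isfragment = 1
            text = text[: -len(keyword)].rstrip()
            break
    # Assumption: the first " - " separates description from organism.
    description, separator, organism = text.partition(" - ")
    return description.strip(), (organism.strip() if separator else ""), isfragment
```

When the organism part is empty, a caller would fall back to the "ORGANISM" line, as the text describes.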
The putseq operation outputs a PIR entry containing the following lines (in order): ENTRY, TITLE, ORGANISM, DATE, ACCESSIONS, COMMENT, SUMMARY, SEQUENCE, sequence lines, ///. The format of those lines follows the PIR Release Notes, with the following exceptions:
Example PIR entry:
ENTRY           CCMST      #type complete
TITLE           cytochrome c, testis-specific - mouse
ORGANISM        #formal_name mouse
DATE            04-Nov-1994
ACCESSIONS      B28160; A00012
COMMENT         Mammalian testis contains two forms of cytochrome c, one
                identical with the form found in somatic tissues and
                another that is expressed in a stage-specific manner
                during spermatogenic differentiation.
                SEQIO retrieval from PIR database entry. 07-Feb-1996
SUMMARY         #length 105
SEQUENCE
                5        10        15        20        25        30
      1 M G D A E A G K K I F V Q K C A Q C H T V E K G G K H K T G
     31 P N L W G L F G R K T G Q A P G F S Y T D A N K N K G V I W
     61 S E E T L M E Y L E N P K K Y I P G T K M I F A G I K K K S
     91 E R E D L I K Y L K Q A T S S
///
The getseq operation assumes that the sequence lines are in the format described in the previous paragraph, and all of the characters in the correct positions in that format are assumed to be characters of the sequence. So, if the line format is incorrect, you will get garbage as the sequence.
The rawseq operation here does not use the "fast" implementation, but uses the rawseq operation of the basic PIR format.
The getinfo, putseq and annotate functions are the same as in the PIR format.
NOTE: The EMBL and Swiss-Prot file format implementations are essentially the same, differing only in their putseq and annotate operations. So, we'll describe them together.

NOTE2: The EMBL read, getseq and getinfo implementations have been tested on, and are compatible with, the "EMBL" entries in the EMBL, EPD, aids-db, ENZYME, PROSITE and Swiss-Prot databases. Because of the variations of the entries in these databases, some of the assumptions made in the implementations will differ from the official EMBL or Swiss-Prot file format descriptions.

The read operation first looks for an "ID " line. It then looks for the entry ending "//" line, but during this scan it also looks for an "SQ " line and a line beginning with two spaces. If the "SQ " line is found and the next word after "SQ Sequence" consists of digits, it is taken for the sequence length. The first line beginning with two spaces is assumed to be the beginning of the sequence lines, and if no such lines appear, the entry is assumed to contain no sequence.
The getseq operation scans the sequence lines from the first line beginning with two spaces to the "//" line ending the entry. All alphabetic characters on those lines are assumed to be in the sequence. No format for those lines is assumed.
The rawseq operation is the same as the getseq operation, except that all non-whitespace and non-numeric characters are considered part of the sequence.
The getinfo operation first looks at the "ID " line. The next word (i.e., non-whitespace string) after the "ID" keyword is taken for an identifier, and an attempt is made to determine if it is an EMBL id, an EPD id, a Swiss-Prot id, or something else. It does this by counting the number of semi-colons on the line and checking whether the line ends with a period. If three semi-colons and a period are found, then the string just before the third semi-colon is checked, and the identifier is assumed to be an EPD id if that string is "EPD" and is assumed to be an EMBL id otherwise. If two semi-colons and a period are found, and the string just before the second semi-colon is "PRT", the identifier is assumed to be a Swiss-Prot id. Otherwise, the identifier is some other id. After figuring out the type of identifier and extracting it from the line, the rest of the line is searched for words that specify the alphabet ("DNA", "RNA", "PRT", and so on) and whether the sequence is circular ("circular").
Then the rest of the entry is searched for the "AC ", "NI ", "PI ", "DT ", "DE ", "OS ", "CC " and "XX " lines, which can appear in any order. The "AC ", "NI " and "PI " lines contain accession, NID and PID numbers. The "DT " lines contain dates, of which the date on the last "DT " line is taken, under the assumption that the dates are given from oldest to newest. The "DE " lines contain the description, and may end with one of the keywords "(fragment)" or "(fragments)", in which case isfragment is set to 1. The "OS " lines specify the organism name. The "CC " and "XX " lines specify the comment lines, about which there are a couple of things to note. First, an "XX " line is different from any line beginning with "XX", in that three spaces must appear after the "XX" and non-whitespace text must appear after that, in order for it to be considered a comment line. These lines do not occur in the official EMBL or Swiss-Prot formats, but do appear in some of the variations. Second, more than one comment section can appear in an entry. When a "CC " line is reached, the comment section beginning at that line is assumed to consist of all "CC " and "XX" lines (note the lack of spaces after the "XX") following that line, up to the first line not beginning with "CC" or "XX" (and ignoring a trailing "XX" line). When an "XX " line is seen, all following "XX " lines are considered part of that comment section. The text for these sections is concatenated together to make up the comment lines.
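The semi-colon counting rule for the "ID " line can be sketched as follows. The function names are made up, and taking "the string just before" a semi-colon to mean the last whitespace-separated word before it is an assumption.

```python
def last_word(field):
    """Last whitespace-separated word of a field, or '' if there is none."""
    words = field.split()
    return words[-1] if words else ""


def classify_id_line(line):
    """Classify an ID line as 'EMBL', 'EPD', 'Swiss-Prot' or 'other'."""
    ends_with_period = line.rstrip().endswith(".")
    fields = line.split(";")
    semicolons = len(fields) - 1
    if ends_with_period and semicolons == 3:
        # fields[2] is the text between the second and third semi-colons.
        return "EPD" if last_word(fields[2]) == "EPD" else "EMBL"
    if ends_with_period and semicolons == 2 and last_word(fields[1]) == "PRT":
        return "Swiss-Prot"
    return "other"
```

Applied to the "ID" lines of the two example entries in this section, the sketch yields "EMBL" for the CM23SRIBR line and "Swiss-Prot" for the 104K_THEPA line.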
For the EMBL format, the putseq operation outputs an EMBL entry containing the following lines (in order): ID, AC, NI, DT, DE, OS, CC, SQ, sequence lines, //. In the output, XX lines are added between each of the lines (except the sequence lines) as specified in the EMBL format. The format of the lines follows the EMBL Release Notes, with the following exceptions:
For the Swiss-Prot format, the annotate operation replaces or appends to the "CC " lines, if they exist. If no comment section exists, then a new comment section will be inserted (or rather output between the existing lines of the entry) as follows. If a "DR ", "KW " or "FT " line appears in the entry, the comment is inserted just before the first of those lines. Otherwise, the comment is inserted just before the "SQ " or sequence lines. One of these lines must appear in the entry.
Example EMBL entry:
ID   CM23SRIBR  converted; DNA; UNC; 805 BP.
XX
AC   X80636;
XX
DT   22-MAR-1995
XX
DE   C.mucosalis gene for 23S ribosomal RNA (fragment)
XX
OS   Campylobacter mucosalis
XX
CC   SEQIO retrieval from EMBL-format entry. 07-Feb-1996
XX
SQ   Sequence 805 BP; 226 A; 158 C; 224 G; 194 T; 3 other;
     gattctgcgc ggaaaatata acggggctaa aatgagtacc gaagctttag acttagtttt        60
     actaagtggt aggagcgttc tattcagcgt tgaaggtgta ccggtaagga gcgctggagc       120
     ggatagaagt gagcatgcag gcatgagtag cgataattgg ggtgagaatc cccaacgccg       180
     taarcccaag gtttcctacg cgatgctcgt catcgtaggg ttagccgggt cctaagcaaa       240
     gtccgaaagg ggtatgcgat ggaaaattgg ttaatattcc aatgccaaca ttattgtgcg       300
     atggaaggac gcttagagtt aaaggagcca gctgatggaa gtgctggtcg aaaggtgtag       360
     gttgagttac aggcaaatcc gtaactcttt atccgagacc ccacaggcgt ttgaagttct       420
     tcggaatgga tgacgaatcc ttgatactgt cgagccaaga aaagtttcta agtttagata       480
     atgttgcccg taccgtaaac cgacacaggt gggtgggatg agtattctaa ggcgcgtgga       540
     agaactctct tcaaggaact ctgcaaaata gcaccgtatc ttcggtataa ggtgtgccta       600
     actttgtgaa ggatttactc cgtaagcatt gaaggttaca acaaagagtc cctcccgact       660
     gtttaccaaa aacacagcac tctgctaact cgtaagagga tgtatagggt gtgacgcctg       720
     cccggtgctc gaaggttaat tgatggggty agcagyaatg cgaagctctt gatcgaagcc       780
     cgagtaaacg gccgccgtaa ctata                                             805
//

Example Swiss-Prot entry:
ID   104K_THEPA     CONVERTED;  PRT;   924 AA.
AC   P15711;
DT   01-AUG-1992
DE   104 KD MICRONEME-RHOPTRY ANTIGEN.
OS   THEILERIA PARVA.
CC   -!- DEVELOPMENTAL STAGE: SPOROZOITE ANTIGEN.
CC   -!- SUBCELLULAR LOCATION: IN MICRONEME/RHOPTRY COMPLEXES.
CC
CC   SEQIO retrieval from Swiss-Prot database entry. 07-Feb-1996
SQ   SEQUENCE   924 AA;
     MKFLILLFNI LCLFPVLAAD NHGVGPQGAS GVDPITFDIN SNQTGPAFLT AVEMAGVKYL
     QVQHGSNVNI HRLVEGNVVI WENASTPLYT GAIVTNNDGP YMAYVEVLGD PNLQFFIKSG
     DAWVTLSEHE YLAKLQEIRQ AVHIESVFSL NMAFQLENNK YEVETHAKNG ANMVTFIPRN
     GHICKMVYHK NVRIYKATGN DTVTSVVGFF RGLRLLLINV FSIDDNGMMS NRYFQHVDDK
     YVPISQKNYE TGIVKLKDYK HAYHPVDLDI KDIDYTMFHL ADATYHEPCF KIIPNTGFCI
     TKLFDGDQVL YESFNPLIHC INEVHIYDRN NGSIICLHLN YSPPSYKAYL VLKDTGWEAT
     THPLLEEKIE ELQDQRACEL DVNFISDKDL YVAALTNADL NYTMVTPRPH RDVIRVSDGS
     EVLWYYEGLD NFLVCAWIYV SDGVASLVHL RIKDRIPANN DIYVLKGDLY WTRITKIQFT
     QEIKRLVKKS KKKLAPITEE DSDKHDEPPE GPGASGLPPK APGDKEGSEG HKGPSKGSDS
     SKEGKKPGSG KKPGPAREHK PSKIPTLSKK PSGPKDPKHP RDPKEPRKSK SPRTASPTRR
     PSPKLPQLSK LPKSTSPRSP PPPTRPSSPE RPEGTKIIKT SKPPSPKPPF DPSFKEKFYD
     DYSKAASRSK ETKTTVVLDE SFESILKETL PETPGTPFTT PRPVPPKRPR TPESPFEPPK
     DPDSPSTSPS EFFTPPESKR TRFHETPADT PLPDVTAELF KEPDVTAETK SPDEAMKRPR
     SPSEYEDTSP GDYPSLPMKR HRLERLRLTT TEMETDPGRM AKDASGKPVK LKRSKSFDDL
     TTVELAPEPK ASRIVVDDEG TEADDEETHP PEERQKTEVR RRRPPKKPSK SPRPSKPKKP
     KKPDSAYIPS ILAILVVSLI VGIL
//
The getseq operation assumes that the sequence lines are in the format described in the previous paragraph, and all of the characters in the correct positions in that format are assumed to be characters of the sequence. So, if the line format is incorrect, you will get garbage as the sequence.
The rawseq operation here is exactly the same as the getseq operation, since the EMBL and Swiss-Prot sequences don't contain other characters.
The getinfo, putseq and annotate functions are the same as in the EMBL/Swiss-Prot format.
NOTE: The implementation of the FASTA format here follows the format described in the FASTA program documentation, with the exception that, at the beginning of the entry, multiple lines beginning with either '>' or ';' can appear. This was done in order to better distinguish the entry's header lines from the sequence lines (where comments beginning with ';' are permitted). This exception only occurs when reading FASTA entries. The FASTA output functions only use ';' for those additional header lines.

The read operation looks for a line beginning with '>'. That line is taken as the header/description line for the entry. If that line has been formatted using the standard one-line description format (see file "user.doc"), then the sequence length is extracted from that line. The operation then looks for the next line which does not begin with a '>' and which does not begin with a ';'. If such a line occurs before the next line with a '>', that line is the first line of the sequence. Finally, the operation looks for the entry's end at either the next line which does begin with a '>' or the end of the file.
The getseq operation scans the sequence lines (all of the lines not beginning with '>'). All alphabetic characters on those lines are assumed to be in the sequence, except that when a semi-colon appears on a line, the rest of that line is considered a comment and not part of the sequence. No format for those lines is assumed.
The rawseq operation is the same as the getseq operation, except that all non-whitespace and non-numeric characters are considered part of the sequence.
The getinfo operation first looks at the first header line of the entry, and parses it according to the one-line description format specified in file "user.doc". It then considers any following lines that begin either with a '>' or a ';' as comment lines. Any other comments in the entry are ignored.
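The way a single FASTA entry is split into header, comment lines and sequence can be sketched as below. The function name is made up, the sketch assumes it is handed exactly one entry whose first line begins with '>', and it does not parse the one-line description format from "user.doc".

```python
def parse_fasta_entry(entry):
    """Split one FASTA entry into (header, comments, sequence)."""
    lines = entry.splitlines()
    header = lines[0][1:]  # first header line, without the leading '>'
    comments = []
    i = 1
    # Further lines starting with '>' or ';' are extra header/comment lines.
    while i < len(lines) and lines[i][:1] in (">", ";"):
        comments.append(lines[i][1:])
        i += 1
    sequence = []
    for line in lines[i:]:
        # Anything after a semi-colon on a sequence line is a comment.
        line = line.split(";", 1)[0]
        sequence.extend(c for c in line if c.isalpha())
    return header, comments, "".join(sequence)
```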
In the FASTA format, the putseq operation outputs a first header line according to the one-line description format. The comment/history lines and the sequence identifiers are output as additional header lines that begin with a ';'. Finally, the sequence is output.
In the FASTA-old format, the putseq operation only outputs the first header line and the sequence lines. No comment/history lines are output, and the identifiers appear in the header line.
In the FASTA format, the annotate operation either replaces, appends or inserts the comment lines just after the first header line. There is no annotate operation in the FASTA-old format.
Example FASTA entry:
>gb:A14666|acc:A14666 PRLB promoter - Bacteriophage lambda, 281 bp.
;
;NCBI gi: 579066
;
;SEQIO retrieval from GenBank database entry. 07-Feb-1996
gatcagctgc gacacaacta gtttacttac tcgcttatta aaccagaccc acaatctttt
acacagatac aatattttta gtggaaactt cttgacattt cggcccatga cctttactct
gttataaatt acttttatgg gggacgatca cactagcaaa ggagttacct aagccccgaa
tgttcaatgg gaagacttcc ccaatcatga cccacattac gggaccccaa gttgcggaga
agaaggcgat gtaaactgtc aaagcaatca cagagatgat c

Example FASTA-old entry:
>gb:A14666|acc:A14666 PRLB promoter - Bacteriophage lambda, 281 bp.
gatcagctgc gacacaacta gtttacttac tcgcttatta aaccagaccc acaatctttt
acacagatac aatattttta gtggaaactt cttgacattt cggcccatga cctttactct
gttataaatt acttttatgg gggacgatca cactagcaaa ggagttacct aagccccgaa
tgttcaatgg gaagacttcc ccaatcatga cccacattac gggaccccaa gttgcggaga
agaaggcgat gtaaactgtc aaagcaatca cagagatgat c
NOTE: The implementation of the NBRF format follows the format descriptions given in the release notes of the VMS version of the PIR database, with the following exceptions:

- An identifier list (with identifiers separated by '|') can appear after the ';' on the first line of the entry, and there is no limitation to the length of that identifier list.
- The second line of the entry is treated as a full one-line description (so it can contain more than just the description and organism name).
- The NBRF header lines (which occur after the sequence) are assumed to begin at the first line whose second character is a ';', and run until the end of the entry. So, the sequence lines cannot contain such a line (or the sequence will only be partially read).
- Every "C;Comment: " line in the header lines is assumed to contain a space between the "C;Comment:" and the comment text. This space (or whatever character appears there) is not considered part of the comment text.

The read operation first looks for a line beginning with '>', which contains a two-character code and database identifiers for the sequence. The next line, which should not begin with a '>', contains a one-line description of the sequence, and the operation attempts to extract the sequence length from that line. After that, the operation scans the sequence lines looking for the beginning of the header lines or the end of the entry. The header lines begin with the first line whose second character is ';', and they are not required to appear in an entry. The end of the entry is either the first line which begins with a '>', or the end of the file.
The getseq operation scans the sequence lines from just after the description line to either the first occurrence of a '*', the beginning of the header lines or the end of the entry. All alphabetic characters on those lines are assumed to be in the sequence. No format for those lines is assumed.
The rawseq operation is the same as the getseq operation, except that all non-whitespace and non-numeric characters are considered part of the sequence.
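The three stopping conditions for the NBRF sequence scan can be sketched as follows. The function name and the raw flag are made up for the illustration, and the sketch assumes it is handed one entry whose first two lines are the identification and description lines.

```python
def nbrf_getseq(entry, raw=False):
    """Extract the sequence from one NBRF entry.

    Stops at the first '*', at the first line whose second character is
    ';' (the start of the header lines), or at the end of the entry.
    """
    out = []
    for line in entry.splitlines()[2:]:  # skip the id and description lines
        if line[1:2] == ";":
            break
        body, star, _ = line.partition("*")
        for c in body:
            if raw:
                # rawseq rule: drop only whitespace and digits
                if not c.isspace() and not c.isdigit():
                    out.append(c)
            elif c.isalpha():
                out.append(c)
        if star:
            break
    return "".join(out)
```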
The getinfo operation first looks at the initial identification line. The format of that line is ">??;..." where "??" is a two character description and "..." is a list of identifiers. Six forms of the two character description are recognized
In the NBRF format, the putseq operation outputs an initial identification line of the appropriate form, containing one of the two character descriptions above (or "XX" if the alphabet is Unknown) and containing the list of identifiers in idlist. It then outputs a one-line description according to the one-line description format. The sequence is output and terminated with a '*'. Finally, the date, accession numbers and comments/history are output in lines beginning with "C;Date:", "C;Accession:" and "C;Comment:", respectively.
In the NBRF-old format, the putseq operation only outputs the initial identification line, the description line and the sequence lines. In addition, only one identifier is placed on the initial identification line, and if that identifier was not an accession number, the main accession number is added to the beginning of the description line.
For the NBRF format, the annotate operation replaces or appends the "C;Comment: " lines, if they exist. If no comment lines exist, then a new comment section will be inserted (or rather output between the existing lines of the entry) as follows. If a "C;Genetics:", "C;Complex:", "C;Function:", "C;Superfamily:", "C;Keywords:" or "F;" line appears in the entry, the comment is inserted just before the first of those lines. Otherwise, the comment is inserted at the end of the entry.
There is no annotate operation in the NBRF-old format.
Example NBRF entry:
>DL;gb:A14666
PRLB promoter - Bacteriophage lambda, 281 bp.
gatcagctgc gacacaacta gtttacttac tcgcttatta aaccagaccc acaatctttt
acacagatac aatattttta gtggaaactt cttgacattt cggcccatga cctttactct
gttataaatt acttttatgg gggacgatca cactagcaaa ggagttacct aagccccgaa
tgttcaatgg gaagacttcc ccaatcatga cccacattac gggaccccaa gttgcggaga
agaaggcgat gtaaactgtc aaagcaatca cagagatgat c*
C;Date: 18-AUG-1994
C;Accession: A14666
C;Comment: NCBI gi: 579066
C;Comment:
C;Comment: SEQIO retrieval from GenBank database entry. 23-Mar-1996

Example NBRF-old entry:
>DL;gb:A14666
~A14666 PRLB promoter - Bacteriophage lambda, 281 bp.
gatcagctgc gacacaacta gtttacttac tcgcttatta aaccagaccc acaatctttt
acacagatac aatattttta gtggaaactt cttgacattt cggcccatga cctttactct
gttataaatt acttttatgg gggacgatca cactagcaaa ggagttacct aagccccgaa
tgttcaatgg gaagacttcc ccaatcatga cccacattac gggaccccaa gttgcggaga
agaaggcgat gtaaactgtc aaagcaatca cagagatgat c*
The getseq operation scans the sequence lines from just after the description line until either the end of the entry is reached, or a '1' or a '2' appears. All alphabetic characters on those lines are assumed to be in the sequence. No format for those lines is assumed.
The rawseq operation is the same as the getseq operation, except that all non-whitespace and non-numeric characters are considered part of the sequence.
The getinfo operation first gets the comment lines at the beginning of the entry, and then parses the description line according to the one-line description format. Finally, it looks for a '1' or '2' at the end of the sequence, and sets iscircular to 0 or 1, respectively.
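The layout of an IG/Stanford entry (comment lines, description line, sequence, terminator) can be sketched as below. The function name is made up, and returning None for iscircular when no '1' or '2' is found is an assumption, since the text does not cover that case.

```python
def parse_ig_entry(entry):
    """Split one IG/Stanford entry into its parts.

    Returns (comments, description, sequence, iscircular).
    """
    lines = entry.splitlines()
    comments = []
    i = 0
    # Leading lines that start with ';' are the comment lines.
    while i < len(lines) and lines[i].startswith(";"):
        comments.append(lines[i][1:])
        i += 1
    description = lines[i]
    sequence = []
    iscircular = None  # assumption: None when no terminator is present
    for line in lines[i + 1:]:
        for c in line:
            if c in "12":
                # '1' sets iscircular to 0, '2' sets it to 1.
                iscircular = 0 if c == "1" else 1
                break
            if c.isalpha():
                sequence.append(c)
        if iscircular is not None:
            break
    return comments, description, "".join(sequence), iscircular
```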
In the IG/Stanford format, the putseq operation outputs any comment/history lines (or just the line ";\n" if there are no comment/history lines), a one-line description, the sequence and finally either a '1' or '2' depending on the value of iscircular.
In the IG-old/Stanford-old format, the putseq operation outputs the same text as in the IG/Stanford format except that exactly one comment/history line is output.
In the IG/Stanford format, the annotate operation either replaces, appends or inserts the comment lines at the beginning of the entry. There is no annotate operation in the IG-old/Stanford-old format.
Example IG/Stanford entry:
;NCBI gi: 579066
;
;SEQIO retrieval from GenBank database entry. 07-Feb-1996
gb:A14666|acc:A14666 PRLB promoter - Bacteriophage lambda, 281 bp.
gatcagctgc gacacaacta gtttacttac tcgcttatta aaccagaccc acaatctttt
acacagatac aatattttta gtggaaactt cttgacattt cggcccatga cctttactct
gttataaatt acttttatgg gggacgatca cactagcaaa ggagttacct aagccccgaa
tgttcaatgg gaagacttcc ccaatcatga cccacattac gggaccccaa gttgcggaga
agaaggcgat gtaaactgtc aaagcaatca cagagatgat c1

Example IG-old/Stanford-old entry:
;NCBI gi: 579066
gb:A14666|acc:A14666 PRLB promoter - Bacteriophage lambda, 281 bp.
gatcagctgc gacacaacta gtttacttac tcgcttatta aaccagaccc acaatctttt
acacagatac aatattttta gtggaaactt cttgacattt cggcccatga cctttactct
gttataaatt acttttatgg gggacgatca cactagcaaa ggagttacct aagccccgaa
tgttcaatgg gaagacttcc ccaatcatga cccacattac gggaccccaa gttgcggaga
agaaggcgat gtaaactgtc aaagcaatca cagagatgat c1
NOTE: This file format implementation is not nearly complete enough to handle all of the variations of ASN.1 text files. I concentrated the implementation on handling the "Bioseq" sequence records defined as part of the "Bioseq-set" structure, i.e., it looks for each "Bioseq-set.seq-set.seq" record in the file, where '.' separates the initial keywords for each level of sub-record. (See the NCBI toolkit for the definitions of the "Bioseq-set" and "Bioseq" syntax, and the values of those initial keywords.)

However, it does handle all of the syntactic requirements of the ASN.1 text format. It makes no assumptions about the structure of the file, handling a completely free-form file (with one exception listed below). It does assume that the format consists of a hierarchy of records, where a record consists of a text string identifier and then a pair of matching braces bounding the contents of the record (except for simple records which contain only one or more strings and numbers).

The read operation looks for the beginning of each "Bioseq-set.seq-set.seq" record in the file. The operation assumes that this record is a "Bioseq" record, and looks for the end of it. Also, the read operation makes the syntactic requirement that the open brace beginning the "seq" record is separated from its initial keyword by exactly one space (i.e., the operation looks for the string "seq {"). After scanning to the end of the "seq" record, the operation looks for the "seq.inst.length" sub-record. If found, the sequence length is extracted from that sub-record.
The getseq operation looks for the "seq.inst.seq-data" sub-record in the entry. If found, the sequence is extracted from that sub-record. (NOTE: This operation can only handle sequences that have been encoded in the `iupacna', `iupacaa', `ncbi2na' or `ncbi4na' formats.)
The rawseq operation is the same as the getseq operation, since the `iupacna', `iupacaa', `ncbi2na' and `ncbi4na' formats do not contain non-alphabetic characters.
The getinfo operation looks for a large number of possible sub-records for information about the sequence. To find database identifiers, it looks in the "seq.id" sub-record for the sub-sub-records "pir.name", "pir.accession", "swissprot.name", "swissprot.accession", "genbank.name", "genbank.accession", "embl.name", "embl.accession", "ddbj.name", "ddbj.accession", "prf.name", "prf.accession", "other.name", "other.accession", "pdb.mol", "gi", "giim.id", "gibbsq" and "gibbmt". Any identifiers found are added to the idlist. To find the date information, it looks in the "seq.descr" sub-record to find the sub-sub-records "create-date", "update-date", "genbank.date", "genbank.entry-date", "embl.creation-date", "embl.update-date", "pir.date", "sp.created", "sp.sequpd", "sp.annotupd" and "pdb.deposition".
Then, the operation searches for the description, organism and comment information in the "seq.descr" sub-record. For the description, the operation searches for the sub-sub-records "title", "pdb.compound" and "name" and picks one of them for the description ("title" if found, else "pdb.compound", else "name"). For the organism, the sub-sub-records "org.taxname", "org.common", "pir.source" and "pdb.source" are searched. For the comments, all of the "comment" sub-sub-records in "seq.descr" are concatenated together to make up the comment lines.
Finally, the alphabet is picked up from the "seq.descr.mol-type", "seq.descr.modif.dna", "seq.descr.modif.rna" or "seq.inst.mol" sub-records, the isfragment field is set to 1 if "seq.descr.modif.partial" exists, and the iscircular field is set to 1 if the data string in "seq.inst.topology" is "circular".
The putseq operation outputs a "Bioseq" record for the sequence as part of a "Bioseq-set" structure (i.e., the appropriate strings are output before the first putseq operation, between the "Bioseq" records and when the file is closed, so that the file consists of a correctly formatted "Bioseq-set" record). The form of the file mirrors that of the Bioseq-set example given in the NCBI toolkit.
(NOTE: Because some text must be output when the file is closed (i.e., when seqfclose is called), you MUST call seqfclose when writing an ASN.1 file. If you don't call seqfclose, the text file will not be complete.)
The annotate operation either replaces, creates or appends the comment lines in the "seq.descr" sub-record (i.e., the comment lines are the "seq.descr.comment" records). If no "seq.descr" sub-record exists, one is created in the most appropriate place in the "seq" record. If the entry given to the annotate operation is not a Bioseq "seq" record, an error occurs.
(NOTE: Using the annotate operation by itself will NOT create a valid ASN.1 text file. You must output the following strings before the first entry, between entries, and after the last entry (again, assuming the entries are "Bioseq" records taken from the "Bioseq-set" hierarchy):
Before the first entry: "Bioseq-set ::= {\n seq-set {\n"
Between entries: " ,\n"
After the last entry: " } }\n")
A Complete ASN.1 Text File:
Bioseq-set ::= {
  seq-set {
    seq {
      id {
        genbank {
          name "A14666" ,
          accession "A14666" } } ,
      descr {
        title "PRLB promoter" ,
        org {
          taxname "Bacteriophage lambda" } ,
        update-date str "18-AUG-1994" ,
        comment "NCBI gi: 579066" ,
        comment "SEQIO retrieval from GenBank database entry. 07-Feb-1996" } ,
      inst {
        repr raw ,
        mol dna ,
        length 281 ,
        seq-data iupacna "gatcagctgcgacacaactagtttacttactcgcttattaaaccagacccacaatcttt
tacacagatacaatatttttagtggaaacttcttgacatttcggcccatgacctttactctgttataaattactttta
tgggggacgatcacactagcaaaggagttacctaagccccgaatgttcaatgggaagacttccccaatcatgacccac
attacgggaccccaagttgcggagaagaaggcgatgtaaactgtcaaagcaatcacagagatgatc" } } } }
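The wrapping strings listed in the note above can be applied as in the following Python sketch. It assumes each entry is already a formatted "seq { ... }" text ending in a newline; the function name `wrap_bioseq_set` is hypothetical, and the exact whitespace inside the three strings is taken as printed in this document.

```python
BEFORE_FIRST = "Bioseq-set ::= {\n seq-set {\n"
BETWEEN = " ,\n"
AFTER_LAST = " } }\n"

def wrap_bioseq_set(entries):
    """Join already-formatted "seq { ... }" entry texts into one Bioseq-set text."""
    return BEFORE_FIRST + BETWEEN.join(entries) + AFTER_LAST
```

When writing through the package itself none of this is needed, because putseq and seqfclose emit these strings; the sketch only matters when annotated entries are written out by hand.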
gb:A02201 Length: 664 June 21, 1996 18:42 Type: N Check: 9896 ..
although any or all of this information (except the "..") can be missing. If the line contains the "Length:" keyword, then the read operation will extract the sequence length. The read operation then reads the rest of the file, and assumes that those lines contain the sequence.
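As a rough illustration of how such an information line could be picked apart, here is a Python sketch. The function name `parse_gcg_info_line` and the regular expressions are assumptions for the example; the only rules taken from the text are that the ".." must be present and that every other field is optional.

```python
import re

def parse_gcg_info_line(line):
    """Return the fields found on a GCG information line, or None without the '..'."""
    if not line.rstrip().endswith(".."):
        return None  # the ".." marker is the one mandatory part
    info = {}
    m = re.search(r"Length:\s*(\d+)", line)
    if m:
        info["length"] = int(m.group(1))
    m = re.search(r"Type:\s*(\S)", line)
    if m:
        info["type"] = m.group(1)
    m = re.search(r"Check:\s*(\d+)", line)
    if m:
        info["check"] = int(m.group(1))
    return info
```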
The getseq operation scans the sequences lines. All alphabetic characters on those lines are assumed to be in the sequence. No format for those lines is assumed.
The rawseq operation is the same as the getseq operation, except that all non-whitespace and non-numeric characters are considered part of the sequence. During this operation, any period `.' appearing in the sequence lines is assumed to be a gap character and translated into a dash `-' (SEQIO's canonical gap character).
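The difference between the two character filters can be shown with a short Python sketch. The function names are hypothetical and the code is only a model of the rules stated above, not the package's implementation.

```python
def getseq_chars(lines):
    """Keep only alphabetic characters (the getseq rule)."""
    return "".join(ch for line in lines for ch in line if ch.isalpha())

def rawseq_chars(lines):
    """Keep everything except whitespace and digits, mapping '.' to '-' (the rawseq rule)."""
    kept = (ch for line in lines for ch in line
            if not ch.isspace() and not ch.isdigit())
    return "".join("-" if ch == "." else ch for ch in kept)
```

For the two lines `"  1 gatc..agct"` and `" 11 ac*t"`, the first function yields `gatcagctact`, while the second keeps the gaps and the `*`, giving `gatc--agctac*t`.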
The getinfo operation takes the date and the alphabet from the GCG information line (if the date and the "Type:" fields are there), sets the description to the first word of the GCG information line (if it isn't "Length:"), and then takes all of the lines up to the GCG information line as the comment.
The putseq operation first outputs any comment lines, outputs a complete GCG information line (with a valid checksum), and then outputs the sequence lines in the default format shown below. Any dash `-' appearing in the output sequence is assumed to be a gap character and automatically translated into a period `.'.
There currently is no annotate function.
The one exception to this rule is the relationship between the NBRF and GCG-NBRF formats. Since the NBRF entries contain "header" information that actually appears at the end of the entry, and the GCG format requires that the last thing in an entry be the sequence, the GCG and non-GCG forms of the NBRF entries differ more than the other formats. In the GCG-NBRF format, the lines before the GCG information line are assumed to contain the two header lines normally found in the NBRF entries, immediately followed by the lines normally appearing at the end of the file (the "C;Comment:", "C;Accession:" and other lines). After those lines, the GCG information line and sequence lines should appear, and be the last things in the entry. The fmtseq program and SEQIO package have been implemented to make this transformation between the NBRF and GCG-NBRF formats.
An example GCG-Genbank entry:
LOCUS A14666 281 bp DNA PHG 18-AUG-1994 DEFINITION PRLB promoter. ACCESSION A14666 KEYWORDS . SOURCE Bacteriophage lambda. ORGANISM Bacteriophage lambda Viridae; ds-DNA nonenveloped viruses; Siphoviridae. REFERENCE 1 (bases 1 to 281) AUTHORS Michiels,F., Delcour,J., Mahillon,J., Joos,H., Platteeuw,C. and Josson,K. TITLE Transformed lactic acid bacteria JOURNAL Patent: EP 0311469-A 10 12-APR-1989; PLANT GENETIC SYSTEMS N.V.; UNIVERSITE CATHOLIQUE DE LOUVAIN COMMENT NCBI gi: 579066 FEATURES Location/Qualifiers source 1..281 /organism="Bacteriophage lambda" RBS 158..166 CDS 180..254 /note="PRLB; NCBI gi: 579067" /codon_start=1 /translation="MFNGKTSPIMTHITGPQVAEKKAM" BASE COUNT 89 a 67 c 52 g 73 t ORIGIN gb:A14666 Length: 281 June 28, 1996 16:23 Type: N Check: 2754 .. 1 gatcagctgc gacacaacta gtttacttac tcgcttatta aaccagaccc 51 acaatctttt acacagatac aatattttta gtggaaactt cttgacattt 101 cggcccatga cctttactct gttataaatt acttttatgg gggacgatca 151 cactagcaaa ggagttacct aagccccgaa tgttcaatgg gaagacttcc 201 ccaatcatga cccacattac gggaccccaa gttgcggaga agaaggcgat 251 gtaaactgtc aaagcaatca cagagatgat c
An example GCG-NBRF entry:
>DL;gb:A14666 PRLB promoter - Bacteriophage lambda, 281 bp. C;Date: 18-AUG-1994 C;Accession: A14666 C;Comment: NCBI gi: 579066 C;Comment: C;Comment: SEQIO retrieval from GenBank database. 28-Jun-1996 gb:A14666 Length: 281 June 28, 1996 16:22 Type: N Check: 2754 .. 1 gatcagctgc gacacaacta gtttacttac tcgcttatta aaccagaccc 51 acaatctttt acacagatac aatattttta gtggaaactt cttgacattt 101 cggcccatga cctttactct gttataaatt acttttatgg gggacgatca 151 cactagcaaa ggagttacct aagccccgaa tgttcaatgg gaagacttcc 201 ccaatcatga cccacattac gggaccccaa gttgcggaga agaaggcgat 251 gtaaactgtc aaagcaatca cagagatgat c
Pileup.Msf MSF: 729 Type: N June 21, 1996 15:02 Check: 3171 ..
although any or all of this information can be missing, except the ".." and the "MSF: %d" section, the second of which the read operation uses to get the sequence length. After the information line, the read operation looks for the sequence name lines, which are of the form
Name: Humhbbbpc Len: 729 Check: 6463 Weight: 1.00
where the "Name: " field gives the sequence identifier and must appear on any non-blank line in this section of the MSF file (the other fields are ignored, and the length is assumed to be the same as the global length). The sequence name lines section ends when a line beginning with "//" appears. Any number of blank lines can be interspersed in this section, but any non-blank line should contain the above format. The rest of the file is assumed to contain the sequence lines, where each sequence line begins with the sequence name followed by a space, as in:
401 450 Humhbbbpc CAGAAATTTA GTGTTTTCTC AGTCAGTTAA CATTCCTTCA AC........ Humhbbbpd CAGAAATTTA GTGTTTTCTC AGTCAGTTAA CATTCCTTCA AC........ Humhbbbpe CAGACAAATG GAACAGAATA GAGAGCCCAG AAATAAGACC ACATG..... Humhbbbpf CAGACAAATG GAACAGAATA GAGAGCCCAG AAATAAGACC ACATG..... Humhbbbpg AATACAAAAT CAGTAGCATT TCATATATAA A......... .......... Humhbbbph AATACAAAAT CAGTAGCATT TCATATATAA A......... .......... Humhbbbp1 AAGTGATGAA ATTGTGTATT CAATGTAGTC TCAAGAGAAT TGAAAACCAA Humhbbbpa AAATAAAAGG ATGGAGGAAG ATCTACCAAG CA........ .......... Humhbbbpb AAATAAAAGG ATGGAGGAAT ATCTACCAAG CA........ .......... Humhbbbp2 AGCT.AAAGG ATTGTAAATG CACTAATCAG CACTCTGTGT CTAGCTCAAG
No format of the sequence lines or presence or absence of the position number lines (401...450) is assumed, except for the initial sequence name. The sequence lines run to the end of the file.
The getseq operation finds every sequence line beginning with the corresponding sequence name (the sequences are ordered by the order of sequence names in the sequence names section). All alphabetic characters appearing after the sequence name are taken for the sequence.
The rawseq operation is the same as the getseq operation, except that all non-whitespace and non-numeric characters are considered part of the sequence. During this operation, any period `.' appearing in the sequence lines is assumed to be a gap character and translated into a dash `-' (SEQIO's canonical gap character).
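Collecting one sequence from the interleaved MSF sequence lines might look like the Python sketch below. The name `msf_sequence` is hypothetical, and ignoring leading whitespace before the sequence name is an assumption of this example rather than something the text states.

```python
def msf_sequence(seq_lines, name):
    """Concatenate the alphabetic characters from every line that starts with
    the sequence name followed by a space."""
    prefix = name + " "
    parts = []
    for line in seq_lines:
        stripped = line.lstrip()  # assumption: indentation before the name is ignored
        if stripped.startswith(prefix):
            rest = stripped[len(prefix):]
            parts.append("".join(ch for ch in rest if ch.isalpha()))
    return "".join(parts)
```

Matching on the name plus the following space keeps a name such as "Humhbbbp" from also picking up the lines of "Humhbbbpc".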
The getinfo operation takes the date and the alphabet from the GCG information line (if the date and the "Type:" fields are there), sets the description to the sequence name found in the sequence name section, and then takes all of the lines up to the GCG information line as the comment.
The putseq operation outputs an MSF file exactly mimicking the files output by GCG using "PileUp" in its default mode, except that only the keyword "PileUp" appears on the first line and no comments are output. Any dashes `-' found in the sequences are assumed to be gap characters and are automatically translated into periods `.'. If the sequences are of different lengths, the putseq operation will pad the shorter sequences with periods `.'.
(IMPORTANT: The one unusual feature about the putseq operation is that, unlike all of the other putseq operations except Clustalw and PHYLIP, the actual output does not occur until `seqfclose' is called to close the file. Because the MSF format must know the number of entries before it can begin the output, the sequences cannot be output at each call to `seqfwrite'. What the putseq operation does, on each call to `seqfwrite', is make a copy of the sequence and a sequence identifier (either the main identifier, description or organism name). Then, when `seqfclose' is called, all of the sequences are output in the correct format.)
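The deferred-output behaviour described in this note can be modelled with a small Python class. This is a sketch of the idea only: the class and method names are hypothetical stand-ins for what happens on `seqfwrite' and `seqfclose', and the real MSF header and row formatting is left out.

```python
class DeferredAlignmentWriter:
    """Buffer sequences on write(); pad and emit them only on close()."""

    def __init__(self, gap="."):
        self.gap = gap          # '.' for MSF; '-' would suit PHYLIP or Clustalw
        self.entries = []       # (identifier, sequence) copies, in write order
        self.closed = False

    def write(self, identifier, sequence):
        if self.closed:
            raise ValueError("writer is already closed")
        # Translate the canonical gap '-' into the output gap character.
        self.entries.append((identifier, sequence.replace("-", self.gap)))

    def close(self):
        """Return the padded rows; a real writer would format and output them here."""
        self.closed = True
        width = max((len(seq) for _, seq in self.entries), default=0)
        return [(ident, seq.ljust(width, self.gap)) for ident, seq in self.entries]
```

Nothing can be emitted before `close()` because the number of sequences and the padded width are only known once the last sequence has been written.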
There currently is no annotation function.
An example MSF file:
PileUp pir.msf MSF: 104 Type: P June 28, 1996 17:04 Check: 3466 .. Name: pir:CCCZ Len: 104 Check: 9501 Weight: 1.00 Name: pir:CCMQR Len: 104 Check: 9512 Weight: 1.00 Name: pir:CCMKP Len: 104 Check: 9066 Weight: 1.00 Name: pir:CCRB Len: 104 Check: 8395 Weight: 1.00 Name: pir:CCGW Len: 104 Check: 8496 Weight: 1.00 Name: pir:CCCM Len: 104 Check: 8496 Weight: 1.00 // 1 50 pir:CCCZ GDVEKGKKIF IMKCSQCHTV EKGGKHKTGP NLHGLFGRKT GQAPGYSYTA pir:CCMQR GDVEKGKKIF IMKCSQCHTV EKGGKHKTGP NLHGLFGRKT GQAPGYSYTA pir:CCMKP GDVFKGKRIF IMKCSQCHTV EKGGKHKTGP NLHGLFGRKT GQASGFTYTE pir:CCRB GDVEKGKKIF VQKCAQCHTV EKGGKHKTGP NLHGLFGRKT GQAVGFSYTD pir:CCGW GDVEKGKKIF VQKCAQCHTV EKGGKHKTGP NLHGLFGRKT GQAVGFSYTD pir:CCCM GDVEKGKKIF VQKCAQCHTV EKGGKHKTGP NLHGLFGRKT GQAVGFSYTD 51 100 pir:CCCZ ANKNKGIIWG EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE RADLIAYLKK pir:CCMQR ANKNKGITWG EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE RADLIAYLKK pir:CCMKP ANKNKGIIWG EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE RADLIAYLKK pir:CCRB ANKNKGITWG EDTLMEYLEN PKKYIPGTKM IFAGIKKKDE RADLIAYLKK pir:CCGW ANKNKGITWG EETLMEYLEN PKKYIPGTKM IFAGIKKKGE RADLIAYLKK pir:CCCM ANKNKGITWG EETLMEYLEN PKKYIPGTKM IFAGIKKKGE RADLIAYLKK 101 pir:CCCZ ATNE pir:CCMQR ATNE pir:CCMKP ATNE pir:CCRB ATNE pir:CCGW ATNE pir:CCCM ATNE
NOTE: The implementation here is more flexible than other implementations; however, it is a bit restrictive in its output, in that:
- Both interleaved and sequential formats are supported and rigorously distinguished. See below for the details.
- An input file in the PHYLIP format can contain one or more PHYLIP entries, where each entry must be separated only by whitespace. Mixed files (some interleaved entries, some sequential entries) are supported.
- Any number of blank lines or lines filled only with whitespace can be included in the file. Blank lines do not disrupt the parsing of the entries.
- The output operation does NOT output more than one entry per file, because I have yet to completely figure out the SEQIO interface issues. (Note that this may change in a future version.)
- This implementation was done using the documentation from Version 3.5c. Whether it works with earlier versions is not known.
The read operation first skips whitespace characters and then looks for the number of sequences and the sequence length (those two numbers must be the first thing in the entry). On that initial line, it also looks for the option characters 'A', 'C', 'F', 'M', 'U', 'W'. If any of the options except 'U' are found, the operation then skips any subsequent lines that begin with a match to the character strings "ANCESTOR ", "CATEGORIES", "FACTORS ", "MIXTURE ", or "WEIGHTS ". A line is considered to match one of the strings if the first 10 characters of the line contain a prefix of the string padded by spaces. Also, these lines are skipped only if the corresponding option was given on that first line.
3 6 A
A         ABCDEF
B         BCDEFG
C         CDEFGH
because the second line of the entry is treated as an "ANCESTOR " line, when in fact it was a sequence line. But, from looking at the documentation, the PHYLIP programs would die on this entry, too. And replacing "A " with something like "Alpha " eliminates the problem.)
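The 10-character matching rule for option lines, and the pitfall with a sequence named "A", can be sketched in Python as follows. The function name `is_option_line`, the case-sensitive comparison, and the treatment of lines shorter than 10 characters as space-padded are all assumptions of this example.

```python
OPTION_KEYWORDS = {
    "A": "ANCESTOR",
    "C": "CATEGORIES",
    "F": "FACTORS",
    "M": "MIXTURE",
    "W": "WEIGHTS",
}

def is_option_line(line, options):
    """True if the first 10 characters are a space-padded prefix of an option
    keyword whose option letter was given on the entry's first line."""
    field = line.rstrip("\n")[:10].ljust(10)
    prefix = field.rstrip(" ")
    if not prefix:
        return False  # a blank name field matches nothing
    return any(
        letter in options and keyword.startswith(prefix)
        for letter, keyword in OPTION_KEYWORDS.items()
    )
```

With the 'A' option set, a sequence line whose name field is "A" followed by nine spaces is taken for an "ANCESTOR" line, whereas a name field of "Alpha" is not, which is the behaviour the note describes.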
After skipping those initial lines, the read operation tries to match the subsequent lines to the interleaved and sequential file formats. The following criteria are the keys to distinguishing between the two formats:
Finally, if the 'U' option has been set on the entry's first line, the read operation skips the user trees listed in the entry, to get to the end of the entry. The format of the user trees consists of a line giving the number of trees, followed by any number of lines of text where each user tree description is ended by a semi-colon (the operation just counts the semi-colons it sees). The end of the entry is at the end of the line containing the last semi-colon.
The getseq operation finds the first line of the appropriate sequence in the entry (i.e., the `seqfseqno' sequence), skips the 10 character identifier and retrieves the sequence. All alphabetic characters are considered to be in the sequence.
The rawseq operation is the same as the getseq operation, except that all non-whitespace and non-numeric characters are considered part of the sequence.
The getinfo operation takes the 10 character sequence identifier to be the description of the sequence. No other information is retrieved.
The putseq operation outputs an Interleaved or Sequential entry exactly as described in the PHYLIP program documentation. If the sequences output are of different lengths, the putseq operation will pad the smaller sequences with dashes `-'.
(IMPORTANT: The one unusual feature about the putseq operation is that, unlike all of the other putseq operations except Clustalw and MSF, the actual output does not occur until `seqfclose' is called to close the file. Because the PHYLIP format must know the number of entries before it can output the first line, the sequences cannot be output at each call to `seqfwrite'. Instead, on each call to `seqfwrite', the putseq operation makes a copy of the sequence and a sequence identifier (either the mainid, mainacc, description or organism name). Then, when `seqfclose' is called, all of the sequences are output in the correct format.)
There is no annotate function.
Example PHYLIP Interleaved entry:
6 104 pir:CCCZ GDVEKGKKIF IMKCSQCHTV EKGGKHKTGP NLHGLFGRKT GQAPGYSYTA pir:CCMQR GDVEKGKKIF IMKCSQCHTV EKGGKHKTGP NLHGLFGRKT GQAPGYSYTA pir:CCMKP GDVFKGKRIF IMKCSQCHTV EKGGKHKTGP NLHGLFGRKT GQASGFTYTE pir:CCRB GDVEKGKKIF VQKCAQCHTV EKGGKHKTGP NLHGLFGRKT GQAVGFSYTD pir:CCGW GDVEKGKKIF VQKCAQCHTV EKGGKHKTGP NLHGLFGRKT GQAVGFSYTD pir:CCCM GDVEKGKKIF VQKCAQCHTV EKGGKHKTGP NLHGLFGRKT GQAVGFSYTD ANKNKGIIWG EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE RADLIAYLKK ANKNKGITWG EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE RADLIAYLKK ANKNKGIIWG EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE RADLIAYLKK ANKNKGITWG EDTLMEYLEN PKKYIPGTKM IFAGIKKKDE RADLIAYLKK ANKNKGITWG EETLMEYLEN PKKYIPGTKM IFAGIKKKGE RADLIAYLKK ANKNKGITWG EETLMEYLEN PKKYIPGTKM IFAGIKKKGE RADLIAYLKK ATNE ATNE ATNE ATNE ATNE ATNE
Example PHYLIP Sequential entry:
6 104 pir:CCCZ GDVEKGKKIF IMKCSQCHTV EKGGKHKTGP NLHGLFGRKT GQAPGYSYTA ANKNKGIIWG EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE RADLIAYLKK ATNE pir:CCMQR GDVEKGKKIF IMKCSQCHTV EKGGKHKTGP NLHGLFGRKT GQAPGYSYTA ANKNKGITWG EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE RADLIAYLKK ATNE pir:CCMKP GDVFKGKRIF IMKCSQCHTV EKGGKHKTGP NLHGLFGRKT GQASGFTYTE ANKNKGIIWG EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE RADLIAYLKK ATNE pir:CCRB GDVEKGKKIF VQKCAQCHTV EKGGKHKTGP NLHGLFGRKT GQAVGFSYTD ANKNKGITWG EDTLMEYLEN PKKYIPGTKM IFAGIKKKDE RADLIAYLKK ATNE pir:CCGW GDVEKGKKIF VQKCAQCHTV EKGGKHKTGP NLHGLFGRKT GQAVGFSYTD ANKNKGITWG EETLMEYLEN PKKYIPGTKM IFAGIKKKGE RADLIAYLKK ATNE pir:CCCM GDVEKGKKIF VQKCAQCHTV EKGGKHKTGP NLHGLFGRKT GQAVGFSYTD ANKNKGITWG EETLMEYLEN PKKYIPGTKM IFAGIKKKGE RADLIAYLKK ATNE
The getseq operation finds the first line of the appropriate sequence in the entry (i.e., the `seqfseqno' sequence), skips the 15 character identifier and retrieves the sequence. All alphabetic characters are considered to be in the sequence.
The rawseq operation is the same as the getseq operation, except that all non-whitespace and non-numeric characters are considered part of the sequence.
The getinfo operation takes the 15 character sequence identifier to be the description of the sequence. No other information is retrieved.
The putseq operation outputs a Clustalw entry exactly as the clustalw program does, except that the version number is replaced with "*.**" and the package does not look for closely related columns in the output alignment (it simply outputs a line of whitespace without any '*' or '.' characters). If the sequences are of different lengths, the putseq operation will pad the smaller sequences with dashes '-'.
(IMPORTANT: The one unusual feature about the putseq operation is that, unlike all of the other putseq operations except PHYLIP and MSF, the actual output does not occur until `seqfclose' is called to close the file. Because the Clustalw format must know the number of entries before it can output the first line, the sequences cannot be output at each call to `seqfwrite'. Instead, on each call to `seqfwrite', the putseq operation makes a copy of the sequence and a sequence identifier (either the mainid, mainacc, description or organism name). Then, when `seqfclose' is called, all of the sequences are output in the correct format.)
There is no annotate function.
Example Clustalw file:
CLUSTAL W(*.**) multiple sequence alignment pir:CCCZ GDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWG pir:CCMQR GDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGITWG pir:CCMKP GDVFKGKRIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQASGFTYTEANKNKGIIWG pir:CCRB GDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGRKTGQAVGFSYTDANKNKGITWG pir:CCGW GDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGRKTGQAVGFSYTDANKNKGITWG pir:CCCM GDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGRKTGQAVGFSYTDANKNKGITWG pir:CCCZ EDTLMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE pir:CCMQR EDTLMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE pir:CCMKP EDTLMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE pir:CCRB EDTLMEYLENPKKYIPGTKMIFAGIKKKDERADLIAYLKKATNE pir:CCGW EETLMEYLENPKKYIPGTKMIFAGIKKKGERADLIAYLKKATNE pir:CCCM EETLMEYLENPKKYIPGTKMIFAGIKKKGERADLIAYLKKATNE
NOTE: With one or two exceptions, this implementation can read and understand the output from the FASTA, TFASTA, SSEARCH, LFASTA, LALIGN and ALIGN programs, run in either interactive or non-interactive mode, where the output was formatted with the MARKX option set to any of 0, 1, 2, 3 or 10. The exceptions are:
- The program must have been run in non-interactive mode in order for the automatic format determination to work correctly. By "non-interactive", I mean that the initial header output by the program:
FASTA searches a protein or DNA sequence data bank version 2.0u4 Feb., 1996 Please cite: W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448 . . .
must appear in the text given as input.
- If FASTA, TFASTA or SSEARCH is run in interactive mode, no information will be known about the query sequence (its information is in the initial header, which is not included in the file specified to receive the program output).
- The ALIGN program must be run in non-interactive mode in order for the package to correctly parse it (i.e., that initial header must occur in the text). For the other programs, the package will parse their output correctly if the file format is specified as `FASTA-output'.
- The implementation was tested against version 2.0u4. If the output was different in previous versions, the implementation may not work.
The read operation first scans the text occurring before the first alignment in the file. This initial text is ignored, except where it gives information about the sequences being aligned. The initial texts of some of the output formats contain lines of the following form:
>GT8.7 transl. of pa875.con, 19 to 675: 217 aa >musplfm transl. of musplfm.seq, 2 to 676 : 224 aa (A) musplfm.aa >musplfm transl. of musplfm.seq, 2 to 676 - 224 aa (B) lcbo.aa >LCBO - Prolactin precursor - Bovine - 229 aa >musplfm transl. of musplfm.seq, 2 to 676 224 aa vs. >LCBO - Prolactin precursor - Bovine 229 aa
The text after the '>' is parsed to extract the sequence id (the first word after the '>'), a sequence description, the sequence length and alphabet information about the sequence.
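One way to pull the id, description, length and alphabet out of a single '>' header of this kind is sketched below in Python. The regular expression and the name `parse_fasta_output_header` are assumptions for illustration, as is the mapping of the "aa" and "nt" unit suffixes to protein and DNA; the package's own parsing may differ.

```python
import re

# id = first word after '>', then a description, optional separators, "<n> aa|nt".
HEADER_RE = re.compile(r"^>(\S+)\s*(.*?)[\s:,-]*(\d+)\s+(aa|nt)\s*$")

def parse_fasta_output_header(text):
    """Parse one '>' header line; return None if it does not fit the pattern."""
    m = HEADER_RE.match(text)
    if m is None:
        return None
    seq_id, desc, length, unit = m.groups()
    return {
        "id": seq_id,
        "description": desc.strip(" -:,"),
        "length": int(length),
        "alphabet": "protein" if unit == "aa" else "dna",
    }
```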
Then, the read operation reads the "entries" of the file, where each entry is considered to be the text describing an alignment between two sequences. Different programs output different sets of alignments, but all six of the FASTA programs supported output one or more two-sequence alignments. Thus, every entry in this format contains two sequences.
The getseq operation extracts the appropriate sequence from the entry (the first or second sequence if the `seqfseqno' value is 1 or 2, respectively). All alphabetic characters are considered part of the sequence, except that if the output was generated with MARKX=2, then any periods occurring in the second sequence are replaced with the corresponding character of the first sequence.
The rawseq operation is the same as the getseq operation, except that all non-whitespace and non-numeric characters are considered part of the sequence (with the exception of period substitution mentioned above).
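The period substitution for MARKX=2 output can be written in a few lines of Python. The function name `expand_markx2` is hypothetical, and the sketch assumes the two strings are the aligned rows of the first and second sequence, position by position.

```python
def expand_markx2(first, second):
    """Replace each '.' in `second` with the character at the same position in `first`."""
    out = []
    for i, ch in enumerate(second):
        if ch == "." and i < len(first):
            out.append(first[i])
        else:
            out.append(ch)
    return "".join(out)
```

For example, with a first row of `ACGTAC` and a second row of `..T..G`, the second sequence is recovered as `ACTTAG`.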
The getinfo operation extracts a main identifier, a description and an alphabet for the appropriate sequence, if available. It also constructs a comment that begins with the following:
From SSEARCH output alignment of: >musplfm transl. of musplfm.seq, 2 to 676, 224 aa >LCBO - Prolactin precursor - Bovine, 229 aa
This gives the name of the program whose output is being parsed, and the descriptions of the two sequences from whose alignment came the current sequence. This text is then followed by any information from the alignment describing the score of that pairwise alignment. The format of this text depends on the FASTA program executed and the MARKX value, as it is just copied from the program output.
There is no putseq or annotate operation.
NOTE: With one or two exceptions, this implementation can read and understand the output from the BLASTN, BLASTP or BLASTX programs (and maybe even the TBLAST* programs, although that has not been tested yet). The exceptions are:
- Automatic recognition of the BLAST-output format requires that one of the keywords BLASTN, BLASTP or BLASTX be the first word in the file (possibly after an e-mail header). Many of the BLAST e-mail servers prepend a description of their service before the actual BLAST output, and so disrupt the recognition by the package. So, for output obtained from an e-mail server, the input format must be set explicitly.
- The implementation was tested on output generated by versions 1.2 and 1.4.9. If the output is different in version 1.3 or 2.0, the implementation may not work (although the implementation can correctly handle gaps in the alignments, so that change from 1.* to 2.0 is handled).
The read operation first scans the text occurring before the first alignment in the file. This initial text is ignored, except where it gives information about the sequences being aligned. The initial texts of some of the output formats contain lines of the following form:
Query= gb:A02201|acc:A02201 Phage phi-105 DNA for immF plypeptide - Bacteriophage phi- (665 letters)
The text after "Query=" and before the line containing the "(... letters)" is parsed as a oneline description, and the number inside the "(... letters)" is taken as the length of the query sequence.
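Extracting the query description and length from this block could be sketched in Python as below. The function name `parse_query_block` and the regular expression are assumptions of the example; allowing commas inside the number is also an assumption, made because lengths elsewhere in BLAST output are printed that way.

```python
import re

def parse_query_block(text):
    """Return the oneline description and length from a 'Query= ... (N letters)' block."""
    m = re.search(r"Query=\s*(.*?)\(\s*([\d,]+)\s+letters\)", text, re.DOTALL)
    if m is None:
        return None
    # Collapse line breaks and runs of spaces into a single-line description.
    description = " ".join(m.group(1).split())
    length = int(m.group(2).replace(",", ""))
    return {"description": description, "length": length}
```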
Then, the read operation reads the "entries" of the file, where each entry is considered to be the text describing an alignment between two sequences. The BLAST alignment format consists of header lines specifying the sequence that matches the query, followed by one or more pairwise alignments of substrings of the matching sequence and the query. The read operation first scans the header lines, which are of the form:
>emb|X02799|NCPHI105 Bacillus subtilis phage phi 105 immunity region with repressor gene and ORF >emb|A11144|A11144 phage phi 105 repressor (ORF1)-Orf 2 genes and there flanking regions Length = 1306
where the "Length =" line ends the list of oneline descriptions of the sequences that match the query (in the next pairwise alignment(s)). It extracts the oneline description and length of the sequence.
The read operation considers an "entry" to consist only of the actual score reporting text and pairwise alignment text. So, while the header lines above are scanned for their information, the entry reported by the package begins at the line containing either "Plus Strand HSPs:", "Minus Strand HSPs:" or "Score =". And the entry ends just after the last line of the pairwise alignment text. This is done to make the entry text reported by the package more uniform. Thus, the following BLAST output would be reported as two entries, the first beginning at the "Plus Strand HSPs:" line and running through the first pairwise alignment, and the second beginning with the "Score = 89..." line. The header lines will not be reported in any alignment, and will only be scanned to extract the oneline description and length information.
>emb|Z68118|CER01E6 Caenorhabditis elegans cosmid R01E6 Length = 40,937 Plus Strand HSPs: Score = 127 (35.1 bits), Expect = 3.2, Sum P(2) = 0.96 Identities = 39/56 (69%), Positives = 39/56 (69%), Strand = Plus / Plus Query: 426 ATTTTAATAAATCTGGATTTAAATGTGTTAAAAATGACGGAAATACAAGTAGTTGA 481 |||||||||||||| |||||| | ||||||||| | || | || || | Sbjct: 35266 ATTTTAATAAATCTCATCTTAAATTAGATAAAAATGAATGCAAAATTTATATTTTA 35321 Score = 89 (24.6 bits), Expect = 3.2, Sum P(2) = 0.96 Identities = 25/34 (73%), Positives = 25/34 (73%), Strand = Plus / Plus Query: 93 ACAATACTAAAAAAGACGGAAATACAAGTATTTT 126 |||||||||||||| | || || |||||| Sbjct: 31613 ACAATACTAAAAAATCTTGTAAACAAAATATTTT 31646
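The entry boundaries described above can be modelled with a short Python sketch. The function name `split_blast_entries` is hypothetical; the code assumes the input is a list of output lines, that a line starting with '>' begins the header of a new matching sequence, and that trailing blank lines are not part of an entry.

```python
STRAND_MARKERS = ("Plus Strand HSPs:", "Minus Strand HSPs:")

def split_blast_entries(lines):
    """Group alignment lines into entries; header lines are never reported."""
    entries = []
    current = None
    after_strand = False  # a strand marker is still waiting for its "Score =" line
    for line in lines:
        text = line.strip()
        if text.startswith(">"):
            current, after_strand = None, False   # header of the next matching sequence
        elif text.startswith(STRAND_MARKERS):
            current = [line]
            entries.append(current)
            after_strand = True
        elif text.startswith("Score ="):
            if after_strand:
                current.append(line)   # first score line stays with the strand marker
                after_strand = False
            else:
                current = [line]
                entries.append(current)
        elif current is not None:
            current.append(line)
    for entry in entries:
        while entry and not entry[-1].strip():
            entry.pop()                # the entry ends at the last alignment line
    return entries
```

Applied to the example above, this gives two entries: one starting at "Plus Strand HSPs:" and one starting at "Score = 89".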
The getseq operation extracts the appropriate sequence from the entry (the first or second sequence if the `seqfseqno' value is 1 or 2, respectively). All alphabetic characters are considered part of the sequence.
The rawseq operation is the same as the getseq operation, except that all non-whitespace and non-numeric characters are considered part of the sequence.
The getinfo operation extracts a main identifier, a description and an alphabet for the appropriate sequence, if available. It also constructs a comment that begins with the following:
From BLASTN/BLASTP/BLASTX output alignment of: >gb:A02201|acc:A02201 Phage phi-105 DNA for immF plypeptide - Bacteriophage phi and >emb|X02799|NCPHI105 Bacillus subtilis phage phi 105 immunity region with repressor gene and ORF >emb|A11144|A11144 phage phi 105 repressor (ORF1)-Orf 2 genes and there flanking regions
This gives the name of the program whose output is being parsed, and the descriptions of the two sequences from whose alignment came the current sequence. This text is then followed by any information from the alignment describing the score of that pairwise alignment.
There is no putseq or annotate operation.