Seedtop, like blastall and formatdb, is a command-line program whose parameters are specified with a leading dash followed by a one-letter parameter code. To find a pattern in a single protein sequence, use:

    seedtop -i input -k pat -p patmatchp -o pat_out

The file “pat” contains the pattern for a serine protease motif:
The file “input” contains the sequence for human kallikrein 1 pre-proprotein, NP_002248.1. The parameter “-p”, for “program mode”, specifies that the search will be of a protein sequence; for a single nucleotide sequence, use “-p patmatch”. Upon completion of the search, the result is placed in the file “pat_out”:
The format of the pattern file is rigid. There must be an ID line starting with the letters “ID”, followed by a space and some identifying free-form text. The second line must begin with the letters “PA”, followed by a space and the pattern, in PROSITE format.

Finding pattern matches in a set of protein sequences requires an extra step, in which a file with multiple FASTA sequences is formatted with formatdb, another program from the standalone BLAST package. If the sequences to be searched are in the file “proteases”, we begin by formatting the file using:

    formatdb -i proteases -pT -oT

To perform a batch pattern match, we call the database using “-d database” in lieu of the “-i filename” option. The following command line searches the database “proteases” for the pattern specified in the file “active_site” and saves the identified matches and their coordinates in “pat_out”:

    seedtop -d proteases -k active_site -p patternp -o pat_out

In the partial result below, each target sequence with a match is marked by a “seqno=” line. This is followed by the reiterated pattern and the matching positions, given by the first and last numbers in the “HI” field. Because the reported positions are zero-based, each must be incremented by 1 to obtain the actual coordinates. For the match below, the actual coordinates are 282 and 301.
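As a rough illustration of two details above, the sketch below builds the two-line pattern file and converts the zero-based positions from the “HI” field into one-based coordinates. It is a minimal Python sketch, not part of the BLAST package: the function names, the pattern G-D-S-G-G-P, and the HI values 281 and 300 are illustrative assumptions (the HI values were chosen to give the range 282,301 that is passed to fastacmd in this article).

```python
from typing import Tuple


def format_pattern_file(identifier: str, pattern: str) -> str:
    """Return pattern-file text: an "ID" line followed by a "PA" line.

    The layout follows the description in the text: "ID", a space, free-form
    text; then "PA", a space, and the pattern in PROSITE format.
    """
    if "\n" in identifier or "\n" in pattern:
        raise ValueError("identifier and pattern must each fit on one line")
    return f"ID {identifier}\nPA {pattern}\n"


def hi_to_one_based(hi_start: int, hi_end: int) -> Tuple[int, int]:
    """Convert zero-based HI positions to one-based sequence coordinates."""
    if hi_start < 0 or hi_end < hi_start:
        raise ValueError("expected 0 <= hi_start <= hi_end")
    return hi_start + 1, hi_end + 1


def fastacmd_range(hi_start: int, hi_end: int) -> str:
    """Format the one-based range as "start,stop" for the fastacmd -L option."""
    start, stop = hi_to_one_based(hi_start, hi_end)
    return f"{start},{stop}"


if __name__ == "__main__":
    # Hypothetical pattern and identifier, used only to show the file layout.
    print(format_pattern_file("example serine protease motif", "G-D-S-G-G-P"), end="")
    # Assumed HI values; the real ones come from the seedtop output.
    print(fastacmd_range(281, 300))
```

Keeping the conversion in one small function makes the off-by-one adjustment explicit before the range is handed to another tool.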
Since the database was formatted using the -oT option, we can retrieve the sequence of the matched region using fastacmd, from the standalone BLAST package:

    fastacmd -d proteases -s 45580723 -L 282,301

which returns the subsequence containing the pattern:
In addition to protein pattern matches, seedtop can search for patterns in a nucleotide sequence or database. A pattern that is specified using IUPAC codes needs to be converted to PROSITE syntax; for example, the IUPAC code R (A or G) is written as [AG]. To search a single nucleotide sequence for given patterns, use:

    seedtop -i input_seq -p patmatch -k pat_file

To search a nucleotide database for given patterns, use:

    seedtop -d database -p pattern -k pat_file

For questions regarding the BLAST services, please contact the BLAST help desk at: