NCBI News





BLAST Lab

Using seedtop to find patterns in protein and nucleotide sequences

Seedtop is one of the programs included in the NCBI standalone BLAST package. It finds matches to a pattern in a protein or nucleotide sequence, or in a sequence database.

Using seedtop to locate a pattern in protein and nucleotide sequences

Seedtop, like blastall and formatdb, is a command-line program whose parameters are specified with a leading dash followed by a one-letter parameter code. To find a pattern in a single protein sequence, we may use:

seedtop -i input -k pat -p patmatchp -o pat_out

The file “pat” contains the pattern for a serine protease motif:

ID Serine Protease Motif, cd00190
PA C-[AVLS]-X(3,9)-[DSNAR]-X-[CG]-X-[GSR]-[DE]-[SAPG]-G-[GS]-[PAG]-[LFMV]

The file “input” contains the sequence for human kallikrein 1 pre-proprotein, NP_002248.1.

The parameter “-p”, for “program mode”, is set to “patmatchp” to specify that the search will be of a single protein sequence; for a single nucleotide sequence use “-p patmatch”.

The result of the search will be placed in file “pat_out”:

Name Serine Protease Motif, cd00190
Pattern C-[AVLS]-X(3,9)-[DSNAR]-X-[CG]-X-[GSR]-[DE]-[SAPG]-G-[GS]-[PAG]-[LFMV]
At position 219 of query sequence

The format of the pattern file is rigid. The first line must be an ID line starting with the letters “ID”, followed by a space and some identifying free-form text. The second line must begin with the letters “PA”, followed by a space and the pattern, in PROSITE format.
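Because the layout is rigid, it can help to check a pattern file before starting a search. The following is a minimal sketch in Python that enforces only the two rules stated above. The function name parse_pattern_file is hypothetical, and this is not seedtop's own parser, which may accept or reject things this check does not.

```python
def parse_pattern_file(text):
    """Return (identifier, pattern) from a two-line ID/PA pattern file.

    Checks only the rules described in this article: an "ID " line
    followed by a "PA " line. Hypothetical helper, not part of BLAST.
    """
    # Ignore blank lines and trailing whitespace.
    lines = [line.rstrip() for line in text.splitlines() if line.strip()]
    if len(lines) != 2:
        raise ValueError("expected exactly one ID line and one PA line")
    id_line, pa_line = lines
    if not id_line.startswith("ID ") or not id_line[3:].strip():
        raise ValueError("first line must be 'ID', a space, then free-form text")
    if not pa_line.startswith("PA ") or not pa_line[3:].strip():
        raise ValueError("second line must be 'PA', a space, then the pattern")
    return id_line[3:].strip(), pa_line[3:].strip()


example = (
    "ID Serine Protease Motif, cd00190\n"
    "PA C-[AVLS]-X(3,9)-[DSNAR]-X-[CG]-X-[GSR]-[DE]-[SAPG]-G-[GS]-[PAG]-[LFMV]\n"
)
identifier, pattern = parse_pattern_file(example)
print(identifier)
print(pattern)
```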

Finding pattern matches in a set of protein sequences requires an extra step, in which a file with multiple FASTA sequences is formatted with formatdb, another program from the standalone BLAST package. If the sequences to be searched are in the file “proteases”, we begin by formatting the file using:

formatdb -i proteases -pT -oT

To perform a batch pattern match, we will call the database using “-d database” in lieu of the “-i filename” option. The following command line searches the database “proteases” for the pattern specified in the file “active_site”. The identified matches and their coordinates are saved in “pat_out”:

seedtop -d proteases -k active_site -p patternp -o pat_out
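The value given to “-p” depends on two choices: protein or nucleotide, and single sequence or database. The sketch below, in Python, collects the four program names and assembles an argument list from them. The helper name build_seedtop_args and its parameters are hypothetical, and the code only builds the argument list. It does not run seedtop.

```python
# Program names for "-p", keyed by (molecule, target).
SEEDTOP_PROGRAMS = {
    ("protein", "sequence"): "patmatchp",
    ("protein", "database"): "patternp",
    ("nucleotide", "sequence"): "patmatch",
    ("nucleotide", "database"): "pattern",
}


def build_seedtop_args(molecule, pattern_file, query=None, database=None, output=None):
    """Return a seedtop argument list; exactly one of query/database is required."""
    if (query is None) == (database is None):
        raise ValueError("give either a query file (-i) or a database (-d)")
    target = "sequence" if query is not None else "database"
    program = SEEDTOP_PROGRAMS[(molecule, target)]
    args = ["seedtop"]
    args += ["-i", query] if query is not None else ["-d", database]
    args += ["-k", pattern_file, "-p", program]
    if output is not None:
        args += ["-o", output]
    return args


batch = build_seedtop_args("protein", "active_site", database="proteases", output="pat_out")
print(" ".join(batch))
```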

As the partial result below shows, each target sequence with a match is introduced by a “seqno=” line. This is followed by the pattern, repeated from the pattern file, and by the matching positions in the “HI” field, where the first and last numbers mark the extent of the match. Because the reported positions are zero-based, add 1 to each to obtain the actual coordinates. For the match below, the actual coordinates are 198 and 287.

seqno=254   gi|5031829|ref|NP_005542.1|
ID Serine Protease Motif, cd00190
PA C-[AVLS]-X(3,9)-[DSNAR]-X-[CG]-X-[GSR]-[DE]-[SAPG]-G-[GS]-[PAG]-[LFMV]
HI (197 268) (276 276) (278 278) (280 286)

seqno=257 gi|4504875|ref|NP_002248.1|
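The offset correction described above is easy to get wrong by hand, so here is a small Python sketch of it. It assumes the “HI” line has exactly the layout shown in this output, that is, parenthesized pairs of zero-based positions. The function names are hypothetical.

```python
import re


def hi_to_one_based(hi_line):
    """Convert the zero-based pairs on an HI line to one-based pairs."""
    if not hi_line.startswith("HI"):
        raise ValueError("not an HI line")
    pairs = re.findall(r"\((\d+)\s+(\d+)\)", hi_line)
    if not pairs:
        raise ValueError("no (start end) pairs found")
    # Add 1 to every position to undo the zero-based offset.
    return [(int(start) + 1, int(end) + 1) for start, end in pairs]


def match_span(hi_line):
    """Return the one-based start and end of the whole match."""
    segments = hi_to_one_based(hi_line)
    return segments[0][0], segments[-1][1]


span = match_span("HI (197 268) (276 276) (278 278) (280 286)")
print(span)
```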

Because the database was formatted with the -oT option, we can retrieve the sequence of a matched region using fastacmd, another program from the standalone BLAST package:

fastacmd -d proteases -s 45580723 -L 282,301

which returns the subsequence containing the pattern:

>gi|45580723:282-301 haptoglobin-related protein ...
CVGMSKYQEDTCYGDAGSAF
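As a cross-check that does not depend on seedtop, the pattern can be translated into a regular expression and applied to the retrieved subsequence. The Python sketch below handles only the PROSITE elements used in this article plus the {…} exclusion form: single residues, X, bracketed sets, and (n) or (n,m) repeat counts. The helper prosite_to_regex is hypothetical and is not part of the BLAST package. Seedtop's own matching may differ in details.

```python
import re

ELEMENT = re.compile(
    r"([A-Za-z]|\[[A-Za-z]+\]|\{[A-Za-z]+\})"  # residue, [set], or {excluded set}
    r"(?:\((\d+)(?:,(\d+))?\))?"               # optional (n) or (n,m) repeat
)


def prosite_to_regex(pattern):
    """Translate a dash-separated PROSITE pattern into a Python regex string."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        found = ELEMENT.fullmatch(element)
        if found is None:
            raise ValueError("unsupported pattern element: " + element)
        residue, low, high = found.groups()
        if residue in ("x", "X"):
            token = "."                                # any residue
        elif residue.startswith("{"):
            token = "[^" + residue[1:-1].upper() + "]"  # anything except these
        else:
            token = residue.upper()
        if low is not None:
            token += "{" + low + ("," + high if high else "") + "}"
        parts.append(token)
    return "".join(parts)


motif = "C-[AVLS]-X(3,9)-[DSNAR]-X-[CG]-X-[GSR]-[DE]-[SAPG]-G-[GS]-[PAG]-[LFMV]"
regex = prosite_to_regex(motif)
hit = re.fullmatch(regex, "CVGMSKYQEDTCYGDAGSAF")
print(regex)
print(hit is not None)
```

For the subsequence returned above, the whole 20-residue string matches the pattern, with seven residues filling the X(3,9) gap.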

In addition to protein pattern matches, seedtop can search for patterns in a nucleotide sequence or database. A pattern specified using IUPAC codes must first be converted to PROSITE syntax. To search a single nucleotide sequence for given patterns, use:

seedtop -i input_seq -p patmatch -k pat_file

To search a nucleotide database for given patterns, use:

seedtop -d database -p pattern -k pat_file
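The conversion from IUPAC codes to PROSITE syntax mentioned above can be sketched in Python as follows. Each ambiguity code is expanded to a bracketed set of the bases it stands for, using the standard IUPAC nucleotide table. Writing N as X rather than [ACGT] is a choice made here for brevity, and the function name iupac_to_prosite is hypothetical. Whether seedtop treats both forms identically for nucleotide searches is an assumption that should be checked on a small test case.

```python
# Standard IUPAC nucleotide ambiguity codes and the bases they represent.
IUPAC_TO_SET = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "X",  # any base; written as X here by choice
}


def iupac_to_prosite(sequence):
    """Rewrite an IUPAC nucleotide pattern as a dash-separated PROSITE pattern."""
    elements = []
    for code in sequence.strip().upper():
        if code not in IUPAC_TO_SET:
            raise ValueError("not an IUPAC nucleotide code: " + code)
        elements.append(IUPAC_TO_SET[code])
    if not elements:
        raise ValueError("empty pattern")
    return "-".join(elements)


converted = iupac_to_prosite("CANNTG")
print("PA " + converted)
```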

For questions regarding the BLAST services, please contact the BLAST help desk at:

—TT


NCBI News | Fall/Winter 2002 NCBI News: Spring 2003