Standalone PSI/PHI-BLAST: blastpgp

Tao Tao, Ph.D.
NCBI User Service

TOC

1. Introduction

The program blastpgp performs regular gapped blastp searches. In addition, it can perform iterative searches in "Position-Specific Interative BLAST" (PSI-BLAST) mode or pattern initiated searched in "Pattern-Hit Initialed BLAST" (PHI-BLAST) mode. The result from a PHI-BLAST search can serve as the input to subsequent PSI-BLAST searches. The program is commonly used to search for sequences distantly related to the query that cannot be detected by traditional blastp search. The program can build a position-specific scoring matrix (PSSM) from the search result, for a specific family of related sequences.

1.1. PSI-BLAST

In PSI-BLAST mode, blastpgp first performs a standard blastp search against the target database. The sequences found in the first round are used to build a PSSM, which is then used in the next round of search. As explained in the paper by the Altschul et al [1] and Schäffer et al [2], the traditional BLAST algorithm was implemented using an A A substitution matrix, where A is the alphabet size currently set at 28. PSI-BLAST instead uses a Q A matrix, where Q is the length of the query sequence. At each position, the score for a letter depends on the position with respect to the query and the letter in the subject sequences.

The position-specific matrix for round i + 1 is built from a constrained multiple alignment among the query and the sequences found with sufficiently low e-value, below that set by the inclusion threshold -h, in round i. The top part of the output for each round distinguishes the sequences into "sequences found previously and used in the scoring matrix" and "sequences not used in the scoring matrix". PSI-BLAST search "converges" or stops if all sequences found at round i + 1 below the e-value inclusion threshold were already in the model at the beginning of the round.

Blastpgp can also store and reuse the PSSM through -C and -R parameters, respectively. To reuse a stored checkpoint file, the exact same query should be used. See Section 3.2 for examples and more detailed explanations.

Using the option -B, PSI-BLAST is also capable of importing a "master-slave" style multiple sequence alignment, computed outside of PSI-BLAST, to jump start the search from that alignment. The multiple sequence alignment must include the complete query sequence, which does not need to be the first in the alignment. The multiple alignment file must be specified in a format that is similar to ClustalW minus the headers and trailers, see Section 3.3 below for example.

The -R parameter, restart from a checkpoint, and -B, jump start from a multiple alignment, are mutually exclusive.

1.2 PSI-TBLASTN

PSI-BLASTN is a variant of blastall's tblastn mode and it searches a protein query sequence against a (translated) nucleotide sequence database like tblastn. However, instead of using the default matrix as specified by -M, a PSSM created by a PSI-BLAST run using the same query is used to score the potential matches. The nucleotide sequence database is dynamically translated in all reading frames during PSI-TBLASTN search. PSI-TBLASTN search using a PSSM may help find more distantly related sequences encoded by unannotated nucleotide sequences.

The first step for this type of search is to run PSI-BLAST iterations with the query against a target protein database by capturing the PSSM or checkpoint file output through the -C option.

The same query is then searched with blastall against a nucleotide database in PSI-TBLASTN mode using "-p psitblastn" and the checpoint file (from the first step) read through -R parameter. All other parameter settings applicable to "blastall -p tblastn ..." also apply here for "blastall -p psitblastn ...". Some restrictions to parameters are 1) query must be the same as the one used in blastpgp for PSSM generation, 2) by default, blastpgp uses -F F and blastall uses -F T. To ensure consistent usage of the blastpgp/psitblastn combination, the -F option should be explicitly set to be consistent with one another.

1.3 PHI-BLAST

The PHI-BLAST mode combines matching of regular expressions with local alignments surrounding the match. The incorporation of this function into the BLAST software framework is partly for user convenience and partly so that it can be combined seamlessly with PSI-BLAST.

One very restrictive way to identify protein motifs is by regular expressions that must contain each instance of the motif. The PROSITE database is a compilation of such restricted regular expressions that describe protein motifs.

This search mode takes a motif file and a query sequence containing the motif as inputs throught the -k and -i parameters. Given a protein sequence S and a regular expression pattern P occurring in S, PHI-BLAST helps identify other protein sequences contain "an occurrence of P" and are "similar to S in the vicinity of the P". In another word, PSI-BLAST addresses the question of "What other proteins containing the same motif are similar to the query protein around the motif occurrence?"

PHI-BLAST is preferable to searching pattern occurrences alone because it filters out those cases where the pattern occurrence may be random and are not indicative of homology. PHI-BLAST may be preferable to other flavors of BLAST because it is faster and allows the use of a rigid pattern occurrence requirement.

The pattern search methods in PHI-BLAST are based on the algorithms described by Baeza-Yates and Gonnet [3] and Wu and Manber [4]. The calculation of local alignments is done using a method very similar to gapped BLAST and much of the same code. The method of evaluating statistical significance of the alignment is somewhat different.

The query sequence should be in FASTA format as for all other BLAST searches. The syntax for patterns follows the rules of PROSITE and is documented in Section 1.5 as well as in "Pattern Search with seedtop". An input pattern to PHI-BLAST does not have to be from PROSITE list. PHI-BLAST does only gapped alignment since ungapped alignments contradict almost all patterns in PROSITE.

PHI-BLAST can restrict the local alignment search to a proper subset of the pattern occurrences, through -p seedp, if a specified pattern occurs more than once in the query. The use of the "seedp" option requires the user to specify the location(s) of the occurrence(s) in the pattern file through the HI line. When there are multiple pattern occurrences in the query, it may be important to decide how many are of interest because the E-value for matches is effectively multiplied by the number of interesting pattern occurrences.

PHI-BLAST is integrated with the standard and non-restart usage of PSI-BLAST. In the command-line mode, PSI-BLAST can be invoked by using the -j option, as for standard PSI-BLAST search. When -j value is assigned a value different from the default value of 1, then the results of the single round of PHI-BLAST searching are used to construct a PSSM used in PSI-BLAST. One may continue PSI-BLAST for more than one round.

1.4 PHI-BLAST Statistical Significance

When a query sequence Q matches a database sequence D in PHI-BLAST, it is useful to subdivide Q and D into 3 disjoint pieces:

Qleft - Qpattern - Qright
Dleft - Dpattern - Dright

The substrings Qpattern and Dpattern contain the pattern specified in the pattern file. The pieces Qpattern and Dpattern are aligned and that alignment is displayed as part of the PHI-BLAST output, but the score for that alignment is mostly ignored. The "reduced" score r of an alignment is the sum of the scores obtained by aligning Qleft with Dleft and Qright with Dright.

The expected number of alignments with a reduced score = x is given by:

CN (Lambda*x + 
1)e^(-Lambda *x)
where:
C, Lambda : "constants" depending on the score matrix and the gap costs.
N : (number of occurrences of pattern in database) * (number of occurrences of pattern in Q)
e : the base of the natural logarithm.

It is important to understand that this method of computing the statistical significance of a PHI-BLAST alignment is mathematically different from the method used for BLAST and PSI-BLAST alignments. The E-values from PHI-BLAST are displayed using a similar syntax as those in tranditional blast and PSI-BLAST.

1.5. Pattern Syntax for PHI-BLAST.

The syntax for patterns in PHI-BLAST follows the PROSITE convention. When using the stand-alone program, it is permissible to have multiple patterns in a file. For more details, see Section 3.1 of "Pattern Search with seedtop".

Table 1.5.1 Meanings of the Pattern Symbols
Pattern SymbolsMeaning of the Symbols
A-N,P-T,V-ZValid Protein Code
ACGTXValid Nucleotide Code
[ ]any one of the characters enclosed in the brackets
-Represents nothing. This is a spacer used by PROSITE
X (followed by nothing) means any residue
X(5) Represents 5 positions in which any residue is allowed.
X and 5 can be substituted by other code and number, respectively.
X(2,4) Represents 2 to 4 positions where any residue is allowed. The numbers can be substituted by any other two numbers.
> Can occur only at the end of a pattern and means nothing. It may occur before a period.
. Can be used at the end of the pattern and means nothing

Here is an example record from PROSITE.

ID  CNMP_BINDING_2; PATTERN.
AC  PS00889;
DT  OCT-1993 (CREATED); OCT-1993 (DATA UPDATE); NOV-1995 (INFO UPDATE).
DE  Cyclic nucleotide-binding domain signature 2.
PA  
[LIVMF]-G-E-x-[GAS]-[LIVM]-x(5,11)-R-[STAQ]-A-x-[LIVMA]-x-[STACV].
NR  /RELEASE=32,49340;
NR  /TOTAL=57(36); /POSITIVE=57(36); /UNKNOWN=0(0); /FALSE_POS=0(0);
NR  /FALSE_NEG=1; /PARTIAL=1;
CC  /TAXO-RANGE=??EP?; /MAX-REPEAT=2;
When using the stand-alone program, the pattern should be in a file, with the first line starting with ID followed by a space and the text strings for the pattern name. The actual pattern, immediately below the ID line, should start with PA followed by a space and the text string for pattern description. Optional HI initialed lines can be present to specify the pattern position.

In the previous PROSITE example, the pattern name is "CMMP_BINDING_2; PATTERN." The pattern specified by the PA line is explained in the table below. Other lines, if present, are ignored by PHI-BLAST.

Table 1.5.2 Explanation of Pattern Example CNMP_BINDING_2
Position in PatternPattern SyntaxPattern RepresentationActual Length ¹
1[LIVMF]Any one of L, I, V, M, or F1
2GOne G1
3EOne E1
4XAny one residue1
5[GAS]Any one of G, A, or S1
6[LIVM]Any one of L, I, V, or M1
7x(5,11)5 to 11 any residue5 to 11
8ROne R1
9[STAQ]Any one of S, T, A, or Q1
10AOne A1
11xAny one residue1
12[LIVMA]Any one of L, I, V, M, or A1
13XAny one residue1
14[STACV]Any one of S, T, A, C, or V1
Note: 1 The actual length of matched patterns, found in the database entries, will be between 18 to 24 letters long.

A simplified pattern for PHI-BLAST, with HI lines for pattern position on the query, is give below.

ID  ER_TARGET; PATTERN.
PA  [KRHQSA]-[DENQ]-E-L>.
HI  (19 22)
HI  (201 204)

In this example, the HI lines specify that two pattern occurrences in the input query are of interest, 19 - 22 and 201 - 204. These HI lines are relevant when stand-alone PHI-BLAST is used with the seedp mode, in which the interesting occurrences of the pattern in the sequence need to be specified to inform PHI-BLAST only the specified pattern occurrences should be used to find good alignments. The seedp option is qualitatively more useful than the standard patternp option ONLY when the number of pattern occurrences, K, is greater than 1 AND the user is interested only in matches for J < K occurrences. The use of the HI lines enables the user to specify which occurrences of the pattern should be evaluated.

1.6 Additional Functionality Related to PHI-BLAST

PHI-BLAST takes a sequence pattern and a query containing that pattern as input and searches a sequence database for other sequences containing the same pattern and having a good alignment in the vicinity of the pattern match to the query. This program, however, does not address the following two simpler questions:

Q1: Given a sequence and a list of patterns, which patterns occur in the 
sequence
    and where?
Q2: Given a pattern and a sequence database, which sequences contain the pattern and where?

They are better addressed with seedtop, a program closely related to PHI-BLAST, but they do not fit into the output framework of BLAST because the answers are simple lists without alignments and with no notion of statistical significance.

2. Installation and Setup of blastpgp

No additional setup is needed if the blast package is set up using the steps described in pc_setup.html and unix_setup/html.

3. Practical Usages

We can use blastpgp to perform different searches and address different questions. Here we will present a list of common scenarios and discuss them in details. Each scenario will be in its own subsection with example command lines.

      3.1 Simple PSI-BLAST Search

PSI-BLAST search is generally used in the following situations:

  • The initial blastp search is not informative
  • There is a need to identify the distant members of a protein family
  • To create a model PSSM in one database and use that in another to find better matches

To perform a PSI-BLAST search with my_kinase.txt against refseq_protein database for 6 iterations, and save the results in my_kinase.out, we can use the following command line. Parameters not specified with assume default values. For details, see Appendix.

blastpgp -i my_kinase.txt -d refseq_protein -j 6 -o my_kinase.out

The parameter -j specifies a maximum number of iterations. The run is considered to "converge" and will stop early, if blastpgp finds no new matches with E-value lower than the threshold specified by -h (default -0.002). One can run until guaranteed convergence by specifying -j 0, but this is risky because the running time and output size are essentially unbounded. Some tests [2] suggest that -j 5 or -j 6 yields the best results for typical queries. For queries belonging to a large protein family, we may need to increase the -b setting to make sure the inclusion of most (if not all) of the family members.

3.2 Save Checkpoint File for Use in Other Searches

We can run PSI-BLST to create and save a PSSM file through -C option. In the following command line, the "-u 1" instructs blastpgp to saved the PSSM in text format. This requires the -J T. Since we do not need the alignment output, it is redirected to the Linux /dev/null directory.

blastpgp -i query -d refseq_protein -C chk.asn -j 2 -u 1 -J T > /dev/null

The file chk.asn is called a "checkpoint" based on a standard usage of this term in computer science. The file name used to save the checkpoint can be any unique name and need not have the suffix .asn or any suffix at all. The PSSM saved in chk.asn is the PSSM used to search in the last completed round. It is NOT the PSSM that would be hypothetically constructed from the alignment of the last round of search. Users should be able to check the outputs to know that at least under some conditions the saved PSSM yields acceptable results.

The file written by -C is now in ASN.1 format. Under "-u 1", the file is human readable. To get a truly human readable version of the checkpoint file listing the scores for matches at each position, use the -Q parameter.

The motivation for the -C and -R paramters is to support changing databases, so that the PSSM model built by searching one database and can be reused to search another database. This can be useful either to switch from a protein database to a nucleotide database and/or from a larger database to a smaller database. For example, we can save the PSSM (IL.chk) from a PSI-BLAST search against nr and use it in a psitblastn search against against est_others:

blastpgp -d nr -i interleukin -j 2 -C IL.ckp -J T -u 1 -o IL_nr.out

blastall -i interleukin -d est_others -p psitblastn -R IL.ckp -r 1 -o 
IL_est.out

The first run saves the PSSM that was used in the last round of search against the protein nr. The second run uses this saved PSSM to search a translated version of est_others. The second type of search can be performed on a new genome:

blastpgp -i interleukin -d new_genome -R IL.ckp -p psitblastn -o 
IL.genome.out

This is a more sensitive way to search for matches to interleukin protein in a new, and not yet annotated genome.

Another practical usage of the checkpoint and restart feature is to search a few iterations at a time.

blastpgp -i query -d nr -C query.ckp -u 1 -J T -j 4 -o output.j1-4

blastpgp -i query -d nr -R query.ckp -r 1 -J T -j 3 -o output.j5-7

The above examples will do at most "4 + 3 - 1" or "6" iterations. The "minus one (-1)" is because the checkpoint corresponds to the last matrix used, so the checkpoint saved after the first run is the PSSM derived from the results of iteration 3 and used to search in iteration 4.

If the search did not converge, we can restart the iterations by reading in the saved checkpoint file through -R. This will require that we use the same query, database, and command line.

3.3 Jump Start PSI-BLAST Search Using a Multiple Sequence Alignment

PSI-BLAST can also read in a multiple alignment file to jump start the search. When doing so, the PSSM will be constructed from the imported alignment instead of the previous iteration. The alignment file needs to be in the required format.

Suppose the multiple alignment file has N sequences. It may be presented in one or more, alignment blocks, where each block presents a range of columns from the multiple alignment e.g., the first block might have columns 1-60, the second block might have columns 61-120, the third block might have columns 121-180, etc. Each block should have N rows, one row per sequence. The sequences should be in the same order in every block. Blocks are separated by one or more blank lines.

Within a block there are no blank lines, and each line consists of a sequence identifier followed by some white space, and then by characters or dashes for that sequence in the multiple alignment. In each column, all letters must be in upper case, or all letters must be in lower case. Upper case means that this column is to be given position-specific scores. Lower-case means to use the underlying matrix (specified by -M) for this column, e.g., if the query sequence has an 'i' residue in the column, then the standard scores for matching an 'I' are used in the column.

A sample command line is:

blastpgp -i seq1 -B align1 -j 2 -d nr -C seq1.pssm -J T -o seq1.align

where "-i seq1" specifies the query, "-B align1" provides the jump start alignment file, "-j 2" determines the number iterations, and "-d nr" specifies the target database. The PSSM and alignment are saved in seq1.pssm and seq1.align, respectively.

The example files seq1 and align1 copied below were reformatted from those kindly supplied by Dr. Aravind, as described in one of his paper [6]. Dr. Aravind first helped define how -B should work. Dr. Wolf subsequently helped design a more flexible input format for the alignments. Dr. Schäffer did the actual implementation in C. For use in test blastpgp run, you can download the query (seq1) and the alignment (align1) by right click the link and select "Save link as ..." from the menu.

seq1
----
>26SPS9_Hs
IHAAEEKDWKTAYSYFYEAFEGYDSIDSPKAITSLKYMLLCKIMLNTPEDVQALVSGKLALRYAGRQTEALKCV
AQASKNRSLADFEKALTDYRAELRDDPIISTHLAKLYDNLLEQNLIRVIEPFSRVQIEHISSLIKLSKADVERK
LSQMILDKKFHGILDQGEGVLIIFDEPP

align1
------
26SPS9_Hs     IHAAEEKDWKTAYSYFYEAFEGYdsidspkaitslkymllckimlntpedvqalvsgkla
F57B9_Ce      LHAADEKDFKTAFSYFYEAFEGYdsvdekvsaltalkymllckvmldlpdevnsllsakl
YDL097c_Sc    ILHCEDKDYKTAFSYFFESFESYhnltthnsyekacqvlkymllskimlnliddvkniln
YMJ5_Ce       LYSAEERDYKTSFSYFYEAFEGFasigdkinatsalkymilckimlneteqlagllaake
FUS6_ARATH    KNYIRTRDYCTTTKHIIHMCMNAilvsiemgqfthvtsyvnkaeqnpetlepmvnaklrc
COS41.8_Ci    SLDYKLKTYLTIARLYLEDEDPVqaemyinrasllqnetadeqlqihykvcyarvldyrr
644879        KCYSRARDYCTSAKHVINMCLNVikvsvylqnwshvlsyvskaestpeiaeqrgerdsqt
YPR108w_Sc    IHCLAVRNFKEAAKLLVDSLATFtsieltsyesiatyasvtglftlertdlkskvidspe
eif-3p110_Hs  SKAMKMGDWKTCHSFIINEKMNGkvw----------------------------------
T23D8.4_Ce    SKAMLNGDWKKCQDYIVNDKMNQkvw----------------------------------
YD95_Sp       IYLMSIRNFSGAADLLLDCMSTFsstellpyydvvryavisgaisldrvdvktkivdspe
KIAA0107_Hs   LYCVAIRDFKQAAELFLDTVSTFtsyelmdyktfvtytvyvsmialerpdlrekvikgae
F49C12.8_Hs   LYRMSVRDFAGAADLFLEAVPTFgsyelmtyenlilytvitttfaldrpdlrtkvircne
Int-6_Mm      KFQYECGNYSGAAEYLYFFRVLVpatdrnalsslwgklaseilmqnwdaamedltrlket

26SPS9_Hs     lryagrqtealkcvaqasknrsladfekaltdy---------------------------
F57B9_Ce      alkyngsdldamkaiaaaaqkrslkdfqvafgsf--------------------------
YDL097c_Sc    akytketyqsrgidamkavaeaynnrslldfntalkqy----------------------
YMJ5_Ce       ivayqkspriiairsmadafrkrslkdfvkalaeh-------------------------
FUS6_ARATH    asglahlelkkyklaarkfldvnpelgnsyneviapqdiatygglcalasfdrselkqkv
COS41.8_Ci    kfleaaqrynelsyksaiheteqtkalekalncailapagqqrsrmlatlfkdercqllp
644879        qailtklkcaaglaelaarkykqaakclllasfdhcdfpellspsnvaiygglcalatfd
YPR108w_Sc    llslisttaalqsissltislyasdyasyfpyllety-----------------------
eif-3p110_Hs  ------------------------------------------------------------
T23D8.4_Ce    ------------------------------------------------------------
YD95_Sp       vlavlpqnesmssleacinslylcdysgffrtladve-----------------------
KIAA0107_Hs   ilevlhslpavrqylfslyecrysvffqslavv---------------------------
F49C12.8_Hs   vqeqltggglngtlipvreylesyydchydrffiqlaale--------------------
Int-6_Mm      idnnsvssplqslqqrtwlihwslfvffnhpkgrdniidlflyqpqylnaiqtmcphilr

26SPS9_Hs     ------------------------------------------------------------
F57B9_Ce      ------------------------------------------------------------
YDL097c_Sc    ------------------------------------------------------------
YMJ5_Ce       ------------------------------------------------------------
FUS6_ARATH    idninfrnflelvpdvrelindfyssryascleylasl----------------------
COS41.8_Ci    sfgilekmfldriiksdemeefar------------------------------------
644879        rqelqrnvissssfklflelepqvrdiifkfyeskyasclkmldem--------------
YPR108w_Sc    ------------------------------------------------------------
eif-3p110_Hs  ------------------------------------------------------------
T23D8.4_Ce    ------------------------------------------------------------
YD95_Sp       ------------------------------------------------------------
KIAA0107_Hs   ------------------------------------------------------------
F49C12.8_Hs   ------------------------------------------------------------
Int-6_Mm      ylttavitnkdvrkrrqvlkdlvkviqqesytykdpitefveclyvnfdfdgaqkklrec

26SPS9_Hs     RAELRDDPIISTHLAKLYDNLLEQNLIRVIEPFSRVQIEHISSLIKLSKADVERKLSQMI
F57B9_Ce      PQELQMDPVVRKHFHSLSERMLEKDLCRIIEPYSFVQIEHVAQQIGIDRSKVEKKLSQMI
YDL097c_Sc    EKELMGDELTRSHFNALYDTLLESNLCKIIEPFECVEISHISKIIGLDTQQVEGKLSQMI
YMJ5_Ce       KIELVEDKVVAVHSQNLERNMLEKEISRVIEPYSEIELSYIARVIGMTVPPVERAIARMI
FUS6_ARATH    KSNLLLDIHLHDHVDTLYDQIRKKALIQYTLPFVSVDLSRMADAFKTSVSGLEKELEALI
COS41.8_Ci    QLMPHQKAITADGSNILHRAVTEHNLLSASKLYNNIRFTELGALLEIPHQMAEKVASQMI
644879        KDNLLLDMYLAPHVRTLYTQIRNRALIQYFSPYVSADMHRMAAAFNTTVAALEDELTQLI
YPR108w_Sc    ANVLIPCKYLNRHADFFVREMRRKVYAQLLESYKTLSLKSMASAFGVSVAFLDNDLGKFI
eif-3p110_Hs  DLFPEADKVRTMLVRKIQEESLRTYLFTYSSVYDSISMETLSDMFELDLPTVHSIISKMI
T23D8.4_Ce    NLFHNAETVKGMVVRRIQEESLRTYLLTYSTVYATVSLKKLADLFELSKKDVHSIISKMI
YD95_Sp       VNHLKCDQFLVAHYRYYVREMRRRAYAQLLESYRALSIDSMAASFGVSVDYIDRDLASFI
KIAA0107_Hs   EQEMKKDWLFAPHYRYYVREMRIHAYSQLLESYRSLTLGYMAEAFGVGVEFIDQELSRFI
F49C12.8_Hs   SERFKFDRYLSPHFNYYSRGMRHRAYEQFLTPYKTVRIDMMAKDFGVSRAFIDRELHRLI
Int-6_Mm      ESVLVNDFFLVACLEDFIENARLFIFETFCRIHQCISINMLADKLNMTPEEAERWIVNLI

26SPS9_Hs     LDKKFHGILDQGEGVLIIFDEPP
F57B9_Ce      LDQKLSGSLDQGEGMLIVFEIAV
YDL097c_Sc    LDKIFYGVLDQGNGWLYVYETPN
YMJ5_Ce       LDKKLMGSIDQHGDTVVVYPKAD
FUS6_ARATH    TDNQIQARIDSHNKILYARHADQ
COS41.8_Ci    CESRMKGHIDQIDGIVFFERRET
644879        LEGLISARVDSHSKILYARDVDQ
YPR108w_Sc    PNKQLNCVIDRVNGIVETNRPDN
eif-3p110_Hs  INEELMASLDQPTQTVVMHRTEP
T23D8.4_Ce    IQEELSATLDEPTDCLIMHRVEP
YD95_Sp       PDNKLNCVIDRVNGVVFTNRPDE
KIAA0107_Hs   AAGRLHCKIDKVNEIVETNRPDS
F49C12.8_Hs   ATGQLQCRIDAVNGVIEVNHRDS
Int-6_Mm      RNARLDAKIDSKLGHVVMGNNAV

3.4 PHI-BLAST Search with a Pattern in "patseedp" Mode

A typical PHI-BLAST search may look like this:

blastpgp -i query -k pat -p patseedp -d refseq_protein -o refp_pat.out

here -i specifies the input FASTA query sequence file, -k input is a file describing the pattern and where in the query the pattern is found, -d specifies the target database to search against, -p indicates the mode of search, patseedp in this case, and -o specifies the output file. In this search, the program does not differentiate the multiple occurrences of the pattern.

The pairwise alignment output is essentially the same as blastp with the exception that the pattern occurrence in the alignment will be marked by asterisks. Working example for query and pat files are given below.

pat
---
ID  Cytidine and deoxycytidylate deaminases zinc-binding region signature.
PA  [CH]-[AGV]-E-x(2)-[LIVMFGAT]-[LIVM]-x(17,33)-P-C-x(2,8)-C-x(3)-[LIVM].
HI  (70 76)(100 101)(106 110)

query
-----
>gi|21955158|ref|NP_663745.1| phorbolin 1 [Homo sapiens]
MEASPASGPRHLMDPHIFTSNFNNGIGRHKTYLCYEVERLDNGTSVKMDQHRGFLHNQAKNLLCGFYGRHAEL
RFLDLVPSLQLDPAQIYRVTWFISWSPCFSWGCAGEVRAFLQENTHVRLRIFAARIYDYDPLYKEALQMLRDA
GAQVSIMTYDEFKHCWDTFVDHQGCPFQPWDGLDEHSQALSGRLRAILQNQGN

Sample alignment from PHI-BLAST patseedp search is excerpted below.

Score =  261 bits (688), Expect = 1e-076
Identities = 171/193 (88%), Positives = 178/193 (92%), Gaps = 3/193 (1%)

Query:  10  RHLMDPHIFTSNFNNG---IGRHKTYLCYEVERLDNGTSVKMDQHRGFLHNQAKNLLCGF 
66
            R+LMDP  FT NFNN    + R +TYLCYEVERLDNGT V MDQH GFL N+AKNLLCGF
Sbjct:  190 RYLMDPDTFTFNFNNDPLVLRRRQTYLCYEVERLDNGTWVLMDQHMGFLCNEAKNLLCGF 
249

Query:  67  YGRHAELRFLDLVPSLQLDPAQIYRVTWFISWSPCFSWGCAGEVRAFLQENTHVRLRIFA 
126
pattern 70     *****************************************
            YGRHAELRFLDLVPSLQLDPAQIYRVTWFISWSPCFSWGCAGEVRAFLQENTHVRLRIFA
Sbjct:  250 YGRHAELRFLDLVPSLQLDPAQIYRVTWFISWSPCFSWGCAGEVRAFLQENTHVRLRIFA 
309

3.5 PHI-BLAST Search with Pattern in "seedp" Mode

When the pattern occurs in the query more than once and only a specific occurrence is of interest, PSI-BLAST search should be done in seedp mode with a command line like:

blastpgp -i query2 -k pat2 -p seedp -d refseq_protein ...

Examples for pat2 and query2 are given below. The HI line in the pattern file instructs PHI-BLAST to focus on that specific pattern occurrence during the search. The position is broken into sections demarcated by the ambigous length. To find the positions of the pattern occurrence in a given query, we can use the program seedtop found in the blast binary distribution package. See "Pattern Search with seedtop" for more information.

pat2
----
ID  Lipolytic enzymes "G-D-S-L" family, serine active site  (PATTERN)
PA  G-D-S-[LIVM]-x(1,2)-[TAG]-G.
HI  (152 155) (158 159)

query2
------
>gi|15223790|ref|NP_173441.1| Family II extracellular lipase [A. 
thaliana]
MNPPTPDPSPKPVAPPGPSSKPVAPPGPSPCPSPPPKPQPKPPPAPSPSPCPSPPPKPQPKPVPPPACPPTPPKPQPKPA
PPPEPKPAPPPAPKPVPCPSPPKPPAPTPKPVPPHGPPPKPAPAPTPAPSPKPAPSPPKPENKTIPAVFFFGDSVFDTGN
NNNLETKIKSNYRPYGMDFKFRVATGRFSNGMVASDYLAKYMGVKEIVPAYLDPKIQPNDLLTGVSFASGGAGYNPTTSE
AANAIPMLDQLTYFQDYIEKVNRLVRQEKSQYKLAGLEKTNQLISKGVAIVVGGSNDLIITYFGSGAQRLKNDIDSYTTI
IADSAASFVLQLYGYGARRIGVIGTPPLGCVPSQRLKKKKICNEELNYASQLFNSKLLLILGQLSKTLPNSTFVYMDIYT
IISQMLETPAAYGFEETKKPCCKTGLLSAGALCKKSTSKICPNTSSYLFWDGKLWGYIKKSQTLIDGLQMLVSMFFFGDS
IIDTGNNNNLTTEMKCNFSPYGKDFPLGVATGRFSNGKVVSDYISEYLGVKPIVPAYFDPNVQLEDLLTGVSFASGGSGY
YHLTPKISRVKSMLEQLTYFQRHIARVKRLVGEEKTDQLLAKGLSVVVAGSNDLAITYYGHGAQLLKDDIHYFTSKMANS
AASFVMQLYEYGARQIAVLGTPPLGCVPILRTLKGGLRRECAQDINYASQLFNVKLSNILDQLAKNLPNSNLIYIDIYSA
FSHILENSADYAQTGTFSAVLAFGDSILDTGNNNLLMTVSRGNFLPYGRDFPHRIPTGRFGNGRVLSDLVASGLGVKDLL
PAFRSPFLKNSELATGVCFASGGSGLDKFTASIQGVIWVQDQVSDFQRYLEKLNQQVGDAAKVKEIIANAVILVSAGNND
LAITYFSTPKRQTRYTVQAYTDMLIGWKTTFINSLYDLGARKFAILGTLPLGCLPGARQITGNLICLPNVNYGARVYNDK
VANLVNQYNQRLPNGKFVYIDMYNSLLEVINNPSQYGKTLSNTRKF

Sample alignment output from PHI-BLAST seedp search:

Score =  289 bits (760), Expect = 7e-085
Identities = 158/269 (58%), Positives = 200/269 (74%), Gaps = 13/269 (4%)

Query:  147 AVFFFGDSVFDTGNNNNLETKIKSNYRPYGMDFKFRVATGRFSNGMVASDYLAKYMGVKE 
206
pattern 152      ********
            ++FFFGDS+ DTGNNNNL T++K N+ PYG DF   VATGRFSNG V SDY+++Y+GVK
Sbjct:  473 SMFFFGDSIIDTGNNNNLTTEMKCNFSPYGKDFPLGVATGRFSNGKVVSDYISEYLGVKP 
532

3.6 Integrated PHI-BLAST and PSI-BLAST search

PHI-BLAST is integrated with PSI-BLAST, its search result can be used for subsequent PSI-BLAST iteration directly if the -j option is specified with a value greater than 1. A typical search command line, modified from that in 4.6 looks like this:

blastpgp -i query2 -k pat2 -d refp -p patseedp -j 5 -o phi_5.out -Q pssm

The first search will be PHI-BLAST in patseedp mode. The result from PHI-BLAST is passed on to PSI-BLAST search to run for 4 interations. The alignment and PSSM are saved to phi_5.out and pssm, respectively.

3.7 Miscellaneous Information

Users who develop their own sequence analysis software may wish to develop their own scoring systems. They can adapt codes in the function posit.c, which writes out the checkpoint, to write out scoring systems derived by other algorithms in such a way that PSI-BLAST can read the files in later. The checkpoint structure is generated in a way that it can handle any position-specific matrix that fits in the Karlin-Altschul statistical framework for BLAST scoring.

When seaching through the PSI-BLAST page on the Web, only one pattern is allowed per query. The input pattern has an upper size limit of 100 characters. The PHI-BLAST Web page supports only the "patseedp" option since it does not take the position specification. There the user must explicitly invoke one round of iteration at a time, and the PHI-BLAST page provides the option to initiate a PSI-BLAST round with the PHI-BLAST results.

The PSI-BLAST page is different from the command line blastpgp (with -j setting) in at least two ways. First, each submission on the Web page runs only one iteration, and the user must explicitly approve continuation to the next iteration. Second, the user can explicitly override the -h setting by either excluding matches with lower E-values or explicitly including matches with higher E-values for use in constructing the PSSM.

Standalone PSI-BLAST can also be used to convert multiple sequence alignments from Pfam release into scoremat. To do this, we need to extract out individual alignment blocks from the master file and saved it to a file. Furthermore, we need to extract out the longest sequences from each alignment block and save it to a separate file for use as query. For example, if we saved an alignment block to file A and its longest sequence query.A, we can convert the alignment block into an ASN.1 scoremat using the following command (unix/linux):

blastpgp -i query.A -B A -u 1 -C scoremat.A -d dummy_db >/delv/null

The dummy_db is a formatted protein blast database with only one sequence. We can do this since we are not interested in the alignment result. Rather, we are after the checkpoint scoremat used in judging or scoring the alignment. The collection of -C output for all the entries can be used for generate an RPSBLAST database using formatrpsdb tool.

4. Technical Assistance

For questions and comments on this document and BLAST in general, please send them to:

blast-help@ncbi.nlm.nih.gov
Questions and comments on other NCBI resources should be addressed to:
info@ncbi.nlm.nih.gov

5. Appendix: Parameters and Their Accepted Values

All the program parameters for blastpgp can be printed on the screen using the following command line:
blastpgp - <hit enter key>

Here, we list them out individually in the tables below, each with detailed description and example usage.

Table 5.1
Parameter-p
FunctionSpecifies search mode for PHI-BLAST
Defaultblastpgp
Input Format[String]
ExampleTo run PHI-BLAST, used: -p patseedp
Note
A third option is "seedp", which reads in pattern file with HI lines.

Table 5.2
Parameter-d
FunctionSpecifies target database
Defaultnr
Input Format[String]
ExampleTo run search against refseq_protein, use: -d refseq_protein
Note
Database file name should be provided WITHOUT file extension. Multiple databases should be provided with double quotes, like in -d "refseq_protien pdbaa".

Table 5.3
Parameter-i
FunctionSpecifies input query file
Defaultstdin
Input Format[File In]
ExampleTo search with my_kinase.txt, use: -i my_kinase.txt
Note
User the complete query file name WITH extension. The sequences should be in FASTA format, and the file should be saved as a plain text file.

Table 5.4
Parameter-o
FunctionSaves the oput in the specified file
Defaultstdout
Input Format[String]
ExampleTo save result in my_kinase.out, use: -o my_kinase.out
Note
Users can use redirecttion with < symbol followed by file name.

Table 5.5
Parameter-e
FunctionThe upper limit of Expect value cutoff
Default10.0
Input Format[Real] use either floating point or scientific notation
ExampleTo decrease the e value to 0.1, use: -e 0.1
Note
Accepted formats are 10, 3/4, 0.01, or 2e-34. Smaller values increase the search stringency.

Table 5.6
Parameter-m
Functionalignment view options
Default0
Input Format[Integer]
ExampleTo get result in XML format, use: -m 7
Note
Accepted values and the format they specify

 0 = pairwise
 1 = query-anchored showing identities
 2 = query-anchored no identities
 3 = flat query-anchored, show identities
 4 = flat query-anchored, no identities
 5 = query-anchored no identities/blunt ends
 6 = flat query-anchored, no identities/blunt ends
 7 = XML Blast output
 8 = tabular
 9 = tabular with comment lines
10 = ASN, text
11 = ASN, binary [Integer]

Table 5.7
Parameter-F
FunctionFilters query sequence
DefaultF
Input Format[String]
ExampleTo activate filter, use: -F T
To mask lookup table when filtering, use: -F "m S"
Note
It uses SEG when set to T. For more information see Section 6.4 of "BLAST URLAPI". When -t setting is > 0 (the default is 1), it has the side effect of masking the database sequences. Additional masking of the query sequence under this is likely to be counterproductive.

Table 5.8
Parameter-f
FunctionSets score threshold for extending an initial word match
Default11
Input Format[Integer]
ExampleTo increase the requirement to 12, use: -f 12
Note
Increase the setting makes the search more stringent. This parameter is one of the main reasons that BLAST is "heuristic"; increasing the value of -f saves time but eliminates matches. The default value of -f is set with the assumption that the the default word size -W 3 (Table 5.30) is used. If one instead uses -W 2, then -f should be lowered.

Table 5.9
Parameter-A
FunctionMultiple hit window size
Default40
Input Format[integer]
ExampleTo reduce the window size to 30, use: -A 30
Note
When -P is set to 0 (Table 5.13), blastpgp requires two word hits within the distance specified by -A to initiate alignment extension. A reduction in setting of .-A. saves time at the risk of missing matches.

Table 5.10
Parameter-G
FunctionSpecifies the cost for opening a gap
Default11
Input Format[integer]
ExampleTo increase this cost to 12, use: -G 12
Note
Corresponding -E change may be required. For allowed value when using non-PSSM, see Section 6.11.2 of BLAST URLAPI.

Table 5.11
Parameter-E
FunctionCost to extend a gap
Default1
Input Format[integer]
ExampleTo increase this to 2, use: -E 2
Note
Corresponding -G change may be required.

Table 5.12
Parameter-N
FunctionSets threshold for gapped extension for an ungapped alignment
Default22
Input Format[integer]
ExampleTo increase this to 25, use: -N 25
Note
Similar to X-dropoff for ungapped alignment extension(?). Score threshold measured in bits. Increasing -N saves time, but risks missing matches.

Table 5.13
Parameter-P
FunctionType of hits
Default0
Input Format[integer]
ExampleTo run in single hit mode, use: -P 1
Note
0 for multiple hits; 1 for single hit.

Table 5.14
Parameter-S
FunctionStart of required region in query
Default1
Input Format[Integer]
ExampleTo start in position 100 in the query, use: -S 100
Note
To be used in conjunction with -H.

Table 5.15
Parameter-H
FunctionEnd of required region in query
Default-1
Input Format[integer]
ExampleTo end in position 200 in the query, use: -H 200
Note
This and -S specify the subsequence of the query to use. They function in a manner different from -L in blastall, in that the subsequence has to be in the alignment, but the alignment may be extend beyond the specified region.

Table 5.16
Parameter-a
FunctionNumber of processors to use
Default1
Input Format[integer]
ExampleTo utilize both processors in a two processor machine, use: -a 2
Note
1 up to the number of processors available.

Table 5.17
Parameter-I (upper case i)
FunctionDisplays GI in the deflines for matches from NCBI database
DefaultF
Input Format[T/F]
ExampleTo activate gi display, use: -I T
Note

-I F: ref|NP_000241.1| 
myeloperoxidase [Homo sapiens]

-I T: gi|4557759|ref|NP_000241.1| myeloperoxidase [Homo sapiens]

Table 5.18
Parameter-h
FunctionE-value threshold for inclusion of hits from round i
Default0.002
Input Format[real]
ExampleTo loosen this to 0.005, use: -h 0.005
Note
Hits with expect value lower than the setting will be included in the construction of the PSSM, which will be used in the scoring of hits in round i+1.

Table 5.19
Parameter-c
FunctionConstant in pseudocounts for multipass version
Default9
Input Format[Integer]
ExampleTo change this to 10, use: -c 10
Note
Pseudocounts from a standard probability distribution are combined with the actual letter distribution of the query.

Table 5.20
Parameter-j
FunctionMaximal rounds of iteration
Default1
Input Format[Integer]
ExampleTo search for 10 iterations, use: -j 10
Note
Default is single iteration - regular blastp or PHI-BLAST without subsequent iterations in PSI-BLAST mode. To generate PSSM, setting should be ≥ 2 unless -B or -R is used. Use -j 0 if you want the search to continue till convergence. This may be risky since tests in [2] suggest that -j 5 or -j 6 works best on average.

Table 5.21
Parameter-J
FunctionBelieve the query defline
DefaultF
Input Format[T/F]
ExampleTo believe the query defline, use: -J T
Note
-J T is required when saving SeqAlign object (-O) or scoremat (-C, -Q).

Table 5.22
Parameter-Z
FunctionX dropoff value for final gapped alignment (in bits)
Default25
Input Format[Integer]
ExampleTo decrease this to 20, use: -Z 20
Note
Smaller settings tend to generate shorter alignments because they reduce the flexibility to include large gaps.

Table 5.23
Parameter-O
FunctionSaves SeqAlign Object to file
DefaultOptional
Input Format[File Out]
ExampleTo same it to my_blastobj, use: -O my_blastobj
Note
-J, 'Believe the query defline', must be set to T. The file can be used to regenerate the alignment in various formats without rerun of the search.

Table 5.24
Parameter-M
FunctionMatrix
DefaultBLOSUM62
Input Format[String]
ExampleTo change it to PAM70, use: -M PAM70
Note
This only affects the initial blastp or PHI-BLAST search.

Table 5.25
Parameter-v
FunctionSets the upper limit of descriptions to display in result
Default500
Input Format[Integer]
ExampleTo decrease the descriptions to 100, use: -v 100
Note
Web counterpart: Descriptions.

Table 5.26
Parameter-b
FunctionUpper limit of database sequences to show alignments for
Default250
Input Format[Integer]
ExampleTo increase the alignments saved to 5000, use: -b 500
Note
Setting this parameter to a smaller value may affect the PSSM constructed.

Table 5.27
Parameter-C
FunctionSaves PSI-BLAST checkpointing output file to specified file
DefaultOptional
Input Format[File Out]
ExampleTo save checkpointing file to my_checkpoint, use: -C my_checkpoint
Note
The saved checkpoint file or scoremat can be used to search the same query against another database. To be read in through -R parameter or used in RPSBLAST database construction.

Table 5.28
Parameter-R
FunctionInput File for PSI-BLAST Restart
DefaultOptional
Input Format[File In]
ExampleTo read in checkpoint file my_chechpoint, use: -R my_checkpoint
Note
This is for restart a previously unfinished PSI-BLAST search or use the scoremat to score search against a different database using the same query.

Table 5.29
Parameter-W
FunctionWord size
Default3
Input Format[Integer]
ExampleTo set word size to 2, use: -W 2
Note
Only two options are available. If -W 2 is used, one may need to reduce the -f setting also.

Table 5.30
Parameter-z
FunctionEffective length of the database
Default0
Input Format[Real]
ExampleTo set this to an arbitrary (reference) value of 109, use: -z 1e9
Note
Use zero for the actual database size.

Table 5.31
Parameter-K
FunctionNumber of best hits from a region to keep
Default0
Input Format[Integer]
ExampleTo keep only the best 100 hits, use: -K 100
Note
Custom value not recommended.

Table 5.32
Parameter-s
FunctionCompute locally optimal Smith-Waterman alignments
DefaultF
Input Format[T/F]
ExampleTo do local optimal Smith-Waterman alignments, use: -s T
Note
This may change the alignment and its score. This parameter is not available in the PSI-BLAST page.

Table 5.33
Parameter-Y
FunctionEffective length of the search space
Default0
Input Format[Real]
ExampleTo set this to a reference value of 2000000, use: -Y 2000000
Note
Use zero for the real size.

Table 5.34
Parameter-k
FunctionSpecifies the input pattern for PHI-BLAST search
Defaulthit_file
Input Format[File In]
ExampleTo read in pattern from protease_pat, use: -k protease_pat
Note
This pattern input is used by PHI-BLAST search in patseedp or seedp mode.

Table 5.35
Parameter-T
FunctionProduces HTML output
DefaultF
Input Format[T/F]
ExampleTo generate HTML output, use: -T T
Note
Database entries will be hot-linked to corresponding Entrez records when the result is viewed with a browser. For NCBI databases only.

Table 5.36
Parameter-Q
FunctionGenerates human readable PSSM
DefaultOptional
Input Format[File Out]
ExampleTo save the PSSM to my_pssm, use: -Q my_pssm
Note
The file is an AxQ matrix with log odd score for matches to different residues at each position.

Table 5.37
Parameter-B
FunctionReads in an alignment file to jump start PSI-BLAST search
DefaultOptional
Input Format[File In]
ExampleTo jump start with alignment in my_alignment, use: -B my_alignment
Note
See Section 3.3 for input example and file format.

Table 5.38
Parameter-l (lower case L)
FunctionRestricts search of database to a list of GI's
DefaultOptional
Input Format[String]
ExampleTo restrict the search to GIs in sequences.gi, use: -l sequences.gi
Note
This only applies to NCBI supplied databases formatted with -o T. It enables users to search a virtual database consists of a subset of entries specified by the input file.

Table 5.39
Parameter-U
FunctionUses lower case filtering of query sequence
DefaultF
Input Format[T/F]
ExampleTo activate this filtering function, use: -U T
Note
It enables users to mannually mask specific regions in the query with lower case letters. This will make the search focus on the region of interest left in UPPERCASE.

Table 5.40
Parameter-t
FunctionUses composition-based statistics or compositional matrix adjustment
Default1 (T)
Input Format[String]
ExampleTo inactivate composition-based statistics, use: -t F or -t 0
Note
A value > 0 also has the side-effect of masking the database sequences. At this time, the settings 2 and 3 are supported only if -M is set to the default matrix BLOSUM62. Accepted values and their functions are

0, F, or f: no composition-based statistics
1, T, or t: Composition-based statistics [2]
2: Composition-based score adjustment [5] conditioned on 
sequence properties in round 1
3: Composition-based score adjustment [5] 
unconditionally in round 1

Table 5.41
Parameter-q
FunctionSpecifies the input format of scoremat (checkpoint)
Default0
Input Format[Integer]
ExampleTo read in an ASCII scoremat, use: -q 1
Note
To be used with -R. Accepted values and their functions are

0: no scoremat input
1: Restart is from ASCII scoremat checkpoint file
2: Restart is from binary scoremat checkpoint file

Table 3.42
Parameter-u
FunctionSpecifies the output format of scoremat (checkpoint)
Default0
Input Format[Integer]
ExampleTo generate ASCII scoremat, use: -u 1
Note
Works together with -C. Accepted values and their functions are

0: no scoremat output
1: Output is ASCII scoremat checkpoint file (requires -J)
2: Output is binary scoremat checkpoint file (requires -J)

Table 5.43
Parameter-L
FunctionCost to decline alignment
Default0
Input FormatInteger
ExampleTo set this to 50, use: -L 50
Note
Disabled when set to 0.

6. References

[1] Altschul et al, 1997. Nucleic Acids Res. 25: 3389 - 3402.
[2] Schaffer et al, 2001. Nucleic Acids Res. 29: 2994 - 3005.
[3] Baeza-Yates and Gonnet, 1992. Comm. ACM 35: 74 - 82.
[4] Wu and Manber, 1992. Comm. ACM 35: 83 - 91.
[5] Yu and Altschul, 2005. Bioinformatics 21: 902-911.
[6] Aravind and Ponting, 1998. Protein Science 7: 1250 - 1254.