blastpgp usage

Standalone PSI/PHI-BLAST: blastpgp

Tao Tao, Ph.D.
NCBI User Service

TOC

1. Introduction

3. General Usage
4. Technical Assistance
5. Appednix: Parameters and Accepted Values
6. References

The program blastpgp performs regular gapped blastp searches. In addition, it can perform iterative searches in "Position-Specific Interative BLAST" (PSI-BLAST) mode or pattern initiated searched in "Pattern-Hit Initialed BLAST" (PHI-BLAST) mode. The result from a PHI-BLAST search can serve as the input to subsequent PSI-BLAST searches. The program is commonly used to search for sequences distantly related to the query that cannot be detected by traditional blastp search. The program can build a position-specific scoring matrix (PSSM) from the search result, for a specific family of related sequences.

1.1. PSI-BLAST

In PSI-BLAST mode, blastpgp first performs a standard blastp search against the target database. The sequences found in the first round are used to build a PSSM, which is then used in the next round of search. As explained in the paper by the Altschul et al [1] and Schäffer et al [2], the traditional BLAST algorithm was implemented using an A ╳ A substitution matrix, where A is the alphabet size currently set at 28. PSI-BLAST instead uses a Q ╳ A matrix, where Q is the length of the query sequence. At each position, the score for a letter depends on the position with respect to the query and the letter in the subject sequences.

The position-specific matrix for round i + 1 is built from a constrained multiple alignment among the query and the sequences found with sufficiently low e-value, below that set by the inclusion threshold -h, in round i. The top part of the output for each round distinguishes the sequences into "sequences found previously and used in the scoring matrix" and "sequences not used in the scoring matrix". PSI-BLAST search "converges" or stops if all sequences found at round i + 1 below the e-value inclusion threshold were already in the model at the beginning of the round.

Blastpgp can also store and reuse the PSSM through -C and -R parameters, respectively. To reuse a stored checkpoint file, the exact same query should be used. See Section 3.2 for examples and more detailed explanations.

Using the option -B, PSI-BLAST is also capable of importing a "master-slave" style multiple sequence alignment, computed outside of PSI-BLAST, to jump start the search from that alignment. The multiple sequence alignment must include the complete query sequence, which does not need to be the first in the alignment. The multiple alignment file must be specified in a format that is similar to ClustalW minus the headers and trailers, see Section 3.3 below for example.

The -R parameter, restart from a checkpoint, and -B, jump start from a multiple alignment, are mutually exclusive.

1.2 PSI-TBLASTN

PSI-BLASTN is a variant of blastall's tblastn mode and it searches a protein query sequence against a (translated) nucleotide sequence database like tblastn. However, instead of using the default matrix as specified by -M, a PSSM created by a PSI-BLAST run using the same query is used to score the potential matches. The nucleotide sequence database is dynamically translated in all reading frames during PSI-TBLASTN search. PSI-TBLASTN search using a PSSM may help find more distantly related sequences encoded by unannotated nucleotide sequences.

The first step for this type of search is to run PSI-BLAST iterations with the query against a target protein database by capturing the PSSM or checkpoint file output through the -C option.

The same query is then searched with blastall against a nucleotide database in PSI-TBLASTN mode using "-p psitblastn" and the checpoint file (from the first step) read through -R parameter. All other parameter settings applicable to "blastall -p tblastn ..." also apply here for "blastall -p psitblastn ...". Some restrictions to parameters are 1) query must be the same as the one used in blastpgp for PSSM generation, 2) by default, blastpgp uses -F F and blastall uses -F T. To ensure consistent usage of the blastpgp/psitblastn combination, the -F option should be explicitly set to be consistent with one another.

1.3 PHI-BLAST

The PHI-BLAST mode combines matching of regular expressions with local alignments surrounding the match. The incorporation of this function into the BLAST software framework is partly for user convenience and partly so that it can be combined seamlessly with PSI-BLAST.

One very restrictive way to identify protein motifs is by regular expressions that must contain each instance of the motif. The PROSITE database is a compilation of such restricted regular expressions that describe protein motifs.

This search mode takes a motif file and a query sequence containing the motif as inputs throught the -k and -i parameters. Given a protein sequence S and a regular expression pattern P occurring in S, PHI-BLAST helps identify other protein sequences contain "an occurrence of P" and are "similar to S in the vicinity of the P". In another word, PSI-BLAST addresses the question of "What other proteins containing the same motif are similar to the query protein around the motif occurrence?"

PHI-BLAST is preferable to searching pattern occurrences alone because it filters out those cases where the pattern occurrence may be random and are not indicative of homology. PHI-BLAST may be preferable to other flavors of BLAST because it is faster and allows the use of a rigid pattern occurrence requirement.

The pattern search methods in PHI-BLAST are based on the algorithms described by Baeza-Yates and Gonnet [3] and Wu and Manber [4]. The calculation of local alignments is done using a method very similar to gapped BLAST and much of the same code. The method of evaluating statistical significance of the alignment is somewhat different.

The query sequence should be in FASTA format as for all other BLAST searches. The syntax for patterns follows the rules of PROSITE and is documented in Section 1.5 as well as in "Pattern Search with seedtop". An input pattern to PHI-BLAST does not have to be from PROSITE list. PHI-BLAST does only gapped alignment since ungapped alignments contradict almost all patterns in PROSITE.

PHI-BLAST can restrict the local alignment search to a proper subset of the pattern occurrences, through -p seedp, if a specified pattern occurs more than once in the query. The use of the "seedp" option requires the user to specify the location(s) of the occurrence(s) in the pattern file through the HI line. When there are multiple pattern occurrences in the query, it may be important to decide how many are of interest because the E-value for matches is effectively multiplied by the number of interesting pattern occurrences.

PHI-BLAST is integrated with the standard and non-restart usage of PSI-BLAST. In the command-line mode, PSI-BLAST can be invoked by using the -j option, as for standard PSI-BLAST search. When -j value is assigned a value different from the default value of 1, then the results of the single round of PHI-BLAST searching are used to construct a PSSM used in PSI-BLAST. One may continue PSI-BLAST for more than one round.

1.4 PHI-BLAST Statistical Significance

When a query sequence Q matches a database sequence D in PHI-BLAST, it is useful to subdivide Q and D into 3 disjoint pieces:

Qleft - Qpattern - Qright
Dleft - Dpattern - Dright

The substrings Qpattern and Dpattern contain the pattern specified in the pattern file. The pieces Qpattern and Dpattern are aligned and that alignment is displayed as part of the PHI-BLAST output, but the score for that alignment is mostly ignored. The "reduced" score r of an alignment is the sum of the scores obtained by aligning Qleft with Dleft and Qright with Dright.

The expected number of alignments with a reduced score = x is given by:

CN (Lambda*x + 
1)e^(-Lambda *x)

where:
C, Lambda : "constants" depending on the score matrix and the gap costs.
N : (number of occurrences of pattern in database) * (number of occurrences of pattern in Q)
e : the base of the natural logarithm.

It is important to understand that this method of computing the statistical significance of a PHI-BLAST alignment is mathematically different from the method used for BLAST and PSI-BLAST alignments. The E-values from PHI-BLAST are displayed using a similar syntax as those in tranditional blast and PSI-BLAST.

1.5. Pattern Syntax for PHI-BLAST.

The syntax for patterns in PHI-BLAST follows the PROSITE convention. When using the stand-alone program, it is permissible to have multiple patterns in a file. For more details, see Section 3.1 of "Pattern Search with seedtop".

Table 1.5.1 Meanings of the Pattern Symbols

Pattern Symbols Meaning of the Symbols

A-N,P-T,V-Z Valid Protein Code

ACGTX Valid Nucleotide Code

[ ] any one of the characters enclosed in the brackets

- Represents nothing. This is a spacer used by PROSITE

X (followed by nothing) means any residue

X(5) Represents 5 positions in which any residue is allowed.
X and 5 can be substituted by other code and number, respectively.

X(2,4) Represents 2 to 4 positions where any residue is allowed. The numbers can be substituted by any other two numbers.

> Can occur only at the end of a pattern and means nothing. It may occur before a period.

. Can be used at the end of the pattern and means nothing

Here is an example record from PROSITE.

ID  CNMP_BINDING_2; PATTERN.
AC  PS00889;
DT  OCT-1993 (CREATED); OCT-1993 (DATA UPDATE); NOV-1995 (INFO UPDATE).
DE  Cyclic nucleotide-binding domain signature 2.
PA  
[LIVMF]-G-E-x-[GAS]-[LIVM]-x(5,11)-R-[STAQ]-A-x-[LIVMA]-x-[STACV].
NR  /RELEASE=32,49340;
NR  /TOTAL=57(36); /POSITIVE=57(36); /UNKNOWN=0(0); /FALSE_POS=0(0);
NR  /FALSE_NEG=1; /PARTIAL=1;
CC  /TAXO-RANGE=??EP?; /MAX-REPEAT=2;

When using the stand-alone program, the pattern should be in a file, with the first line starting with ID followed by a space and the text strings for the pattern name. The actual pattern, immediately below the ID line, should start with PA followed by a space and the text string for pattern description. Optional HI initialed lines can be present to specify the pattern position.

In the previous PROSITE example, the pattern name is "CMMP_BINDING_2; PATTERN." The pattern specified by the PA line is explained in the table below. Other lines, if present, are ignored by PHI-BLAST.

Table 1.5.2 Explanation of Pattern Example CNMP_BINDING_2

Position in Pattern Pattern Syntax Pattern Representation Actual Length ¹

1 [LIVMF] Any one of L, I, V, M, or F 1

2 G One G 1

3 E One E 1

4 X Any one residue 1

5 [GAS] Any one of G, A, or S 1

6 [LIVM] Any one of L, I, V, or M 1

7 x(5,11) 5 to 11 any residue 5 to 11

8 R One R 1

9 [STAQ] Any one of S, T, A, or Q 1

10 A One A 1

11 x Any one residue 1

12 [LIVMA] Any one of L, I, V, M, or A 1

13 X Any one residue 1

14 [STACV] Any one of S, T, A, C, or V 1

Note: ¹ The actual length of matched patterns, found in the database entries, will be between 18 to 24 letters long.

A simplified pattern for PHI-BLAST, with HI lines for pattern position on the query, is give below.

ID  ER_TARGET; PATTERN.
PA  [KRHQSA]-[DENQ]-E-L>.
HI  (19 22)
HI  (201 204)

In this example, the HI lines specify that two pattern occurrences in the input query are of interest, 19 - 22 and 201 - 204. These HI lines are relevant when stand-alone PHI-BLAST is used with the seedp mode, in which the interesting occurrences of the pattern in the sequence need to be specified to inform PHI-BLAST only the specified pattern occurrences should be used to find good alignments. The seedp option is qualitatively more useful than the standard patternp option ONLY when the number of pattern occurrences, K, is greater than 1 AND the user is interested only in matches for J < K occurrences. The use of the HI lines enables the user to specify which occurrences of the pattern should be evaluated.

1.6 Additional Functionality Related to PHI-BLAST

PHI-BLAST takes a sequence pattern and a query containing that pattern as input and searches a sequence database for other sequences containing the same pattern and having a good alignment in the vicinity of the pattern match to the query. This program, however, does not address the following two simpler questions:

Q1: Given a sequence and a list of patterns, which patterns occur in the sequence and where?
Q2: Given a pattern and a sequence database, which sequences contain the pattern and where?

They are better addressed with seedtop, a program closely related to PHI-BLAST, but they do not fit into the output framework of BLAST because the answers are simple lists without alignments and with no notion of statistical significance.

2. Installation and Setup of blastpgp

No additional setup is needed if the blast package is set up using the steps described in pc_setup.html and unix_setup/html.

3. Practical Usages

We can use blastpgp to perform different searches and address different questions. Here we will present a list of common scenarios and discuss them in details. Each scenario will be in its own subsection with example command lines.

3.1 Simple PSI-BLAST Search

PSI-BLAST search is generally used in the following situations:

The initial blastp search is not informative
There is a need to identify the distant members of a protein family
To create a model PSSM in one database and use that in another to find better matches

To perform a PSI-BLAST search with my_kinase.txt against refseq_protein database for 6 iterations, and save the results in my_kinase.out, we can use the following command line. Parameters not specified with assume default values. For details, see Appendix.

blastpgp -i my_kinase.txt -d refseq_protein -j 6 -o my_kinase.out

The parameter -j specifies a maximum number of iterations. The run is considered to "converge" and will stop early, if blastpgp finds no new matches with E-value lower than the threshold specified by -h (default -0.002). One can run until guaranteed convergence by specifying -j 0, but this is risky because the running time and output size are essentially unbounded. Some tests [2] suggest that -j 5 or -j 6 yields the best results for typical queries. For queries belonging to a large protein family, we may need to increase the -b setting to make sure the inclusion of most (if not all) of the family members.

3.2 Save Checkpoint File for Use in Other Searches

We can run PSI-BLST to create and save a PSSM file through -C option. In the following command line, the "-u 1" instructs blastpgp to saved the PSSM in text format. This requires the -J T. Since we do not need the alignment output, it is redirected to the Linux /dev/null directory.

blastpgp -i query -d refseq_protein -C chk.asn -j 2 -u 1 -J T > /dev/null

The file chk.asn is called a "checkpoint" based on a standard usage of this term in computer science. The file name used to save the checkpoint can be any unique name and need not have the suffix .asn or any suffix at all. The PSSM saved in chk.asn is the PSSM used to search in the last completed round. It is NOT the PSSM that would be hypothetically constructed from the alignment of the last round of search. Users should be able to check the outputs to know that at least under some conditions the saved PSSM yields acceptable results.

The file written by -C is now in ASN.1 format. Under "-u 1", the file is human readable. To get a truly human readable version of the checkpoint file listing the scores for matches at each position, use the -Q parameter.

The motivation for the -C and -R paramters is to support changing databases, so that the PSSM model built by searching one database and can be reused to search another database. This can be useful either to switch from a protein database to a nucleotide database and/or from a larger database to a smaller database. For example, we can save the PSSM (IL.chk) from a PSI-BLAST search against nr and use it in a psitblastn search against against est_others:

blastpgp -d nr -i interleukin -j 2 -C IL.ckp -J T -u 1 -o IL_nr.out blastall -i interleukin -d est_others -p psitblastn -R IL.ckp -r 1 -o IL_est.out

The first run saves the PSSM that was used in the last round of search against the protein nr. The second run uses this saved PSSM to search a translated version of est_others. The second type of search can be performed on a new genome:

blastpgp -i interleukin -d new_genome -R IL.ckp -p psitblastn -o IL.genome.out

This is a more sensitive way to search for matches to interleukin protein in a new, and not yet annotated genome.

Another practical usage of the checkpoint and restart feature is to search a few iterations at a time.

blastpgp -i query -d nr -C query.ckp -u 1 -J T -j 4 -o output.j1-4 blastpgp -i query -d nr -R query.ckp -r 1 -J T -j 3 -o output.j5-7

The above examples will do at most "4 + 3 - 1" or "6" iterations. The "minus one (-1)" is because the checkpoint corresponds to the last matrix used, so the checkpoint saved after the first run is the PSSM derived from the results of iteration 3 and used to search in iteration 4.

If the search did not converge, we can restart the iterations by reading in the saved checkpoint file through -R. This will require that we use the same query, database, and command line.

3.3 Jump Start PSI-BLAST Search Using a Multiple Sequence Alignment

PSI-BLAST can also read in a multiple alignment file to jump start the search. When doing so, the PSSM will be constructed from the imported alignment instead of the previous iteration. The alignment file needs to be in the required format.

Suppose the multiple alignment file has N sequences. It may be presented in one or more, alignment blocks, where each block presents a range of columns from the multiple alignment e.g., the first block might have columns 1-60, the second block might have columns 61-120, the third block might have columns 121-180, etc. Each block should have N rows, one row per sequence. The sequences should be in the same order in every block. Blocks are separated by one or more blank lines.

Within a block there are no blank lines, and each line consists of a sequence identifier followed by some white space, and then by characters or dashes for that sequence in the multiple alignment. In each column, all letters must be in upper case, or all letters must be in lower case. Upper case means that this column is to be given position-specific scores. Lower-case means to use the underlying matrix (specified by -M) for this column, e.g., if the query sequence has an 'i' residue in the column, then the standard scores for matching an 'I' are used in the column.

A sample command line is:

blastpgp -i seq1 -B align1 -j 2 -d nr -C seq1.pssm -J T -o seq1.align

where "-i seq1" specifies the query, "-B align1" provides the jump start alignment file, "-j 2" determines the number iterations, and "-d nr" specifies the target database. The PSSM and alignment are saved in seq1.pssm and seq1.align, respectively.

The example files seq1 and align1 copied below were reformatted from those kindly supplied by Dr. Aravind, as described in one of his paper [6]. Dr. Aravind first helped define how -B should work. Dr. Wolf subsequently helped design a more flexible input format for the alignments. Dr. Schäffer did the actual implementation in C. For use in test blastpgp run, you can download the query (seq1) and the alignment (align1) by right click the link and select "Save link as ..." from the menu.

seq1 ---- >26SPS9_Hs IHAAEEKDWKTAYSYFYEAFEGYDSIDSPKAITSLKYMLLCKIMLNTPEDVQALVSGKLALRYAGRQTEALKCV AQASKNRSLADFEKALTDYRAELRDDPIISTHLAKLYDNLLEQNLIRVIEPFSRVQIEHISSLIKLSKADVERK LSQMILDKKFHGILDQGEGVLIIFDEPP align1 ------ 26SPS9_Hs IHAAEEKDWKTAYSYFYEAFEGYdsidspkaitslkymllckimlntpedvqalvsgkla F57B9_Ce LHAADEKDFKTAFSYFYEAFEGYdsvdekvsaltalkymllckvmldlpdevnsllsakl YDL097c_Sc ILHCEDKDYKTAFSYFFESFESYhnltthnsyekacqvlkymllskimlnliddvkniln YMJ5_Ce LYSAEERDYKTSFSYFYEAFEGFasigdkinatsalkymilckimlneteqlagllaake FUS6_ARATH KNYIRTRDYCTTTKHIIHMCMNAilvsiemgqfthvtsyvnkaeqnpetlepmvnaklrc COS41.8_Ci SLDYKLKTYLTIARLYLEDEDPVqaemyinrasllqnetadeqlqihykvcyarvldyrr 644879 KCYSRARDYCTSAKHVINMCLNVikvsvylqnwshvlsyvskaestpeiaeqrgerdsqt YPR108w_Sc IHCLAVRNFKEAAKLLVDSLATFtsieltsyesiatyasvtglftlertdlkskvidspe eif-3p110_Hs SKAMKMGDWKTCHSFIINEKMNGkvw---------------------------------- T23D8.4_Ce SKAMLNGDWKKCQDYIVNDKMNQkvw---------------------------------- YD95_Sp IYLMSIRNFSGAADLLLDCMSTFsstellpyydvvryavisgaisldrvdvktkivdspe KIAA0107_Hs LYCVAIRDFKQAAELFLDTVSTFtsyelmdyktfvtytvyvsmialerpdlrekvikgae F49C12.8_Hs LYRMSVRDFAGAADLFLEAVPTFgsyelmtyenlilytvitttfaldrpdlrtkvircne Int-6_Mm KFQYECGNYSGAAEYLYFFRVLVpatdrnalsslwgklaseilmqnwdaamedltrlket 26SPS9_Hs lryagrqtealkcvaqasknrsladfekaltdy--------------------------- F57B9_Ce alkyngsdldamkaiaaaaqkrslkdfqvafgsf-------------------------- YDL097c_Sc akytketyqsrgidamkavaeaynnrslldfntalkqy---------------------- YMJ5_Ce ivayqkspriiairsmadafrkrslkdfvkalaeh------------------------- FUS6_ARATH asglahlelkkyklaarkfldvnpelgnsyneviapqdiatygglcalasfdrselkqkv COS41.8_Ci kfleaaqrynelsyksaiheteqtkalekalncailapagqqrsrmlatlfkdercqllp 644879 qailtklkcaaglaelaarkykqaakclllasfdhcdfpellspsnvaiygglcalatfd YPR108w_Sc llslisttaalqsissltislyasdyasyfpyllety----------------------- eif-3p110_Hs ------------------------------------------------------------ T23D8.4_Ce ------------------------------------------------------------ YD95_Sp vlavlpqnesmssleacinslylcdysgffrtladve----------------------- KIAA0107_Hs ilevlhslpavrqylfslyecrysvffqslavv--------------------------- F49C12.8_Hs vqeqltggglngtlipvreylesyydchydrffiqlaale-------------------- Int-6_Mm idnnsvssplqslqqrtwlihwslfvffnhpkgrdniidlflyqpqylnaiqtmcphilr 26SPS9_Hs ------------------------------------------------------------ F57B9_Ce ------------------------------------------------------------ YDL097c_Sc ------------------------------------------------------------ YMJ5_Ce ------------------------------------------------------------ FUS6_ARATH idninfrnflelvpdvrelindfyssryascleylasl---------------------- COS41.8_Ci sfgilekmfldriiksdemeefar------------------------------------ 644879 rqelqrnvissssfklflelepqvrdiifkfyeskyasclkmldem-------------- YPR108w_Sc ------------------------------------------------------------ eif-3p110_Hs ------------------------------------------------------------ T23D8.4_Ce ------------------------------------------------------------ YD95_Sp ------------------------------------------------------------ KIAA0107_Hs ------------------------------------------------------------ F49C12.8_Hs ------------------------------------------------------------ Int-6_Mm ylttavitnkdvrkrrqvlkdlvkviqqesytykdpitefveclyvnfdfdgaqkklrec 26SPS9_Hs RAELRDDPIISTHLAKLYDNLLEQNLIRVIEPFSRVQIEHISSLIKLSKADVERKLSQMI F57B9_Ce PQELQMDPVVRKHFHSLSERMLEKDLCRIIEPYSFVQIEHVAQQIGIDRSKVEKKLSQMI YDL097c_Sc EKELMGDELTRSHFNALYDTLLESNLCKIIEPFECVEISHISKIIGLDTQQVEGKLSQMI YMJ5_Ce KIELVEDKVVAVHSQNLERNMLEKEISRVIEPYSEIELSYIARVIGMTVPPVERAIARMI FUS6_ARATH KSNLLLDIHLHDHVDTLYDQIRKKALIQYTLPFVSVDLSRMADAFKTSVSGLEKELEALI COS41.8_Ci QLMPHQKAITADGSNILHRAVTEHNLLSASKLYNNIRFTELGALLEIPHQMAEKVASQMI 644879 KDNLLLDMYLAPHVRTLYTQIRNRALIQYFSPYVSADMHRMAAAFNTTVAALEDELTQLI YPR108w_Sc ANVLIPCKYLNRHADFFVREMRRKVYAQLLESYKTLSLKSMASAFGVSVAFLDNDLGKFI eif-3p110_Hs DLFPEADKVRTMLVRKIQEESLRTYLFTYSSVYDSISMETLSDMFELDLPTVHSIISKMI T23D8.4_Ce NLFHNAETVKGMVVRRIQEESLRTYLLTYSTVYATVSLKKLADLFELSKKDVHSIISKMI YD95_Sp VNHLKCDQFLVAHYRYYVREMRRRAYAQLLESYRALSIDSMAASFGVSVDYIDRDLASFI KIAA0107_Hs EQEMKKDWLFAPHYRYYVREMRIHAYSQLLESYRSLTLGYMAEAFGVGVEFIDQELSRFI F49C12.8_Hs SERFKFDRYLSPHFNYYSRGMRHRAYEQFLTPYKTVRIDMMAKDFGVSRAFIDRELHRLI Int-6_Mm ESVLVNDFFLVACLEDFIENARLFIFETFCRIHQCISINMLADKLNMTPEEAERWIVNLI 26SPS9_Hs LDKKFHGILDQGEGVLIIFDEPP F57B9_Ce LDQKLSGSLDQGEGMLIVFEIAV YDL097c_Sc LDKIFYGVLDQGNGWLYVYETPN YMJ5_Ce LDKKLMGSIDQHGDTVVVYPKAD FUS6_ARATH TDNQIQARIDSHNKILYARHADQ COS41.8_Ci CESRMKGHIDQIDGIVFFERRET 644879 LEGLISARVDSHSKILYARDVDQ YPR108w_Sc PNKQLNCVIDRVNGIVETNRPDN eif-3p110_Hs INEELMASLDQPTQTVVMHRTEP T23D8.4_Ce IQEELSATLDEPTDCLIMHRVEP YD95_Sp PDNKLNCVIDRVNGVVFTNRPDE KIAA0107_Hs AAGRLHCKIDKVNEIVETNRPDS F49C12.8_Hs ATGQLQCRIDAVNGVIEVNHRDS Int-6_Mm RNARLDAKIDSKLGHVVMGNNAV

3.4 PHI-BLAST Search with a Pattern in "patseedp" Mode

A typical PHI-BLAST search may look like this:

blastpgp -i query -k pat -p patseedp -d refseq_protein -o refp_pat.out

here -i specifies the input FASTA query sequence file, -k input is a file describing the pattern and where in the query the pattern is found, -d specifies the target database to search against, -p indicates the mode of search, patseedp in this case, and -o specifies the output file. In this search, the program does not differentiate the multiple occurrences of the pattern.

The pairwise alignment output is essentially the same as blastp with the exception that the pattern occurrence in the alignment will be marked by asterisks. Working example for query and pat files are given below.

pat --- ID Cytidine and deoxycytidylate deaminases zinc-binding region signature. PA [CH]-[AGV]-E-x(2)-[LIVMFGAT]-[LIVM]-x(17,33)-P-C-x(2,8)-C-x(3)-[LIVM]. HI (70 76)(100 101)(106 110) query ----- >gi|21955158|ref|NP_663745.1| phorbolin 1 [Homo sapiens] MEASPASGPRHLMDPHIFTSNFNNGIGRHKTYLCYEVERLDNGTSVKMDQHRGFLHNQAKNLLCGFYGRHAEL RFLDLVPSLQLDPAQIYRVTWFISWSPCFSWGCAGEVRAFLQENTHVRLRIFAARIYDYDPLYKEALQMLRDA GAQVSIMTYDEFKHCWDTFVDHQGCPFQPWDGLDEHSQALSGRLRAILQNQGN

Sample alignment from PHI-BLAST patseedp search is excerpted below.

Score = 261 bits (688), Expect = 1e-076 Identities = 171/193 (88%), Positives = 178/193 (92%), Gaps = 3/193 (1%) Query: 10 RHLMDPHIFTSNFNNG---IGRHKTYLCYEVERLDNGTSVKMDQHRGFLHNQAKNLLCGF 66 R+LMDP FT NFNN + R +TYLCYEVERLDNGT V MDQH GFL N+AKNLLCGF Sbjct: 190 RYLMDPDTFTFNFNNDPLVLRRRQTYLCYEVERLDNGTWVLMDQHMGFLCNEAKNLLCGF 249 Query: 67 YGRHAELRFLDLVPSLQLDPAQIYRVTWFISWSPCFSWGCAGEVRAFLQENTHVRLRIFA 126 pattern 70 ***************************************** YGRHAELRFLDLVPSLQLDPAQIYRVTWFISWSPCFSWGCAGEVRAFLQENTHVRLRIFA Sbjct: 250 YGRHAELRFLDLVPSLQLDPAQIYRVTWFISWSPCFSWGCAGEVRAFLQENTHVRLRIFA 309

3.5 PHI-BLAST Search with Pattern in "seedp" Mode

When the pattern occurs in the query more than once and only a specific occurrence is of interest, PSI-BLAST search should be done in seedp mode with a command line like:

blastpgp -i query2 -k pat2 -p seedp -d refseq_protein ...

Examples for pat2 and query2 are given below. The HI line in the pattern file instructs PHI-BLAST to focus on that specific pattern occurrence during the search. The position is broken into sections demarcated by the ambigous length. To find the positions of the pattern occurrence in a given query, we can use the program seedtop found in the blast binary distribution package. See "Pattern Search with seedtop" for more information.

pat2 ---- ID Lipolytic enzymes "G-D-S-L" family, serine active site (PATTERN) PA G-D-S-[LIVM]-x(1,2)-[TAG]-G. HI (152 155) (158 159) query2 ------ >gi|15223790|ref|NP_173441.1| Family II extracellular lipase [A. thaliana] MNPPTPDPSPKPVAPPGPSSKPVAPPGPSPCPSPPPKPQPKPPPAPSPSPCPSPPPKPQPKPVPPPACPPTPPKPQPKPA PPPEPKPAPPPAPKPVPCPSPPKPPAPTPKPVPPHGPPPKPAPAPTPAPSPKPAPSPPKPENKTIPAVFFFGDSVFDTGN NNNLETKIKSNYRPYGMDFKFRVATGRFSNGMVASDYLAKYMGVKEIVPAYLDPKIQPNDLLTGVSFASGGAGYNPTTSE AANAIPMLDQLTYFQDYIEKVNRLVRQEKSQYKLAGLEKTNQLISKGVAIVVGGSNDLIITYFGSGAQRLKNDIDSYTTI IADSAASFVLQLYGYGARRIGVIGTPPLGCVPSQRLKKKKICNEELNYASQLFNSKLLLILGQLSKTLPNSTFVYMDIYT IISQMLETPAAYGFEETKKPCCKTGLLSAGALCKKSTSKICPNTSSYLFWDGKLWGYIKKSQTLIDGLQMLVSMFFFGDS IIDTGNNNNLTTEMKCNFSPYGKDFPLGVATGRFSNGKVVSDYISEYLGVKPIVPAYFDPNVQLEDLLTGVSFASGGSGY YHLTPKISRVKSMLEQLTYFQRHIARVKRLVGEEKTDQLLAKGLSVVVAGSNDLAITYYGHGAQLLKDDIHYFTSKMANS AASFVMQLYEYGARQIAVLGTPPLGCVPILRTLKGGLRRECAQDINYASQLFNVKLSNILDQLAKNLPNSNLIYIDIYSA FSHILENSADYAQTGTFSAVLAFGDSILDTGNNNLLMTVSRGNFLPYGRDFPHRIPTGRFGNGRVLSDLVASGLGVKDLL PAFRSPFLKNSELATGVCFASGGSGLDKFTASIQGVIWVQDQVSDFQRYLEKLNQQVGDAAKVKEIIANAVILVSAGNND LAITYFSTPKRQTRYTVQAYTDMLIGWKTTFINSLYDLGARKFAILGTLPLGCLPGARQITGNLICLPNVNYGARVYNDK VANLVNQYNQRLPNGKFVYIDMYNSLLEVINNPSQYGKTLSNTRKF

Sample alignment output from PHI-BLAST seedp search:

Score = 289 bits (760), Expect = 7e-085 Identities = 158/269 (58%), Positives = 200/269 (74%), Gaps = 13/269 (4%) Query: 147 AVFFFGDSVFDTGNNNNLETKIKSNYRPYGMDFKFRVATGRFSNGMVASDYLAKYMGVKE 206 pattern 152 ******** ++FFFGDS+ DTGNNNNL T++K N+ PYG DF VATGRFSNG V SDY+++Y+GVK Sbjct: 473 SMFFFGDSIIDTGNNNNLTTEMKCNFSPYGKDFPLGVATGRFSNGKVVSDYISEYLGVKP 532

3.6 Integrated PHI-BLAST and PSI-BLAST search

PHI-BLAST is integrated with PSI-BLAST, its search result can be used for subsequent PSI-BLAST iteration directly if the -j option is specified with a value greater than 1. A typical search command line, modified from that in 4.6 looks like this:

blastpgp -i query2 -k pat2 -d refp -p patseedp -j 5 -o phi_5.out -Q pssm

The first search will be PHI-BLAST in patseedp mode. The result from PHI-BLAST is passed on to PSI-BLAST search to run for 4 interations. The alignment and PSSM are saved to phi_5.out and pssm, respectively.

3.7 Miscellaneous Information

Users who develop their own sequence analysis software may wish to develop their own scoring systems. They can adapt codes in the function posit.c, which writes out the checkpoint, to write out scoring systems derived by other algorithms in such a way that PSI-BLAST can read the files in later. The checkpoint structure is generated in a way that it can handle any position-specific matrix that fits in the Karlin-Altschul statistical framework for BLAST scoring.

When seaching through the PSI-BLAST page on the Web, only one pattern is allowed per query. The input pattern has an upper size limit of 100 characters. The PHI-BLAST Web page supports only the "patseedp" option since it does not take the position specification. There the user must explicitly invoke one round of iteration at a time, and the PHI-BLAST page provides the option to initiate a PSI-BLAST round with the PHI-BLAST results.

The PSI-BLAST page is different from the command line blastpgp (with -j setting) in at least two ways. First, each submission on the Web page runs only one iteration, and the user must explicitly approve continuation to the next iteration. Second, the user can explicitly override the -h setting by either excluding matches with lower E-values or explicitly including matches with higher E-values for use in constructing the PSSM.

Standalone PSI-BLAST can also be used to convert multiple sequence alignments from Pfam release into scoremat. To do this, we need to extract out individual alignment blocks from the master file and saved it to a file. Furthermore, we need to extract out the longest sequences from each alignment block and save it to a separate file for use as query. For example, if we saved an alignment block to file A and its longest sequence query.A, we can convert the alignment block into an ASN.1 scoremat using the following command (unix/linux):

blastpgp -i query.A -B A -u 1 -C scoremat.A -d dummy_db >/delv/null

The dummy_db is a formatted protein blast database with only one sequence. We can do this since we are not interested in the alignment result. Rather, we are after the checkpoint scoremat used in judging or scoring the alignment. The collection of -C output for all the entries can be used for generate an RPSBLAST database using formatrpsdb tool.

4. Technical Assistance

For questions and comments on this document and BLAST in general, please send them to:

blast-help@ncbi.nlm.nih.gov

Questions and comments on other NCBI resources should be addressed to:

info@ncbi.nlm.nih.gov

5. Appendix: Parameters and Their Accepted Values

All the program parameters for blastpgp can be printed on the screen using the following command line:
blastpgp - <hit enter key>

Here, we list them out individually in the tables below, each with detailed description and example usage.

Table 5.1

Parameter -p

Function Specifies search mode for PHI-BLAST

Default blastpgp

Input Format [String]

Example To run PHI-BLAST, used: -p patseedp

Note
A third option is "seedp", which reads in pattern file with HI lines.

Table 5.2

Parameter -d

Function Specifies target database

Default nr

Input Format [String]

Example To run search against refseq_protein, use: -d refseq_protein

Note
Database file name should be provided WITHOUT file extension. Multiple databases should be provided with double quotes, like in -d "refseq_protien pdbaa".

Table 5.3

Parameter -i

Function Specifies input query file

Default stdin

Input Format [File In]

Example To search with my_kinase.txt, use: -i my_kinase.txt

Note
User the complete query file name WITH extension. The sequences should be in FASTA format, and the file should be saved as a plain text file.

Table 5.4

Parameter -o

Function Saves the oput in the specified file

Default stdout

Input Format [String]

Example To save result in my_kinase.out, use: -o my_kinase.out

Note
Users can use redirecttion with < symbol followed by file name.

Table 5.5

Parameter -e

Function The upper limit of Expect value cutoff

Default 10.0

Input Format [Real] use either floating point or scientific notation

Example To decrease the e value to 0.1, use: -e 0.1

Note
Accepted formats are 10, 3/4, 0.01, or 2e-34. Smaller values increase the search stringency.

Table 5.6

Parameter -m

Function alignment view options

Default 0

Input Format [Integer]

Example To get result in XML format, use: -m 7

Note
Accepted values and the format they specify

 0 = pairwise
 1 = query-anchored showing identities
 2 = query-anchored no identities
 3 = flat query-anchored, show identities
 4 = flat query-anchored, no identities
 5 = query-anchored no identities/blunt ends
 6 = flat query-anchored, no identities/blunt ends
 7 = XML Blast output
 8 = tabular
 9 = tabular with comment lines
10 = ASN, text
11 = ASN, binary [Integer]

Table 5.7

Parameter -F

Function Filters query sequence

Default F

Input Format [String]

Example To activate filter, use: -F T
To mask lookup table when filtering, use: -F "m S"

Note
It uses SEG when set to T. For more information see Section 6.4 of "BLAST URLAPI". When -t setting is > 0 (the default is 1), it has the side effect of masking the database sequences. Additional masking of the query sequence under this is likely to be counterproductive.

Table 5.8

Parameter -f

Function Sets score threshold for extending an initial word match

Default 11

Input Format [Integer]

Example To increase the requirement to 12, use: -f 12

Note
Increase the setting makes the search more stringent. This parameter is one of the main reasons that BLAST is "heuristic"; increasing the value of -f saves time but eliminates matches. The default value of -f is set with the assumption that the the default word size -W 3 (Table 5.30) is used. If one instead uses -W 2, then -f should be lowered.

Table 5.9

Parameter -A

Function Multiple hit window size

Default 40

Input Format [integer]

Example To reduce the window size to 30, use: -A 30

Note
When -P is set to 0 (Table 5.13), blastpgp requires two word hits within the distance specified by -A to initiate alignment extension. A reduction in setting of .-A. saves time at the risk of missing matches.

Table 5.10

Parameter -G

Function Specifies the cost for opening a gap

Default 11

Input Format [integer]

Example To increase this cost to 12, use: -G 12

Note
Corresponding -E change may be required. For allowed value when using non-PSSM, see Section 6.11.2 of BLAST URLAPI.

Table 5.11

Parameter -E

Function Cost to extend a gap

Default 1

Input Format [integer]

Example To increase this to 2, use: -E 2

Note
Corresponding -G change may be required.

Table 5.12

Parameter -N

Function Sets threshold for gapped extension for an ungapped alignment

Default 22

Input Format [integer]

Example To increase this to 25, use: -N 25

Note
Similar to X-dropoff for ungapped alignment extension(?). Score threshold measured in bits. Increasing -N saves time, but risks missing matches.

Table 5.13

Parameter -P

Function Type of hits

Default 0

Input Format [integer]

Example To run in single hit mode, use: -P 1

Note
0 for multiple hits; 1 for single hit.

Table 5.14

Parameter -S

Function Start of required region in query

Default 1

Input Format [Integer]

Example To start in position 100 in the query, use: -S 100

Note
To be used in conjunction with -H.

Table 5.15

Parameter -H

Function End of required region in query

Default -1

Input Format [integer]

Example To end in position 200 in the query, use: -H 200

Note
This and -S specify the subsequence of the query to use. They function in a manner different from -L in blastall, in that the subsequence has to be in the alignment, but the alignment may be extend beyond the specified region.

Table 5.16

Parameter -a

Function Number of processors to use

Default 1

Input Format [integer]

Example To utilize both processors in a two processor machine, use: -a 2

Note
1 up to the number of processors available.

Table 5.17

Parameter -I (upper case i)

Function Displays GI in the deflines for matches from NCBI database

Default F

Input Format [T/F]

Example To activate gi display, use: -I T

Note

-I F: ref|NP_000241.1| 
myeloperoxidase [Homo sapiens]

-I T: gi|4557759|ref|NP_000241.1| myeloperoxidase [Homo 
sapiens]

Table 5.18

Parameter -h

Function E-value threshold for inclusion of hits from round i

Default 0.002

Input Format [real]

Example To loosen this to 0.005, use: -h 0.005

Note
Hits with expect value lower than the setting will be included in the construction of the PSSM, which will be used in the scoring of hits in round i+1.

Table 5.19

Parameter -c

Function Constant in pseudocounts for multipass version

Default 9

Input Format [Integer]

Example To change this to 10, use: -c 10

Note
Pseudocounts from a standard probability distribution are combined with the actual letter distribution of the query.

Table 5.20

Parameter -j

Function Maximal rounds of iteration

Default 1

Input Format [Integer]

Example To search for 10 iterations, use: -j 10

Note
Default is single iteration - regular blastp or PHI-BLAST without subsequent iterations in PSI-BLAST mode. To generate PSSM, setting should be ≥ 2 unless -B or -R is used. Use -j 0 if you want the search to continue till convergence. This may be risky since tests in [2] suggest that -j 5 or -j 6 works best on average.

Table 5.21

Parameter -J

Function Believe the query defline

Default F

Input Format [T/F]

Example To believe the query defline, use: -J T

Note
-J T is required when saving SeqAlign object (-O) or scoremat (-C, -Q).

Table 5.22

Parameter -Z

Function X dropoff value for final gapped alignment (in bits)

Default 25

Input Format [Integer]

Example To decrease this to 20, use: -Z 20

Note
Smaller settings tend to generate shorter alignments because they reduce the flexibility to include large gaps.

Table 5.23

Parameter -O

Function Saves SeqAlign Object to file

Default Optional

Input Format [File Out]

Example To same it to my_blastobj, use: -O my_blastobj

Note
-J, 'Believe the query defline', must be set to T. The file can be used to regenerate the alignment in various formats without rerun of the search.

Table 5.24

Parameter -M

Function Matrix

Default BLOSUM62

Input Format [String]

Example To change it to PAM70, use: -M PAM70

Note
This only affects the initial blastp or PHI-BLAST search.

Table 5.25

Parameter -v

Function Sets the upper limit of descriptions to display in result

Default 500

Input Format [Integer]

Example To decrease the descriptions to 100, use: -v 100

Note
Web counterpart: Descriptions.

Table 5.26

Parameter -b

Function Upper limit of database sequences to show alignments for

Default 250

Input Format [Integer]

Example To increase the alignments saved to 5000, use: -b 500

Note
Setting this parameter to a smaller value may affect the PSSM constructed.

Table 5.27

Parameter -C

Function Saves PSI-BLAST checkpointing output file to specified file

Default Optional

Input Format [File Out]

Example To save checkpointing file to my_checkpoint, use: -C my_checkpoint

Note
The saved checkpoint file or scoremat can be used to search the same query against another database. To be read in through -R parameter or used in RPSBLAST database construction.

Table 5.28

Parameter -R

Function Input File for PSI-BLAST Restart

Default Optional

Input Format [File In]

Example To read in checkpoint file my_chechpoint, use: -R my_checkpoint

Note
This is for restart a previously unfinished PSI-BLAST search or use the scoremat to score search against a different database using the same query.

Table 5.29

Parameter -W

Function Word size

Default 3

Input Format [Integer]

Example To set word size to 2, use: -W 2

Note
Only two options are available. If -W 2 is used, one may need to reduce the -f setting also.

Table 5.30

Parameter -z

Function Effective length of the database

Default 0

Input Format [Real]

Example To set this to an arbitrary (reference) value of 10⁹, use: -z 1e9

Note
Use zero for the actual database size.

Table 5.31

Parameter -K

Function Number of best hits from a region to keep

Default 0

Input Format [Integer]

Example To keep only the best 100 hits, use: -K 100

Note
Custom value not recommended.

Table 5.32

Parameter -s

Function Compute locally optimal Smith-Waterman alignments

Default F

Input Format [T/F]

Example To do local optimal Smith-Waterman alignments, use: -s T

Note
This may change the alignment and its score. This parameter is not available in the PSI-BLAST page.

Table 5.33

Parameter -Y

Function Effective length of the search space

Default 0

Input Format [Real]

Example To set this to a reference value of 2000000, use: -Y 2000000

Note
Use zero for the real size.

Table 5.34

Parameter -k

Function Specifies the input pattern for PHI-BLAST search

Default hit_file

Input Format [File In]

Example To read in pattern from protease_pat, use: -k protease_pat

Note
This pattern input is used by PHI-BLAST search in patseedp or seedp mode.

Table 5.35

Parameter -T

Function Produces HTML output

Default F

Input Format [T/F]

Example To generate HTML output, use: -T T

Note
Database entries will be hot-linked to corresponding Entrez records when the result is viewed with a browser. For NCBI databases only.

Table 5.36

Parameter -Q

Function Generates human readable PSSM

Default Optional

Input Format [File Out]

Example To save the PSSM to my_pssm, use: -Q my_pssm

Note
The file is an AxQ matrix with log odd score for matches to different residues at each position.

Table 5.37

Parameter -B

Function Reads in an alignment file to jump start PSI-BLAST search

Default Optional

Input Format [File In]

Example To jump start with alignment in my_alignment, use: -B my_alignment

Note
See Section 3.3 for input example and file format.

Table 5.38

Parameter -l (lower case L)

Function Restricts search of database to a list of GI's

Default Optional

Input Format [String]

Example To restrict the search to GIs in sequences.gi, use: -l sequences.gi

Note
This only applies to NCBI supplied databases formatted with -o T. It enables users to search a virtual database consists of a subset of entries specified by the input file.

Table 5.39

Parameter -U

Function Uses lower case filtering of query sequence

Default F

Input Format [T/F]

Example To activate this filtering function, use: -U T

Note
It enables users to mannually mask specific regions in the query with lower case letters. This will make the search focus on the region of interest left in UPPERCASE.

Table 5.40

Parameter -t

Function Uses composition-based statistics or compositional matrix adjustment

Default 1 (T)

Input Format [String]

Example To inactivate composition-based statistics, use: -t F or -t 0

Note
A value > 0 also has the side-effect of masking the database sequences. At this time, the settings 2 and 3 are supported only if -M is set to the default matrix BLOSUM62. Accepted values and their functions are

0, F, or f: no composition-based statistics
1, T, or t: Composition-based statistics [2]
2: Composition-based score adjustment [5] conditioned on 
sequence properties in round 1
3: Composition-based score adjustment [5] 
unconditionally in round 1

Table 5.41

Parameter -q

Function Specifies the input format of scoremat (checkpoint)

Default 0

Input Format [Integer]

Example To read in an ASCII scoremat, use: -q 1

Note
To be used with -R. Accepted values and their functions are

0: no scoremat input
1: Restart is from ASCII scoremat checkpoint file
2: Restart is from binary scoremat checkpoint file

Table 3.42

Parameter -u

Function Specifies the output format of scoremat (checkpoint)

Default 0

Input Format [Integer]

Example To generate ASCII scoremat, use: -u 1

Note
Works together with -C. Accepted values and their functions are

0: no scoremat output
1: Output is ASCII scoremat checkpoint file (requires -J)
2: Output is binary scoremat checkpoint file (requires -J)

Table 5.43

Parameter -L

Function Cost to decline alignment

Default 0

Input Format Integer

Example To set this to 50, use: -L 50

Note
Disabled when set to 0.

6. References

[1] Altschul et al, 1997. Nucleic Acids Res. 25: 3389 - 3402.
[2] Schaffer et al, 2001. Nucleic Acids Res. 29: 2994 - 3005.
[3] Baeza-Yates and Gonnet, 1992. Comm. ACM 35: 74 - 82.
[4] Wu and Manber, 1992. Comm. ACM 35: 83 - 91.
[5] Yu and Altschul, 2005. Bioinformatics 21: 902-911.
[6] Aravind and Ponting, 1998. Protein Science 7: 1250 - 1254.