Standalone PSI/PHI-BLAST: blastpgp
Tao Tao, Ph.D. NCBI User Service
TOC
1. Introduction
The program blastpgp performs regular gapped blastp searches. In addition,
it can perform iterative
searches in "Position-Specific Interative BLAST" (PSI-BLAST) mode or
pattern initiated searched in
"Pattern-Hit Initialed BLAST" (PHI-BLAST) mode. The result from a
PHI-BLAST search can serve as the
input to subsequent PSI-BLAST searches. The program is commonly used to
search for sequences distantly
related to the query that cannot be detected by traditional blastp search.
The program can build a
position-specific scoring matrix (PSSM) from the search result, for a
specific family of related sequences.
1.1. PSI-BLAST
In PSI-BLAST mode, blastpgp first performs a standard blastp search
against the target
database. The sequences found in the first round are used to build a PSSM,
which is then used
in the next round of search. As explained in the paper by the Altschul et
al [1]
and Schäffer et al [2], the traditional BLAST
algorithm was implemented using
an A ╳ A
substitution matrix, where A is the
alphabet size currently set at 28. PSI-BLAST instead uses a Q ╳ A matrix,
where Q is the length of the query sequence. At each position, the score
for a letter depends on the position with
respect to the query and the letter in the subject sequences.
The position-specific matrix for round i + 1 is built from a
constrained multiple alignment among the query and the sequences
found with sufficiently low e-value, below that set by the inclusion
threshold -h, in round i. The top part of the output
for each round distinguishes the sequences into "sequences found
previously and used in the scoring matrix" and "sequences not
used in the scoring matrix". PSI-BLAST search "converges" or stops if all
sequences found at round i + 1 below the e-value
inclusion threshold were already in the model at the beginning of the
round.
Blastpgp can also store and reuse the PSSM through -C and -R parameters,
respectively. To reuse a stored checkpoint file, the exact
same query should be used. See Section 3.2 for examples
and more detailed explanations.
Using the option -B, PSI-BLAST is also capable of importing a
"master-slave" style multiple sequence alignment, computed outside of
PSI-BLAST, to jump start the search from that alignment. The multiple
sequence alignment must include the complete query sequence,
which does not need to be the first in the alignment. The multiple
alignment file must be specified in a format that is similar to
ClustalW minus the headers and trailers, see Section
3.3 below for example.
The -R parameter, restart from a checkpoint, and -B, jump start from a
multiple alignment, are mutually exclusive.
1.2 PSI-TBLASTN
PSI-BLASTN is a variant of blastall's tblastn mode and it searches a
protein query
sequence against a (translated) nucleotide sequence database like tblastn.
However, instead
of using the default matrix as specified by -M, a PSSM created by a
PSI-BLAST run using the
same query is used to score the potential matches. The nucleotide sequence
database is dynamically
translated in all reading frames during PSI-TBLASTN search. PSI-TBLASTN
search using a PSSM may help
find more distantly related sequences encoded by unannotated nucleotide
sequences.
The first step for this type of search is to run PSI-BLAST iterations with
the query against
a target protein database by capturing the PSSM or checkpoint file output
through the -C option.
The same query is then searched with blastall against a nucleotide
database in PSI-TBLASTN
mode using "-p psitblastn" and the checpoint file (from the first step)
read through -R parameter. All
other parameter settings applicable to "blastall -p tblastn ..." also
apply here for "blastall -p psitblastn ...".
Some restrictions to parameters are 1) query must be the same as the one
used in blastpgp for PSSM generation,
2) by default, blastpgp uses -F F and blastall uses -F T. To ensure
consistent usage of the blastpgp/psitblastn combination,
the -F option should be explicitly set to be consistent with one another.
1.3 PHI-BLAST
The PHI-BLAST mode combines matching of regular expressions with local
alignments surrounding the match.
The incorporation of this function into the BLAST software framework is
partly for user convenience and
partly so that it can be combined seamlessly with PSI-BLAST.
One very restrictive way to identify protein motifs is by regular
expressions that must contain
each instance of the motif. The PROSITE database is a compilation of such
restricted regular expressions
that describe protein motifs.
This search mode takes a motif file and a query sequence containing the
motif as inputs throught the -k and -i parameters.
Given a protein sequence S and a regular expression pattern P occurring in
S, PHI-BLAST helps identify other protein sequences
contain "an occurrence of P" and are "similar to S in the vicinity of the
P". In another word, PSI-BLAST addresses the question
of "What other proteins containing the same motif are similar to the query
protein around the motif occurrence?"
PHI-BLAST is preferable to searching pattern occurrences alone because it
filters out those cases where the pattern occurrence
may be random and are not indicative of homology. PHI-BLAST may be
preferable to other flavors of BLAST because it is faster and
allows the use of a rigid pattern occurrence requirement.
The pattern search methods in PHI-BLAST are based on the algorithms
described by Baeza-Yates and Gonnet [3] and
Wu and Manber [4]. The calculation of local alignments
is done using a method very similar to gapped BLAST and
much of the same code. The method of evaluating statistical significance
of the alignment is somewhat different.
The query sequence should be in FASTA format as for all other BLAST
searches. The syntax for patterns follows the rules of PROSITE
and is documented in Section 1.5 as well as in "Pattern Search with seedtop". An input
pattern to PHI-BLAST does not have to be from PROSITE list. PHI-BLAST does
only gapped alignment since ungapped alignments contradict
almost all patterns in PROSITE.
PHI-BLAST can restrict the local alignment search to a proper subset of
the pattern occurrences,
through -p seedp, if a specified pattern occurs more than once in the
query. The use of the "seedp"
option requires the user to specify the location(s) of the occurrence(s)
in the pattern file through
the HI line. When there are multiple pattern occurrences in the query, it
may be important to decide
how many are of interest because the E-value for matches is effectively
multiplied by the number of
interesting pattern occurrences.
PHI-BLAST is integrated with the standard and non-restart usage of
PSI-BLAST. In the command-line
mode, PSI-BLAST can be invoked by using the -j option, as for standard
PSI-BLAST search. When -j value
is assigned a value different from the default value of 1, then the
results of the single round of
PHI-BLAST searching are used to construct a PSSM used in PSI-BLAST. One
may continue PSI-BLAST for more than one round.
1.4 PHI-BLAST Statistical
Significance
When a query sequence Q matches a database sequence D in PHI-BLAST, it is
useful to subdivide
Q and D into 3 disjoint pieces:
Qleft - Qpattern - Qright
Dleft - Dpattern - Dright
The substrings Qpattern and Dpattern contain the pattern specified in the
pattern file. The
pieces Qpattern and Dpattern are aligned and that alignment
is displayed as part of the PHI-BLAST
output, but the score for that alignment is mostly ignored. The "reduced"
score r of an alignment
is the sum of the scores obtained by aligning Qleft with
Dleft and Qright with Dright.
The expected number of alignments with a reduced score = x is given by:
CN (Lambda*x +
1)e^(-Lambda *x)
where:
C, Lambda : "constants" depending on the score matrix and the gap
costs.
N : (number of occurrences of pattern in database) * (number of
occurrences of pattern in Q)
e : the base of the natural logarithm.
It is important to understand that this method of computing the
statistical significance of a PHI-BLAST
alignment is mathematically different from the method used for BLAST and
PSI-BLAST alignments. The E-values
from PHI-BLAST are displayed using a similar syntax as those in
tranditional blast and PSI-BLAST.
1.5. Pattern Syntax for
PHI-BLAST.
The syntax for patterns in PHI-BLAST follows the PROSITE convention. When
using the stand-alone
program, it is permissible to have multiple patterns in a file. For more
details, see
Section
3.1 of "Pattern Search with seedtop".
Table 1.5.1
Meanings of the Pattern Symbols |
Pattern Symbols | Meaning of
the Symbols |
A-N,P-T,V-Z | Valid Protein Code |
ACGTX | Valid Nucleotide Code |
[ ] | any one of the characters enclosed in the
brackets |
- | Represents nothing. This is a spacer used by
PROSITE |
X | (followed by nothing) means any residue |
X(5) | Represents 5 positions in which any residue is
allowed. X and 5 can be substituted by other code and number,
respectively. |
X(2,4) | Represents 2 to 4 positions where any residue
is allowed. The numbers can be substituted by any other two numbers.
|
> | Can occur only at the end of a pattern and means
nothing. It may occur before a period. |
. | Can be used at the end of the pattern and means
nothing |
Here is an example record from PROSITE.
ID CNMP_BINDING_2; PATTERN.
AC PS00889;
DT OCT-1993 (CREATED); OCT-1993 (DATA UPDATE); NOV-1995 (INFO UPDATE).
DE Cyclic nucleotide-binding domain signature 2.
PA
[LIVMF]-G-E-x-[GAS]-[LIVM]-x(5,11)-R-[STAQ]-A-x-[LIVMA]-x-[STACV].
NR /RELEASE=32,49340;
NR /TOTAL=57(36); /POSITIVE=57(36); /UNKNOWN=0(0); /FALSE_POS=0(0);
NR /FALSE_NEG=1; /PARTIAL=1;
CC /TAXO-RANGE=??EP?; /MAX-REPEAT=2;
When using the stand-alone program, the pattern should be in a file, with
the first line
starting with ID followed by a space and the text strings for the
pattern name.
The actual pattern, immediately below the ID line, should start with
PA followed
by a space and the text string for pattern description. Optional HI
initialed lines can
be present to specify the pattern position.
In the previous PROSITE example, the pattern name is "CMMP_BINDING_2;
PATTERN." The pattern
specified by the PA line is explained in the table below. Other lines, if
present, are ignored by
PHI-BLAST.
Table 1.5.2
Explanation of Pattern Example CNMP_BINDING_2 |
Position in Pattern | Pattern
Syntax | Pattern Representation | Actual Length ¹
|
1 | [LIVMF] | Any one of L, I, V, M, or
F | 1 |
2 | G | One G | 1 |
3 | E | One E | 1 |
4 | X | Any one residue | 1 |
5 | [GAS] | Any one of G, A, or S | 1 |
6 | [LIVM] | Any one of L, I, V, or
M | 1 |
7 | x(5,11) | 5 to 11 any residue | 5 to
11 |
8 | R | One R | 1 |
9 | [STAQ] | Any one of S, T, A, or
Q | 1 |
10 | A | One A | 1 |
11 | x | Any one residue | 1 |
12 | [LIVMA] | Any one of L, I, V, M, or
A | 1 |
13 | X | Any one residue | 1 |
14 | [STACV] | Any one of S, T, A, C, or
V | 1 |
Note:
1 The actual length of matched patterns, found in the database
entries, will be between 18 to 24 letters long.
A simplified pattern for PHI-BLAST, with HI lines for pattern position on
the query, is give below.
ID ER_TARGET; PATTERN.
PA [KRHQSA]-[DENQ]-E-L>.
HI (19 22)
HI (201 204)
In this example, the HI lines specify that two pattern occurrences
in the input query are
of interest, 19 - 22 and 201 - 204. These HI lines are relevant when
stand-alone PHI-BLAST is used
with the seedp mode, in which the interesting occurrences of the pattern
in the sequence need to be
specified to inform PHI-BLAST only the specified pattern occurrences
should be used to find good
alignments. The seedp option is qualitatively more useful than the
standard patternp option ONLY
when the number of pattern occurrences, K, is greater than 1 AND the user
is interested only in
matches for J < K occurrences. The use of the HI lines enables the user
to specify which
occurrences of the pattern should be evaluated.
1.6 Additional Functionality Related to
PHI-BLAST
PHI-BLAST takes a sequence pattern and a query containing that pattern as
input and
searches a sequence database for other sequences containing the same
pattern and having a good
alignment in the vicinity of the pattern match to the query. This program,
however, does not
address the following two simpler questions:
Q1: Given a sequence and a list of patterns, which patterns occur in the
sequence
and where?
Q2: Given a pattern and a sequence database, which sequences contain the
pattern
and where? |
They are better addressed with seedtop, a program closely related to
PHI-BLAST, but
they do not fit into the output framework of BLAST because the answers are
simple lists without
alignments and with no notion of statistical significance.
2. Installation and Setup of
blastpgp
No additional setup is needed if the blast package is set up using the
steps described in
pc_setup.html and unix_setup/html.
3. Practical Usages
We can use blastpgp to perform different searches and address different
questions. Here we
will present a list of common scenarios and discuss them in details. Each
scenario will be in
its own subsection with example command lines.
3.1 Simple PSI-BLAST Search
PSI-BLAST search is generally used in the following situations:
- The initial blastp search is not informative
- There is a need to identify the distant members of a protein
family
- To create a model PSSM in one database and use that in another to
find better matches
To perform a PSI-BLAST search with my_kinase.txt against refseq_protein
database for 6
iterations, and save the results in my_kinase.out, we can use the
following command line. Parameters
not specified with assume default values. For details, see Appendix.
blastpgp -i my_kinase.txt -d refseq_protein -j 6 -o my_kinase.out
|
The parameter -j specifies a maximum number of iterations. The run is
considered to "converge" and
will stop early, if blastpgp finds no new matches with E-value lower than
the threshold specified by
-h (default -0.002). One can run until guaranteed convergence by
specifying -j 0, but this is risky
because the running time and output size are essentially unbounded. Some
tests [2]
suggest that -j 5 or -j 6 yields the best results for typical queries. For
queries belonging to a
large protein family, we may need to increase the -b setting to make sure
the inclusion of most (if
not all) of the family members.
3.2 Save Checkpoint File for Use in
Other Searches
We can run PSI-BLST to create and save a PSSM file through -C option. In
the following command line,
the "-u 1" instructs blastpgp to saved the PSSM in text format. This
requires the -J T. Since we do
not need the alignment output, it is redirected to the Linux /dev/null
directory.
blastpgp -i query -d refseq_protein -C chk.asn -j 2 -u 1 -J T > /dev/null
|
The file chk.asn is called a "checkpoint" based on a standard usage of
this term in computer science.
The file name used to save the checkpoint can be any unique name and need
not have the suffix .asn or
any suffix at all. The PSSM saved in chk.asn is the PSSM used to search in
the last completed round. It
is NOT the PSSM that would be hypothetically constructed from the
alignment of the last round of search.
Users should be able to check the outputs to know that at least under some
conditions the saved PSSM
yields acceptable results.
The file written by -C is now in ASN.1 format. Under "-u 1", the file is
human readable. To get a
truly human readable version of the checkpoint file listing the scores for
matches at each position,
use the -Q parameter.
The motivation for the -C and -R paramters is to support changing
databases, so that the PSSM model built
by searching one database and can be reused to search another database.
This can be useful either to
switch from a protein database to a nucleotide database and/or from a
larger database to a smaller database.
For example, we can save the PSSM (IL.chk) from a PSI-BLAST search against
nr and use it in a psitblastn
search against against est_others:
blastpgp -d nr -i interleukin -j 2 -C IL.ckp -J T -u 1 -o IL_nr.out
blastall -i interleukin -d est_others -p psitblastn -R IL.ckp -r 1 -o
IL_est.out
|
The first run saves the PSSM that was used in the last round of search
against the protein nr.
The second run uses this saved PSSM to search a translated version of
est_others. The second
type of search can be performed on a new genome:
blastpgp -i interleukin -d new_genome -R IL.ckp -p psitblastn -o
IL.genome.out
|
This is a more sensitive way to search for matches to interleukin protein
in a new, and not yet
annotated genome.
Another practical usage of the checkpoint and restart feature is to search
a few
iterations at a time.
blastpgp -i query -d nr -C query.ckp -u 1 -J T -j 4 -o output.j1-4
blastpgp -i query -d nr -R query.ckp -r 1 -J T -j 3 -o output.j5-7
|
The above examples will do at most "4 + 3 - 1" or "6" iterations. The
"minus one (-1)" is because the checkpoint corresponds
to the last matrix used, so the checkpoint saved after the first run is
the PSSM derived from
the results of iteration 3 and used to search in iteration 4.
If the search did not converge, we can restart the iterations by reading
in the saved checkpoint
file through -R. This will require that we use the same query, database,
and command line.
3.3 Jump Start PSI-BLAST Search
Using a Multiple Sequence Alignment
PSI-BLAST can also read in a multiple alignment file to jump start the
search. When doing so,
the PSSM will be constructed from the imported alignment instead of the
previous iteration.
The alignment file needs to be in the required format.
Suppose the multiple alignment file has N sequences. It may be presented
in one or more,
alignment blocks, where each block presents a range of columns from the
multiple alignment
e.g., the first block might have columns 1-60, the second block might have
columns 61-120,
the third block might have columns 121-180, etc. Each block should have N
rows, one row per
sequence. The sequences should be in the same order in every block. Blocks
are separated by
one or more blank lines.
Within a block there are no blank lines, and each line consists of a
sequence identifier followed
by some white space, and then by characters or dashes for that sequence in
the multiple alignment.
In each column, all letters must be in upper case, or all letters must be
in lower case. Upper case
means that this column is to be given position-specific scores. Lower-case
means to use the underlying
matrix (specified by -M) for this column, e.g., if the query sequence has
an 'i' residue in the column,
then the standard scores for matching an 'I' are used in the column.
A sample command line is:
blastpgp -i seq1 -B align1 -j 2 -d nr -C seq1.pssm -J T -o seq1.align
|
where "-i seq1" specifies the query, "-B align1" provides the jump start
alignment file,
"-j 2" determines the number iterations, and "-d nr" specifies the target
database.
The PSSM and alignment are saved in seq1.pssm and seq1.align,
respectively.
The example files seq1 and align1 copied below were reformatted from those
kindly supplied
by Dr. Aravind, as described in one of his paper [6].
Dr. Aravind first helped
define how -B should work. Dr. Wolf subsequently helped design a more
flexible input format for
the alignments. Dr. Schäffer did the actual implementation in C. For
use in test blastpgp run,
you can download the query (seq1) and the alignment (align1)
by right click the link and select "Save link as ..." from the menu.
seq1
----
>26SPS9_Hs
IHAAEEKDWKTAYSYFYEAFEGYDSIDSPKAITSLKYMLLCKIMLNTPEDVQALVSGKLALRYAGRQTEALKCV
AQASKNRSLADFEKALTDYRAELRDDPIISTHLAKLYDNLLEQNLIRVIEPFSRVQIEHISSLIKLSKADVERK
LSQMILDKKFHGILDQGEGVLIIFDEPP
align1
------
26SPS9_Hs IHAAEEKDWKTAYSYFYEAFEGYdsidspkaitslkymllckimlntpedvqalvsgkla
F57B9_Ce LHAADEKDFKTAFSYFYEAFEGYdsvdekvsaltalkymllckvmldlpdevnsllsakl
YDL097c_Sc ILHCEDKDYKTAFSYFFESFESYhnltthnsyekacqvlkymllskimlnliddvkniln
YMJ5_Ce LYSAEERDYKTSFSYFYEAFEGFasigdkinatsalkymilckimlneteqlagllaake
FUS6_ARATH KNYIRTRDYCTTTKHIIHMCMNAilvsiemgqfthvtsyvnkaeqnpetlepmvnaklrc
COS41.8_Ci SLDYKLKTYLTIARLYLEDEDPVqaemyinrasllqnetadeqlqihykvcyarvldyrr
644879 KCYSRARDYCTSAKHVINMCLNVikvsvylqnwshvlsyvskaestpeiaeqrgerdsqt
YPR108w_Sc IHCLAVRNFKEAAKLLVDSLATFtsieltsyesiatyasvtglftlertdlkskvidspe
eif-3p110_Hs SKAMKMGDWKTCHSFIINEKMNGkvw----------------------------------
T23D8.4_Ce SKAMLNGDWKKCQDYIVNDKMNQkvw----------------------------------
YD95_Sp IYLMSIRNFSGAADLLLDCMSTFsstellpyydvvryavisgaisldrvdvktkivdspe
KIAA0107_Hs LYCVAIRDFKQAAELFLDTVSTFtsyelmdyktfvtytvyvsmialerpdlrekvikgae
F49C12.8_Hs LYRMSVRDFAGAADLFLEAVPTFgsyelmtyenlilytvitttfaldrpdlrtkvircne
Int-6_Mm KFQYECGNYSGAAEYLYFFRVLVpatdrnalsslwgklaseilmqnwdaamedltrlket
26SPS9_Hs lryagrqtealkcvaqasknrsladfekaltdy---------------------------
F57B9_Ce alkyngsdldamkaiaaaaqkrslkdfqvafgsf--------------------------
YDL097c_Sc akytketyqsrgidamkavaeaynnrslldfntalkqy----------------------
YMJ5_Ce ivayqkspriiairsmadafrkrslkdfvkalaeh-------------------------
FUS6_ARATH asglahlelkkyklaarkfldvnpelgnsyneviapqdiatygglcalasfdrselkqkv
COS41.8_Ci kfleaaqrynelsyksaiheteqtkalekalncailapagqqrsrmlatlfkdercqllp
644879 qailtklkcaaglaelaarkykqaakclllasfdhcdfpellspsnvaiygglcalatfd
YPR108w_Sc llslisttaalqsissltislyasdyasyfpyllety-----------------------
eif-3p110_Hs ------------------------------------------------------------
T23D8.4_Ce ------------------------------------------------------------
YD95_Sp vlavlpqnesmssleacinslylcdysgffrtladve-----------------------
KIAA0107_Hs ilevlhslpavrqylfslyecrysvffqslavv---------------------------
F49C12.8_Hs vqeqltggglngtlipvreylesyydchydrffiqlaale--------------------
Int-6_Mm idnnsvssplqslqqrtwlihwslfvffnhpkgrdniidlflyqpqylnaiqtmcphilr
26SPS9_Hs ------------------------------------------------------------
F57B9_Ce ------------------------------------------------------------
YDL097c_Sc ------------------------------------------------------------
YMJ5_Ce ------------------------------------------------------------
FUS6_ARATH idninfrnflelvpdvrelindfyssryascleylasl----------------------
COS41.8_Ci sfgilekmfldriiksdemeefar------------------------------------
644879 rqelqrnvissssfklflelepqvrdiifkfyeskyasclkmldem--------------
YPR108w_Sc ------------------------------------------------------------
eif-3p110_Hs ------------------------------------------------------------
T23D8.4_Ce ------------------------------------------------------------
YD95_Sp ------------------------------------------------------------
KIAA0107_Hs ------------------------------------------------------------
F49C12.8_Hs ------------------------------------------------------------
Int-6_Mm ylttavitnkdvrkrrqvlkdlvkviqqesytykdpitefveclyvnfdfdgaqkklrec
26SPS9_Hs RAELRDDPIISTHLAKLYDNLLEQNLIRVIEPFSRVQIEHISSLIKLSKADVERKLSQMI
F57B9_Ce PQELQMDPVVRKHFHSLSERMLEKDLCRIIEPYSFVQIEHVAQQIGIDRSKVEKKLSQMI
YDL097c_Sc EKELMGDELTRSHFNALYDTLLESNLCKIIEPFECVEISHISKIIGLDTQQVEGKLSQMI
YMJ5_Ce KIELVEDKVVAVHSQNLERNMLEKEISRVIEPYSEIELSYIARVIGMTVPPVERAIARMI
FUS6_ARATH KSNLLLDIHLHDHVDTLYDQIRKKALIQYTLPFVSVDLSRMADAFKTSVSGLEKELEALI
COS41.8_Ci QLMPHQKAITADGSNILHRAVTEHNLLSASKLYNNIRFTELGALLEIPHQMAEKVASQMI
644879 KDNLLLDMYLAPHVRTLYTQIRNRALIQYFSPYVSADMHRMAAAFNTTVAALEDELTQLI
YPR108w_Sc ANVLIPCKYLNRHADFFVREMRRKVYAQLLESYKTLSLKSMASAFGVSVAFLDNDLGKFI
eif-3p110_Hs DLFPEADKVRTMLVRKIQEESLRTYLFTYSSVYDSISMETLSDMFELDLPTVHSIISKMI
T23D8.4_Ce NLFHNAETVKGMVVRRIQEESLRTYLLTYSTVYATVSLKKLADLFELSKKDVHSIISKMI
YD95_Sp VNHLKCDQFLVAHYRYYVREMRRRAYAQLLESYRALSIDSMAASFGVSVDYIDRDLASFI
KIAA0107_Hs EQEMKKDWLFAPHYRYYVREMRIHAYSQLLESYRSLTLGYMAEAFGVGVEFIDQELSRFI
F49C12.8_Hs SERFKFDRYLSPHFNYYSRGMRHRAYEQFLTPYKTVRIDMMAKDFGVSRAFIDRELHRLI
Int-6_Mm ESVLVNDFFLVACLEDFIENARLFIFETFCRIHQCISINMLADKLNMTPEEAERWIVNLI
26SPS9_Hs LDKKFHGILDQGEGVLIIFDEPP
F57B9_Ce LDQKLSGSLDQGEGMLIVFEIAV
YDL097c_Sc LDKIFYGVLDQGNGWLYVYETPN
YMJ5_Ce LDKKLMGSIDQHGDTVVVYPKAD
FUS6_ARATH TDNQIQARIDSHNKILYARHADQ
COS41.8_Ci CESRMKGHIDQIDGIVFFERRET
644879 LEGLISARVDSHSKILYARDVDQ
YPR108w_Sc PNKQLNCVIDRVNGIVETNRPDN
eif-3p110_Hs INEELMASLDQPTQTVVMHRTEP
T23D8.4_Ce IQEELSATLDEPTDCLIMHRVEP
YD95_Sp PDNKLNCVIDRVNGVVFTNRPDE
KIAA0107_Hs AAGRLHCKIDKVNEIVETNRPDS
F49C12.8_Hs ATGQLQCRIDAVNGVIEVNHRDS
Int-6_Mm RNARLDAKIDSKLGHVVMGNNAV
|
3.4 PHI-BLAST Search with a
Pattern in "patseedp" Mode
A typical PHI-BLAST search may look like this:
blastpgp -i query -k pat -p patseedp -d refseq_protein -o refp_pat.out
|
here -i specifies the input FASTA query sequence file, -k
input is a file describing the pattern
and where in the query the pattern is found, -d specifies the
target database to search against,
-p indicates the mode of search, patseedp in this case, and
-o specifies the output file.
In this search, the program does not differentiate the multiple
occurrences of the pattern.
The pairwise alignment output is essentially the same as blastp with the
exception that the pattern occurrence
in the alignment will be marked by asterisks. Working example for query
and pat files are given below.
pat
---
ID Cytidine and deoxycytidylate deaminases zinc-binding region signature.
PA [CH]-[AGV]-E-x(2)-[LIVMFGAT]-[LIVM]-x(17,33)-P-C-x(2,8)-C-x(3)-[LIVM].
HI (70 76)(100 101)(106 110)
query
-----
>gi|21955158|ref|NP_663745.1| phorbolin 1 [Homo sapiens]
MEASPASGPRHLMDPHIFTSNFNNGIGRHKTYLCYEVERLDNGTSVKMDQHRGFLHNQAKNLLCGFYGRHAEL
RFLDLVPSLQLDPAQIYRVTWFISWSPCFSWGCAGEVRAFLQENTHVRLRIFAARIYDYDPLYKEALQMLRDA
GAQVSIMTYDEFKHCWDTFVDHQGCPFQPWDGLDEHSQALSGRLRAILQNQGN
|
Sample alignment from PHI-BLAST patseedp search is excerpted below.
Score = 261 bits (688), Expect = 1e-076
Identities = 171/193 (88%), Positives = 178/193 (92%), Gaps = 3/193 (1%)
Query: 10 RHLMDPHIFTSNFNNG---IGRHKTYLCYEVERLDNGTSVKMDQHRGFLHNQAKNLLCGF
66
R+LMDP FT NFNN + R +TYLCYEVERLDNGT V MDQH GFL N+AKNLLCGF
Sbjct: 190 RYLMDPDTFTFNFNNDPLVLRRRQTYLCYEVERLDNGTWVLMDQHMGFLCNEAKNLLCGF
249
Query: 67 YGRHAELRFLDLVPSLQLDPAQIYRVTWFISWSPCFSWGCAGEVRAFLQENTHVRLRIFA
126
pattern 70 *****************************************
YGRHAELRFLDLVPSLQLDPAQIYRVTWFISWSPCFSWGCAGEVRAFLQENTHVRLRIFA
Sbjct: 250 YGRHAELRFLDLVPSLQLDPAQIYRVTWFISWSPCFSWGCAGEVRAFLQENTHVRLRIFA
309
|
3.5 PHI-BLAST Search with Pattern in
"seedp" Mode
When the pattern occurs in the query more than once and only a specific
occurrence
is of interest, PSI-BLAST search should be done in seedp mode with a
command line like:
blastpgp -i query2 -k pat2 -p seedp -d refseq_protein ...
|
Examples for pat2 and query2 are given below. The HI line in the pattern
file instructs PHI-BLAST
to focus on that specific pattern occurrence during the search. The
position is broken into sections
demarcated by the ambigous length. To find the positions of the pattern
occurrence in a given query,
we can use the program seedtop found in the blast binary distribution
package. See "Pattern Search with
seedtop" for more information.
pat2
----
ID Lipolytic enzymes "G-D-S-L" family, serine active site (PATTERN)
PA G-D-S-[LIVM]-x(1,2)-[TAG]-G.
HI (152 155) (158 159)
query2
------
>gi|15223790|ref|NP_173441.1| Family II extracellular lipase [A.
thaliana]
MNPPTPDPSPKPVAPPGPSSKPVAPPGPSPCPSPPPKPQPKPPPAPSPSPCPSPPPKPQPKPVPPPACPPTPPKPQPKPA
PPPEPKPAPPPAPKPVPCPSPPKPPAPTPKPVPPHGPPPKPAPAPTPAPSPKPAPSPPKPENKTIPAVFFFGDSVFDTGN
NNNLETKIKSNYRPYGMDFKFRVATGRFSNGMVASDYLAKYMGVKEIVPAYLDPKIQPNDLLTGVSFASGGAGYNPTTSE
AANAIPMLDQLTYFQDYIEKVNRLVRQEKSQYKLAGLEKTNQLISKGVAIVVGGSNDLIITYFGSGAQRLKNDIDSYTTI
IADSAASFVLQLYGYGARRIGVIGTPPLGCVPSQRLKKKKICNEELNYASQLFNSKLLLILGQLSKTLPNSTFVYMDIYT
IISQMLETPAAYGFEETKKPCCKTGLLSAGALCKKSTSKICPNTSSYLFWDGKLWGYIKKSQTLIDGLQMLVSMFFFGDS
IIDTGNNNNLTTEMKCNFSPYGKDFPLGVATGRFSNGKVVSDYISEYLGVKPIVPAYFDPNVQLEDLLTGVSFASGGSGY
YHLTPKISRVKSMLEQLTYFQRHIARVKRLVGEEKTDQLLAKGLSVVVAGSNDLAITYYGHGAQLLKDDIHYFTSKMANS
AASFVMQLYEYGARQIAVLGTPPLGCVPILRTLKGGLRRECAQDINYASQLFNVKLSNILDQLAKNLPNSNLIYIDIYSA
FSHILENSADYAQTGTFSAVLAFGDSILDTGNNNLLMTVSRGNFLPYGRDFPHRIPTGRFGNGRVLSDLVASGLGVKDLL
PAFRSPFLKNSELATGVCFASGGSGLDKFTASIQGVIWVQDQVSDFQRYLEKLNQQVGDAAKVKEIIANAVILVSAGNND
LAITYFSTPKRQTRYTVQAYTDMLIGWKTTFINSLYDLGARKFAILGTLPLGCLPGARQITGNLICLPNVNYGARVYNDK
VANLVNQYNQRLPNGKFVYIDMYNSLLEVINNPSQYGKTLSNTRKF
|
Sample alignment output from PHI-BLAST seedp search:
Score = 289 bits (760), Expect = 7e-085
Identities = 158/269 (58%), Positives = 200/269 (74%), Gaps = 13/269 (4%)
Query: 147 AVFFFGDSVFDTGNNNNLETKIKSNYRPYGMDFKFRVATGRFSNGMVASDYLAKYMGVKE
206
pattern 152 ********
++FFFGDS+ DTGNNNNL T++K N+ PYG DF VATGRFSNG V SDY+++Y+GVK
Sbjct: 473 SMFFFGDSIIDTGNNNNLTTEMKCNFSPYGKDFPLGVATGRFSNGKVVSDYISEYLGVKP
532
|
3.6 Integrated PHI-BLAST and PSI-BLAST
search
PHI-BLAST is integrated with PSI-BLAST, its search result can be used for
subsequent
PSI-BLAST iteration directly if the -j option is specified with a value
greater than 1. A
typical search command line, modified from that in 4.6 looks like this:
blastpgp -i query2 -k pat2 -d refp -p patseedp -j 5 -o phi_5.out -Q pssm
|
The first search will be PHI-BLAST in patseedp mode. The result from
PHI-BLAST is passed
on to PSI-BLAST search to run for 4 interations. The alignment and PSSM
are saved to phi_5.out
and pssm, respectively.
3.7 Miscellaneous
Information
Users who develop their own sequence analysis software may wish to develop
their own scoring systems.
They can adapt codes in the function posit.c, which writes out the
checkpoint, to write out
scoring systems derived by other algorithms in such a way that PSI-BLAST
can read the files in later.
The checkpoint structure is generated in a way that it can handle any
position-specific matrix that fits
in the Karlin-Altschul statistical framework for BLAST scoring.
When seaching through the PSI-BLAST page on the Web, only one pattern is
allowed per query. The input pattern has
an upper size limit of 100 characters. The PHI-BLAST Web page supports
only the "patseedp" option since it
does not take the position specification. There the user must explicitly
invoke one round of iteration at a time,
and the PHI-BLAST page provides the option to initiate a PSI-BLAST round
with the PHI-BLAST results.
The PSI-BLAST page is different from the command line blastpgp (with -j
setting) in at least two ways. First,
each submission on the Web page runs only one iteration, and the user must
explicitly approve continuation to the
next iteration. Second, the user can explicitly override the -h setting by
either excluding matches with lower
E-values or explicitly including matches with higher E-values for use in
constructing the PSSM.
Standalone PSI-BLAST can also be used to convert multiple sequence
alignments from Pfam release into scoremat.
To do this, we need to extract out individual alignment blocks from the
master file and saved it to a file.
Furthermore, we need to extract out the longest sequences from each
alignment block and save it to a separate
file for use as query. For example, if we saved an alignment block to file
A and its longest sequence
query.A, we can convert the alignment block into an ASN.1 scoremat using
the following command (unix/linux):
blastpgp -i query.A -B A -u 1 -C scoremat.A -d dummy_db >/delv/null
|
The dummy_db is a formatted protein blast database with only one sequence.
We can do this since we are not
interested in the alignment result. Rather, we are after the checkpoint
scoremat used in judging or scoring
the alignment. The collection of -C output for all the entries can be used
for generate an RPSBLAST database
using formatrpsdb tool.
4. Technical Assistance
For questions and comments on this document and BLAST in general, please
send them to:
blast-help@ncbi.nlm.nih.gov
Questions and comments on other NCBI resources should be addressed to:
info@ncbi.nlm.nih.gov
5. Appendix: Parameters and Their
Accepted Values
All the program parameters for blastpgp can be printed on the screen using
the following
command line:
blastpgp - <hit enter key>
Here, we list them out individually in the tables below, each with
detailed description
and example usage.
Table 5.1 |
Parameter | -p |
Function | Specifies search mode
for PHI-BLAST |
Default | blastpgp |
Input Format | [String] |
Example | To run PHI-BLAST, used:
-p patseedp |
Note A third option is "seedp", which reads in pattern file with HI
lines.
Table 5.2 |
Parameter | -d |
Function | Specifies target
database |
Default | nr |
Input Format | [String] |
Example | To run search against
refseq_protein, use: -d refseq_protein |
Note Database file name should be provided WITHOUT file extension.
Multiple databases should be provided with
double quotes, like in -d "refseq_protien pdbaa".
Table 5.3 |
Parameter | -i |
Function | Specifies input query
file |
Default | stdin |
Input Format | [File In] |
Example | To search with
my_kinase.txt, use: -i my_kinase.txt |
Note User the complete query file name WITH extension. The sequences
should be in FASTA format, and the
file should be saved as a plain text file.
Table 5.4 |
Parameter | -o |
Function | Saves the oput in the
specified file |
Default | stdout |
Input Format | [String] |
Example | To save result in
my_kinase.out, use: -o my_kinase.out |
Note Users can use redirecttion with < symbol followed by file
name.
Table 5.5 |
Parameter | -e |
Function | The upper limit of
Expect value cutoff |
Default | 10.0 |
Input Format | [Real] use either
floating point or scientific notation |
Example | To decrease the e value
to 0.1, use: -e 0.1 |
Note Accepted formats are 10, 3/4, 0.01, or 2e-34. Smaller values
increase the search stringency.
Table 5.6 |
Parameter | -m |
Function | alignment view
options |
Default | 0 |
Input Format | [Integer] |
Example | To get result in XML
format, use: -m 7 |
Note Accepted values and the format they specify
0 = pairwise
1 = query-anchored showing identities
2 = query-anchored no identities
3 = flat query-anchored, show identities
4 = flat query-anchored, no identities
5 = query-anchored no identities/blunt ends
6 = flat query-anchored, no identities/blunt ends
7 = XML Blast output
8 = tabular
9 = tabular with comment lines
10 = ASN, text
11 = ASN, binary [Integer]
Table 5.7 |
Parameter | -F |
Function | Filters query
sequence |
Default | F |
Input Format | [String] |
Example | To activate filter, use:
-F T To mask lookup table when filtering, use: -F "m S" |
Note It uses SEG when set to T. For more information see
Section
6.4 of "BLAST URLAPI". When
-t setting is > 0 (the default is 1), it has the side effect of masking
the database sequences. Additional masking of
the query sequence under this is likely to be counterproductive.
Table 5.8 |
Parameter | -f |
Function | Sets score threshold for
extending an initial word match |
Default | 11 |
Input Format | [Integer] |
Example | To increase the
requirement to 12, use: -f 12 |
Note Increase the setting makes the search more stringent. This
parameter is one of the main reasons that BLAST
is "heuristic"; increasing the value of -f saves time but eliminates
matches. The default value of -f is set with
the assumption that the the default word size -W 3 (Table
5.30) is used. If one instead uses -W 2,
then -f should be lowered.
Table 5.9 |
Parameter | -A |
Function | Multiple hit window
size |
Default | 40 |
Input Format | [integer] |
Example | To reduce the window size
to 30, use: -A 30 |
Note When -P is set to 0 (Table 5.13), blastpgp
requires two word hits within the distance
specified by -A to initiate alignment extension. A reduction in setting of
.-A. saves time at the risk of missing matches.
Table 5.10 |
Parameter | -G |
Function | Specifies the cost for
opening a gap |
Default | 11 |
Input Format | [integer] |
Example | To increase this cost to
12, use: -G 12 |
Note Corresponding -E change may be required. For allowed value when
using non-PSSM, see
Section
6.11.2
of BLAST URLAPI.
Table 5.11 |
Parameter | -E |
Function | Cost to extend a
gap |
Default | 1 |
Input Format | [integer] |
Example | To increase this to 2,
use: -E 2 |
Note Corresponding -G change may be required.
Table 5.12 |
Parameter | -N |
Function | Sets threshold for
gapped extension for an ungapped alignment |
Default | 22 |
Input Format | [integer] |
Example | To increase this to 25,
use: -N 25 |
Note Similar to X-dropoff for ungapped alignment extension(?). Score
threshold measured in bits. Increasing -N saves time, but risks missing
matches.
Table 5.13 |
Parameter | -P |
Function | Type of hits |
Default | 0 |
Input Format | [integer] |
Example | To run in single hit
mode, use: -P 1 |
Note 0 for multiple hits; 1 for single hit.
Table 5.14 |
Parameter | -S |
Function | Start of required region
in query |
Default | 1 |
Input Format | [Integer] |
Example | To start in position 100
in the query, use: -S 100 |
Note To be used in conjunction with -H.
Table 5.15 |
Parameter | -H |
Function | End of required region
in query |
Default | -1 |
Input Format | [integer] |
Example | To end in position 200 in
the query, use: -H 200 |
Note This and -S specify the subsequence of the query to use. They
function in a manner different from
-L in blastall, in that the subsequence has to be in the alignment, but
the alignment may be extend beyond the
specified region.
Table 5.16 |
Parameter | -a |
Function | Number of processors to
use |
Default | 1 |
Input Format | [integer] |
Example | To utilize both
processors in a two processor machine, use: -a 2 |
Note 1 up to the number of processors available.
Table 5.17 |
Parameter | -I (upper case i) |
Function | Displays GI in the
deflines for matches from NCBI database |
Default | F |
Input Format | [T/F] |
Example | To activate gi display,
use: -I T |
Note
-I F: ref|NP_000241.1|
myeloperoxidase [Homo sapiens]
-I T: gi|4557759|ref|NP_000241.1| myeloperoxidase [Homo
sapiens]
Table 5.18 |
Parameter | -h |
Function | E-value threshold for
inclusion of hits from round i |
Default | 0.002 |
Input Format | [real] |
Example | To loosen this to 0.005,
use: -h 0.005 |
Note Hits with expect value lower than the setting will be included in
the construction of the PSSM,
which will be used in the scoring of hits in round i+1.
Table 5.19 |
Parameter | -c |
Function | Constant in pseudocounts
for multipass version |
Default | 9 |
Input Format | [Integer] |
Example | To change this to 10,
use: -c 10 |
Note Pseudocounts from a standard probability distribution are combined
with the actual letter distribution of the query.
Table 5.20 |
Parameter | -j |
Function | Maximal rounds of
iteration |
Default | 1 |
Input Format | [Integer] |
Example | To search for 10
iterations, use: -j 10 |
Note Default is single iteration - regular blastp or PHI-BLAST without
subsequent iterations in PSI-BLAST mode.
To generate PSSM, setting should be ≥ 2 unless -B or -R is used. Use
-j 0 if you want the search to continue till convergence.
This may be risky since tests in [2] suggest that -j 5
or -j 6 works best on average.
Table 5.21 |
Parameter | -J |
Function | Believe the query
defline |
Default | F |
Input Format | [T/F] |
Example | To believe the query
defline, use: -J T |
Note -J T is required when saving SeqAlign object (-O) or scoremat (-C,
-Q).
Table 5.22 |
Parameter | -Z |
Function | X dropoff value for
final gapped alignment (in bits) |
Default | 25 |
Input Format | [Integer] |
Example | To decrease this to 20,
use: -Z 20 |
Note Smaller settings tend to generate shorter alignments because they
reduce the flexibility to include large gaps.
Table 5.23 |
Parameter | -O |
Function | Saves SeqAlign Object to
file |
Default | Optional |
Input Format | [File Out] |
Example | To same it to
my_blastobj, use: -O my_blastobj |
Note -J, 'Believe the query defline', must be set to T. The file can be
used to regenerate the alignment in various formats
without rerun of the search.
Table 5.24 |
Parameter | -M |
Function | Matrix |
Default | BLOSUM62 |
Input Format | [String] |
Example | To change it to PAM70,
use: -M PAM70 |
Note This only affects the initial blastp or PHI-BLAST search.
Table 5.25 |
Parameter | -v |
Function | Sets the upper limit of
descriptions to display in result |
Default | 500 |
Input Format | [Integer] |
Example | To decrease the
descriptions to 100, use: -v 100 |
Note Web counterpart: Descriptions.
Table 5.26 |
Parameter | -b |
Function | Upper limit of database
sequences to show alignments for |
Default | 250 |
Input Format | [Integer] |
Example | To increase the
alignments saved to 5000, use: -b 500 |
Note Setting this parameter to a smaller value may affect the PSSM
constructed.
Table 5.27 |
Parameter | -C |
Function | Saves PSI-BLAST
checkpointing output file to specified file |
Default | Optional |
Input Format | [File Out] |
Example | To save checkpointing
file to my_checkpoint, use: -C my_checkpoint |
Note The saved checkpoint file or scoremat can be used to search the
same query against another database.
To be read in through -R parameter or used in RPSBLAST database
construction.
Table 5.28 |
Parameter | -R |
Function | Input File for PSI-BLAST
Restart |
Default | Optional |
Input Format | [File In] |
Example | To read in checkpoint
file my_chechpoint, use: -R my_checkpoint |
Note This is for restart a previously unfinished PSI-BLAST search or
use the scoremat to score search against
a different database using the same query.
Table 5.29 |
Parameter | -W |
Function | Word size |
Default | 3 |
Input Format | [Integer] |
Example | To set word size to 2,
use: -W 2 |
Note Only two options are available. If -W 2 is used, one may need to
reduce the -f setting also.
Table 5.30 |
Parameter | -z |
Function | Effective length of the
database |
Default | 0 |
Input Format | [Real] |
Example | To set this to an
arbitrary (reference) value of 109, use: -z 1e9 |
Note Use zero for the actual database size.
Table 5.31 |
Parameter | -K |
Function | Number of best hits from
a region to keep |
Default | 0 |
Input Format | [Integer] |
Example | To keep only the best 100
hits, use: -K 100 |
Note Custom value not recommended.
Table 5.32 |
Parameter | -s |
Function | Compute locally optimal
Smith-Waterman alignments |
Default | F |
Input Format | [T/F] |
Example | To do local optimal
Smith-Waterman alignments, use: -s T |
Note This may change the alignment and its score. This parameter is not
available in the PSI-BLAST page.
Table 5.33 |
Parameter | -Y |
Function | Effective length of the
search space |
Default | 0 |
Input Format | [Real] |
Example | To set this to a
reference value of 2000000, use: -Y 2000000 |
Note Use zero for the real size.
Table 5.34 |
Parameter | -k |
Function | Specifies the input
pattern for PHI-BLAST search |
Default | hit_file |
Input Format | [File In] |
Example | To read in pattern from
protease_pat, use: -k protease_pat |
Note This pattern input is used by PHI-BLAST search in patseedp or
seedp mode.
Table 5.35 |
Parameter | -T |
Function | Produces HTML
output |
Default | F |
Input Format | [T/F] |
Example | To generate HTML output,
use: -T T |
Note Database entries will be hot-linked to corresponding Entrez
records when the result is viewed with a browser.
For NCBI databases only.
Table 5.36 |
Parameter | -Q |
Function | Generates human readable
PSSM |
Default | Optional |
Input Format | [File Out] |
Example | To save the PSSM to
my_pssm, use: -Q my_pssm |
Note The file is an AxQ matrix with log odd score for matches to
different residues at each position.
Table 5.37 |
Parameter | -B |
Function | Reads in an alignment
file to jump start PSI-BLAST search |
Default | Optional |
Input Format | [File In] |
Example | To jump start with
alignment in my_alignment, use: -B my_alignment |
Note See Section 3.3 for input example and file
format.
Table 5.38 |
Parameter | -l (lower case L) |
Function | Restricts search of
database to a list of GI's |
Default | Optional |
Input Format | [String] |
Example | To restrict the search to
GIs in sequences.gi, use: -l sequences.gi |
Note This only applies to NCBI supplied databases formatted with -o T.
It enables users to search a virtual database consists
of a subset of entries specified by the input file.
Table 5.39 |
Parameter | -U |
Function | Uses lower case
filtering of query sequence |
Default | F |
Input Format | [T/F] |
Example | To activate this
filtering function, use: -U T |
Note It enables users to mannually mask specific regions in the query
with lower case letters. This will make the search
focus on the region of interest left in UPPERCASE.
Table 5.40 |
Parameter | -t |
Function | Uses composition-based
statistics or compositional matrix adjustment |
Default | 1 (T) |
Input Format | [String] |
Example | To inactivate
composition-based statistics, use: -t F or -t 0 |
Note A value > 0 also has the side-effect of masking the database
sequences. At this time, the settings 2 and 3
are supported only if -M is set to the default matrix BLOSUM62. Accepted
values and their functions are
0, F, or f: no composition-based statistics
1, T, or t: Composition-based statistics [2]
2: Composition-based score adjustment [5] conditioned on
sequence properties in round 1
3: Composition-based score adjustment [5]
unconditionally in round 1
Table 5.41 |
Parameter | -q |
Function | Specifies the input
format of scoremat (checkpoint) |
Default | 0 |
Input Format | [Integer] |
Example | To read in an ASCII
scoremat, use: -q 1 |
Note To be used with -R. Accepted values and their functions are
0: no scoremat input
1: Restart is from ASCII scoremat checkpoint file
2: Restart is from binary scoremat checkpoint file
Table 3.42 |
Parameter | -u |
Function | Specifies the output
format of scoremat (checkpoint) |
Default | 0 |
Input Format | [Integer] |
Example | To generate ASCII
scoremat, use: -u 1 |
Note Works together with -C. Accepted values and their functions
are
0: no scoremat output
1: Output is ASCII scoremat checkpoint file (requires -J)
2: Output is binary scoremat checkpoint file (requires -J)
Table 5.43 |
Parameter | -L |
Function | Cost to decline
alignment |
Default | 0 |
Input Format | Integer |
Example | To set this to 50, use:
-L 50 |
Note Disabled when set to 0.
6. References
[1]
Altschul et al, 1997. Nucleic Acids Res. 25: 3389 - 3402.
[2]
Schaffer et al, 2001. Nucleic Acids Res. 29: 2994 - 3005.
[3] Baeza-Yates and Gonnet, 1992. Comm. ACM 35: 74 -
82.
[4] Wu and Manber, 1992. Comm. ACM 35: 83 - 91.
[5]
Yu and Altschul, 2005. Bioinformatics 21: 902-911.
[6]
Aravind and Ponting, 1998. Protein Science 7: 1250 - 1254.
|