of a lysine). More generally, a 1–X nonbonded interaction is defined as the interaction between two atoms that are separated by X − 1 covalent bonds in the structure. In the case of rings, the minimal connectivity in terms of covalent bonds between the two atoms is adopted.
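The 1–X definition above amounts to a shortest-path search on the covalent bond graph. The following minimal sketch illustrates it under stated assumptions: the function names, the atom labels, and the bond lists are hypothetical and chosen only for illustration; they are not taken from the software used in this work.

```python
from collections import deque


def bond_separation(bonds, a, b):
    """Return the minimal number of covalent bonds between atoms a and b.

    `bonds` is an iterable of (atom, atom) pairs (hypothetical input format).
    Returns None if the two atoms are not connected.
    """
    graph = {}
    for u, v in bonds:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    if a == b:
        return 0
    seen = {a}
    queue = deque([(a, 0)])
    # Breadth-first search: the first time b is reached, the path is minimal,
    # which is the convention adopted for atoms that belong to a ring.
    while queue:
        atom, dist = queue.popleft()
        for nxt in graph.get(atom, ()):
            if nxt == b:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def interaction_order(bonds, a, b):
    """Return X for the 1-X interaction, i.e. bond separation plus one."""
    n = bond_separation(bonds, a, b)
    return None if n is None else n + 1


# Illustrative bond lists (heavy atoms only).
lysine_bonds = [("N", "CA"), ("CA", "C"), ("C", "O"), ("CA", "CB"),
                ("CB", "CG"), ("CG", "CD"), ("CD", "CE"), ("CE", "NZ")]
ring_bonds = [("C1", "C2"), ("C2", "C3"), ("C3", "C4"),
              ("C4", "C5"), ("C5", "C6"), ("C6", "C1")]

print(interaction_order(lysine_bonds, "CA", "NZ"))  # five bonds apart
print(interaction_order(ring_bonds, "C1", "C5"))    # shorter way around the ring
```

In the ring example, C1 and C5 are connected by a four-bond path and by a two-bond path; the search returns the shorter one, so the pair is treated as a 1–3 interaction.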
The detection of residues with structural errors was carried out by means of smoothed and normalized energy profiles. The normalized total energy score per residue (ER) is defined as follows:
Building of a benchmark set of accurate and highly accurate protein structure models
To assess the performance of knowledge-based potentials in the detection of structural errors of small magnitude, two sets of comparative protein structure models were built. The models in the class A set represent 35 distinct folds; in the class B set, 34 distinct folds are represented. Therefore, even though not all the models in these two sets were built for proteins with different folds, the sets are not strongly biased toward any particular fold, since a large fraction of the models have a distinct fold. In terms of a more general classification, such as the one defined by the composition and arrangement of secondary structure elements, a fairly good and only weakly biased representation is achieved. For the class A set, 26% of the models contain only α-helices as secondary structure elements, 24% contain only β-sheets, 18% contain α and β, and 33% contain α + β secondary structure elements in their structures. In the class B set, 36% of the models contain only α-helices, 31% contain only β-sheets, 13% contain α and β, and 20% contain α + β.
The 57 accurate and 55 highly accurate protein models were selected from an existing set of 3375 models with a correct fold that has been described previously (Sanchez and Sali 1998; Melo et al. 2002). This original set of 3375 models was built by comparative modeling of representative chains of the Protein Data Bank (PDB) (Berman et al. 2002). The models were built based on the correct templates and mostly correct alignments between the target sequences and the template structures. The models were obtained by applying MODPIPE to 1085 representative chains of the PDB (Sanchez and Sali 1998). These representative sequences corresponded to the protein chains in the PDB that shared <30% sequence identity or were >30 residues different in length. The templates for comparative modeling, selected by MODPIPE, were 1637 PDB chains with <80% sequence identity to each other or >30 residues different in length. Each target sequence was aligned separately with each one of the 1637 known structures using the program ALIGN, which implements local sequence alignment by dynamic programming (Altschul 1998). Only the target–template alignments with a significance score higher than 22 nats (corresponding approximately to a PSI-BLAST E-value of 10⁻⁴) were used, resulting in 3993 models. Models with <30% structural overlap with the actual experimentally determined structure were eliminated. Structural overlap was defined as the fraction of equivalent Cα atoms upon least-squares superposition of the two structures with a 3.5 Å cutoff. This procedure also removed models based on correct templates that had a poor alignment and models based on templates that had large domain or rigid-body movements with respect to the target structure. The final set contained 3375 correct models (Melo et al. 2002).
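The structural overlap criterion can be illustrated with a short sketch. It assumes that the two structures have already been superposed by least squares and that equivalent Cα atoms are paired by list index; the function names and the choice of the number of paired atoms as the denominator are assumptions made for illustration, not details confirmed by the text.

```python
import math


def structural_overlap(model_ca, native_ca, cutoff=3.5):
    """Fraction of paired C-alpha atoms lying within `cutoff` angstroms.

    Both inputs are lists of (x, y, z) tuples, already superposed and paired
    by index (assumption). The denominator is the number of pairs (assumption).
    """
    if len(model_ca) != len(native_ca):
        raise ValueError("coordinate lists must have the same length")
    if not model_ca:
        return 0.0
    close = sum(1 for p, q in zip(model_ca, native_ca)
                if math.dist(p, q) <= cutoff)
    return close / len(model_ca)


def keep_model(model_ca, native_ca, min_overlap=0.30):
    """Apply the 30% filter: models below the minimum overlap are eliminated."""
    return structural_overlap(model_ca, native_ca) >= min_overlap


# Toy coordinates: the paired distances are 0, 3, 4 and 5 angstroms.
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
native = [(0.0, 0.0, 0.0), (1.0, 0.0, 3.0), (10.0, 0.0, 4.0), (0.0, 5.0, 0.0)]
print(structural_overlap(model, native))
print(keep_model(model, native))
```

Two of the four toy pairs fall within 3.5 Å, so this model would pass the 30% filter.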
The set of 3375 correct models was initially filtered by updating the target and template structures currently available at the PDB and checking the sequence alignments originally used to build the models. A total of 132 models presented inconsistencies between the target sequence in the original alignment used to build the model and the current target sequence available at the PDB. These 132 models were removed, resulting in a total of 3243 entries (Fig. 1). Then a second filter was applied, selecting only those protein models with a length >100 residues for which >90% of the residues could be modeled. Finally, as explained above, two independent filters were applied to select those models belonging to the class A and class B sets (Fig. 1). All models in both sets were built for monomeric target proteins. The 3D coordinates of these models in PDB format are available as supplemental material at http://protein.bio.puc.cl/sup-mat.html.
Based on these two rates, receiver operating characteristic (ROC) curves were calculated for each potential as previously described (Fawcett 2004) and used to assess its performance. ROC curves are two-dimensional graphs in which the TP rate is plotted on the Y-axis and the FP rate is plotted on the X-axis (Swets 1988; Swets et al. 2000; Fawcett 2004). An ROC graph depicts the relative tradeoffs between benefits (true positives) and costs (false positives) for all possible decision thresholds.
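The construction of an ROC curve from a set of scored instances can be sketched as follows. This is a minimal illustration, not the procedure used in this work: it assumes that a higher score indicates a more likely positive instance and that an instance is called positive when its score is at or above the decision threshold. The function name and the toy data are hypothetical.

```python
def roc_points(scores, labels):
    """Return ROC points as (FP rate, TP rate), one per decision threshold.

    `labels` holds 1 for positive and 0 for negative instances. An instance
    is classified as positive when score >= threshold (assumption).
    """
    pos = sum(1 for y in labels if y)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both positive and negative instances are required")
    points = [(0.0, 0.0)]  # threshold above every score: nothing is called positive
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and not y)
        points.append((fp / neg, tp / pos))
    return points


# Toy data: six instances, three positives and three negatives.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 1, 0, 1, 0, 0]
for fpr, tpr in roc_points(scores, labels):
    print(f"FP rate = {fpr:.2f}, TP rate = {tpr:.2f}")
```

Each threshold contributes one point; lowering the threshold moves the point up and to the right, which is the tradeoff between true positives and false positives that the graph depicts.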
In addition to the ROC curves, which constitute the best way to compare two classifiers, we also calculated four overall measures to assess and compare the performance of the potentials in the detection of errors in accurate and highly accurate protein structure models. The first measure was the area under the ROC curve (AUC), which ranges from 0.5 for a random classifier to 1.0 for a perfect classifier. The AUC provides an overall summary of the ROC curve itself, and it has an important statistical property: The AUC of a classifier is equivalent to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance (Fawcett 2004). The second measure was the accuracy (ACC), which is defined as:
Since almost all the metrics mentioned above depend on the particular decision threshold that is used to classify the instances (with AUC as the only exception), we report these metrics at a single, fixed value called the optimal threshold (OT). The OT is uniquely defined by the point on the ROC curve that has the minimal distance to the upper-left corner of the ROC graph, which would correspond to a perfect classifier (i.e., fp = 0.0 and tp = 1.0).
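The selection of the optimal threshold and the threshold-independent AUC can be sketched as below. The sketch assumes that the ROC curve is given as a list of (FP rate, TP rate) points, uses the Euclidean distance to the corner (0, 1), and integrates the curve with the trapezoidal rule; the function names and the example points are hypothetical.

```python
import math


def optimal_point(points):
    """Return the ROC point closest to the upper-left corner (fp = 0, tp = 1)."""
    return min(points, key=lambda p: math.hypot(p[0] - 0.0, p[1] - 1.0))


def auc_trapezoid(points):
    """Area under the ROC curve by the trapezoidal rule (assumed integration)."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical ROC points as (FP rate, TP rate).
roc = [(0.0, 0.0), (0.1, 0.6), (0.2, 0.9), (0.5, 0.95), (1.0, 1.0)]
print(optimal_point(roc))
print(round(auc_trapezoid(roc), 4))
```

For these points, (0.2, 0.9) lies closest to the corner, so the decision threshold that produced it would be taken as the OT, and the threshold-dependent metrics would be reported there.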
References
- Altschul S. 1998. Generalized affine gap costs for protein sequence alignment. Proteins 32: 88–96. [PubMed].
- Baker D. and Sali, A. 2001. Protein structure prediction and structural genomics. Science 294: 93–96. [PubMed].
- Berman H.M., Battistuz, T., Bhat, T.N., Bluhm, W.F., Bourne, P.E., Burkhardt, K., Feng, Z., Gilliland, G.L., Iype, L., Jain, S., et al. 2002. The Protein Data Bank. Acta Crystallogr. D Biol. Crystallogr. 58: 899–907. [PubMed].
- Bernstein H.J. 2000. Recent changes to RasMol, recombining the variants. Trends Biochem. Sci. 25: 453–455. [PubMed].
- Bonneau R., Strauss, C.E., Rohl, C.A., Chivian, D., Bradley, P., Malmstrom, L., Robertson, T., and Baker, D. 2002. De novo prediction of three-dimensional structures for major protein families. J. Mol. Biol. 322: 65–78. [PubMed].
- Bradley P., Chivian, D., Meiler, J., Misura, K.M., Rohl, C.A., Schief, W.R., Wedemeyer, W.J., Schueler-Furman, O., Murphy, P., Schonbrun, J., et al. 2003. Rosetta predictions in CASP5: Successes, failures, and prospects for complete automation. Proteins 53: 457–468. [PubMed].
- Brooks B., Bruccoleri, R., Olafson, B., States, D., Swaminathan, S., and Karplus, M. 1983. CHARMM: A program for macromolecular energy, minimizations and dynamic calculations. J. Comput. Chem. 4: 187–217.
- Burley S.K., Almo, S.C., Bonanno, J.B., Capel, M., Chance, M.R., Gaasterland, T., Lin, D., Sali, A., Studier, F.W., and Swaminathan, S. 1999. Structural genomics: Beyond the Human Genome Project. Nat. Genet. 23: 151–157. [PubMed].
- DeBolt S.E. and Skolnick, J. 1996. Evaluation of atomic level mean force potentials via inverse folding and inverse refinement of protein structures: Atomic burial position and pairwise nonbonded interactions. Protein Eng. 9: 637–655. [PubMed].
- Dill K.A. 1997. Additivity principles in biochemistry. J. Biol. Chem. 272: 701–704. [PubMed].
- Fawcett T. 2004. ROC graphs: Notes and practical considerations for researchers. Kluwer Academic Publishers, The Netherlands.
- Feliciangeli S.F., Thomas, L., Scott, G.K., Subbian, E., Hung, C., Molloy, S.S., Jean, F., Shinde, U., and Thomas, G. 2006. Identification of a pH sensor in the Furin propeptide that regulates enzyme activation. J. Biol. Chem. 281: 16108–16116. [PubMed].
- Fiser A., Do, R.K., and Sali, A. 2000. Modeling of loops in protein structures. Protein Sci. 9: 1753–1773. [PubMed].
- Gonzalez E.M., Reed, C., Bix, G., Fu, J., Zhang, Y., Gopalakrishnan, B., Greenspan, D.S., and Iozzo, R.V. 2004. BMP-1/Tolloid-like metalloproteases process endorepellin, the angiostatic C-terminal fragment of perlecan. J. Biol. Chem. 280: 7080–7087. [PubMed].
- Hagler A.T., Huler, E., and Lifson, S. 1974. Energy functions for peptides and proteins. I. Derivation of a consistent force field including the hydrogen bond from amide crystals. J. Am. Chem. Soc. 96: 5319–5327. [PubMed].
- Jones D.T. 1999. GenTHREADER: An efficient and reliable protein fold recognition method for genomic sequences. J. Mol. Biol. 287: 797–815. [PubMed].
- Jones D.T., Taylor, W.R., and Thornton, J.M. 1992. A new approach to protein fold recognition. Nature 358: 86–89. [PubMed].
- Kopp J. and Schwede, T. 2004. The SWISS-MODEL repository of annotated three-dimensional protein structure homology models. Nucleic Acids Res. 32: D230–D234. [PubMed].
- Kryshtafovych A., Venclovas, C., Fidelis, K., and Moult, J. 2005. Progress over the first decade of CASP experiments. Proteins S7: 225–236. [PubMed].
- Lazaridis T. and Karplus, M. 1998. Discrimination of the native from misfolded protein models with an energy function including implicit solvation. J. Mol. Biol. 288: 477–487. [PubMed].
- Lu H. and Skolnick, J. 2001. A distance-dependent atomic knowledge-based potential for improved protein structure selection. Proteins 44: 223–232. [PubMed].
- MacKerell A.D., Bashford, D., Bellott, M., Dunbrack Jr, R.L., Evanseck, J.D., Field, M.J., Fischer, S., Gao, J., Guo, H., Ha, S., et al. 1998. All-atom empirical potential for molecular modeling and dynamics studies of proteins. J. Phys. Chem. B 102: 3586–3616.
- Marti-Renom M.A., Stuart, A., Fiser, A., Sanchez, R., Melo, F., and Sali, A. 2000. Comparative protein structure modeling of genes and genomes. Annu. Rev. Biophys. Biomol. Struct. 29: 291–325. [PubMed].
- Melo F. and Feytmans, E. 1997. Novel knowledge-based mean force potential at atomic level. J. Mol. Biol. 267: 207–222. [PubMed].
- Melo F. and Feytmans, E. 1998. Assessing protein structures with a non-local atomic interaction energy. J. Mol. Biol. 277: 1141–1152. [PubMed].
- Melo F., Sanchez, R., and Sali, A. 2002. Statistical potentials for fold assessment. Protein Sci. 11: 430–448. [PubMed].
- Miyazawa S. and Jernigan, R.L. 1985. Estimation of effective interresidue contact energies from protein crystal structures: Quasi-chemical approximation. Macromolecules 18: 534–552.
- Park B.H. and Levitt, M. 1996. Energy functions that discriminate X-ray and near-native folds from well-constructed decoys. J. Mol. Biol. 258: 367–392. [PubMed].
- Park B.H., Huang, E.S., and Levitt, M. 1997. Factors affecting the ability of energy functions to discriminate correct from incorrect folds. J. Mol. Biol. 266: 831–846. [PubMed].
- Pieper U., Eswar, N., Braberg, H., Madhusudhan, M.S., Davis, F., Rossi, A., Marti-Renom, M.A., Karchin, R., Webb, B., Melo, F., et al. 2006. MODBASE, a database of annotated comparative protein structure models, and associated resources. Nucleic Acids Res. 33: 291–295.
- Pizzo E., Buonanno, P., Di Maro, A., Ponticelli, S., De Falco, S., Quarto, N., Cubellis, M.V., and D'Allesio, G. 2006. Ribonucleases and angiogenins from fish. J. Biol. Chem. 281: 27454–27460. [PubMed].
- Rohl C.A., Strauss, C.E., Misura, K.M., and Baker, D. 2004. Protein structure prediction using Rosetta. Methods Enzymol. 383: 66–93. [PubMed].
- Rychlewski L., Zhang, B., and Godzik, A. 1998. Function and fold predictions for Mycoplasma genitalium proteins. Fold. Des. 3: 229–238. [PubMed].
- Rychlewski L., Zhang, B., and Godzik, A. 1999. Insights from structural predictions: Analysis of Escherichia coli genome. Protein Sci. 8: 614–624. [PubMed].
- Sali A. and Blundell, T.L. 1993. Comparative protein modelling by satisfaction of spatial restraints. J. Mol. Biol. 234: 779–815. [PubMed].
- Sali A., Fiser, A., Sanchez, R., Marti-Renom, M.A., Jerkovic, B., Badretdinov, A., Melo, F., Overington, J., and Feyfant, E. 2001. MODELLER, a protein structure modeling program, Release 6v0. http://salilab.org/modeller/.
- Samudrala R. and Moult, J. 1998. An all-atom distance-dependent conditional probability discriminatory function for protein structure prediction. J. Mol. Biol. 275: 895–916. [PubMed].
- Sanchez R. and Sali, A. 1998. Large-scale protein structure modeling of the Saccharomyces cerevisiae genome. Proc. Natl. Acad. Sci. 95: 13597–13602. [PubMed].
- Sanchez R., Pieper, U., Melo, F., Eswar, N., Marti-Renom, M.A., Madhusudhan, M.S., Mirkovic, N., and Sali, A. 2000. Protein structure modeling for structural genomics. Nat. Struct. Biol. 7: (Suppl): 986–990. [PubMed].
- Sayle R. and Milner-White, E.J. 1995. RasMol: Biomolecular graphics for all. Trends Biochem. Sci. 20: 374. [PubMed].
- Schwede T., Kopp, J., Guex, N., and Peitsch, M.C. 2003. SWISS-MODEL: An automated protein homology-modeling server. Nucleic Acids Res. 31: 3381–3385. [PubMed].
- Sippl M.J. 1990. Calculation of conformational ensembles from potentials of mean force. An approach to the knowledge-based prediction of local structures in globular proteins. J. Mol. Biol. 213: 859–883. [PubMed].
- Sippl M.J. 1993a. Boltzmann's principle, knowledge-based mean fields and protein folding. An approach to the computational determination of protein structures. J. Comput. Aided Mol. Des. 7: 473–501. [PubMed].
- Sippl M.J. 1993b. Recognition of errors in three-dimensional structures of proteins. Proteins 17: 355–362. [PubMed].
- Sippl M.J. and Weitckus, S. 1992. Detection of native like models for amino acid sequences of unknown three dimensional structure in a data base of known protein conformations. Proteins 13: 258–271. [PubMed].
- Solis A.D. and Rackovsky, S. 2006. Improvement of statistical potentials and threading score functions using information maximization. Proteins 62: 892–908. [PubMed].
- Swets J.A. 1988. Measuring the accuracy of diagnostic systems. Science 240: 1285–1293. [PubMed].
- Swets J.A., Dawes, R.M., and Monahan, J. 2000. Better decisions through science. Sci. Am. 283: 82–87. [PubMed].
- Tsai J., Bonneau, R., Morozov, A.V., Kuhlman, B., Rohl, C.A., and Baker, D. 2003. An improved protein decoy set for testing energy functions for protein structure prediction. Proteins 52: 76–87. [PubMed].
- Wang K., Fan, B., Levitt, M., and Samudrala, R. 2004. Improved protein structure selection using decoy-dependent discriminatory functions. BMC Struct. Biol. 4: 8. [PubMed].
- Weiner P.K. and Kollman, P.A. 1981. AMBER: Assisted model building with energy refinement. A general program for modeling molecules and their interactions. J. Comput. Chem. 2: 287–303.
- Xia Y., Huang, E.S., Levitt, M., and Samudrala, R. 2000. Ab initio construction of protein tertiary structures using a hierarchical approach. J. Mol. Biol. 300: 171–185. [PubMed].
- Zhang L., Godzik, A., Skolnick, J., and Fetrow, J.S. 1998. Functional analysis of the Escherichia coli genome for members of the α/β hydrolase family. Fold. Des. 3: 535–548. [PubMed].
- Zhang B., Rychlewski, L., Pawlowski, K., Fetrow, J.S., Skolnick, J., and Godzik, A. 1999. From fold predictions to function predictions: Automation of functional site conservation analysis for functional genome predictions. Protein Sci. 8: 1104–1115. [PubMed].
- Zhang Y., Kolinski, A., and Skolnick, J. 2003. TOUCHSTONE II: A new approach to ab initio protein structure prediction. Biophys. J. 85: 1145–1164. [PubMed].
- Zhou H. and Zhou, Y. 2002. Distance-scaled, finite ideal-gas reference state improves structure-derived potentials of mean force for structure selection and stability prediction. Protein Sci. 11: 2714–2726. [PubMed].