Evolutionary Plasticity of Protein Families: Coupling Between Sequence and Structure Variation

doi:10.1002/prot.20644

Proteins. Author manuscript; available in PMC 2007 August 8.

Published in final edited form as:

Proteins. 2005 November 15; 61(3): 535–544.

PMCID: PMC1941674

NIHMSID: NIHMS13499

Anna R. Panchenko,¹,* Yuri I. Wolf,¹ Larisa A. Panchenko,² and Thomas Madej¹

¹ Computational Biology Branch, National Center for Biotechnology Information, National Institutes of Health, Bethesda, Maryland

² Department of Biology, Moscow State University, Moscow, Russia

*Correspondence to: Anna Panchenko, Computational Biology Branch, National Center for Biotechnology Information, Building 38A, National Institutes of Health, Bethesda, MD 20894. E-mail: panch/at/ncbi.nlm.nih.gov

The publisher's final edited version of this article is available at Proteins.

Abstract

In this work we examine how protein structural changes are coupled with sequence variation in the course of evolution of a family of homologs. The sequence–structure correlation analysis performed on 81 homologous protein families shows that the majority of them exhibit statistically significant linear correlation between the measures of sequence and structural similarity. We observed, however, that there are cases where structural variability cannot be mainly explained by sequence variation, such as protein families with a number of disulfide bonds. To understand whether structures from different families and/or folds evolve in the same manner, we compared the degrees of structural change per unit of sequence change (“the evolutionary plasticity of structure”) between those families with a significant linear correlation. Using rigorous statistical procedures we find that, with a few exceptions, evolutionary plasticity does not show a statistically significant difference between protein families. Similar sequence–structure analysis performed for protein loop regions shows that evolutionary plasticity of loop regions is greater than for the protein core.

Keywords: protein structural evolution, sequence variation, protein loops, sequence-structure correlation

INTRODUCTION

A protein sequence folds into a unique, highly ordered conformation which maintains its specific function. As proteins evolve, their sequences change due to amino acid replacements, the majority of which are believed to be effectively neutral.¹ Consequently, protein-specific function, structure, folding, and the protein–protein interaction network as a rule change gradually in the course of evolution. Indeed, the overall protein structural topology is so well preserved throughout evolution that proteins that diverged billions of years ago may still show remarkable structural resemblance and, in many cases, sequence conservation as well.²

The fundamental question of whether protein structures evolve by divergence or by convergence inspired many comparative studies of protein structures and networks of protein similarities.³–¹⁰,⁴² According to the convergent scenario, protein structural similarity can occur independently in two proteins due to the limited number of topological arrangements.¹¹,¹² Recently, graph theoretical methods have been used to show that convergent models do not adequately describe the patterns of sequence and structural similarity observed in the populations of real proteins.⁸,¹⁰ By contrast, the scale-free behavior and other important characteristic features of protein networks can be correctly reproduced using divergent models of structural evolution.⁷–¹⁰ In these models, new protein structures emerge, and existing structures change, through the processes of duplication and subsequent divergence from a common ancestor.

The sequence and structural analysis of many commonly observed protein folds points to the dominant role of divergent mechanisms in protein structural evolution as well.¹³–¹⁷ It has been demonstrated, for example, that proteins from the TIM barrel, OB-fold, cupredoxin, and β-trefoil folds have common features in their topology, nature of ligands, and location of catalytic residues, which points to the plausibility of divergent scenarios for these and other protein folds comprising the protein universe. In a previous study, we likewise observed a significant linear correlation between sequence similarity and loop structural similarity for the aforementioned folds.¹⁸ Given that the loops do not contribute much to the protein core stability, we argued that the strong coupling between the changes in sequence and loop structure can only happen due to divergent evolution.

Chothia and Lesk first addressed the question of coupling between the structural and sequence changes in proteins, and found an exponential dependence of root-mean-square deviation on percent of sequence identity.² Further studies that were performed on larger datasets of proteins showed similar results.⁵,¹⁹ Recently, however, it has been shown on a sample of 36 protein families that most of the structural variation in aligned regions of homologous proteins is linearly correlated with the changes in sequence, which supports the “global” model of protein structure.²⁰ According to this model, all residue–residue interactions, not just a few key residues, are important in determining the unique protein structure. In an attempt to solve the “fold recognition” problem and design structural models for new sequences, Koehl and Levitt performed an analysis of how structural changes between two protein folds correlate with the differences between the sequences that are compatible with these folds.²¹ They also found, on a benchmark of 12 protein families, that structural changes as measured by cRMS are linearly related to the changes in sequence.

In this article we study how the protein structure changes in its conserved aligned core regions and unaligned loop regions as proteins diverge from a common ancestor. We performed a sequence–structure correlation analysis on a large number of families of homologous proteins and found a statistically significant linear correlation between measures of sequence and structural similarity for the great majority of these families. This finding allows us to address the next important question: how much sequence change a protein structure can tolerate, and whether this depends on the type of protein fold or on some other sequence and structural characteristics. We call this quantity “the evolutionary plasticity of structure” (EPS), and estimate it by calculating the regression coefficients of linear sequence–structure dependencies for homologs.

METHODS

Test Set

Sets of homologous protein families were extracted from the CDD search database version 1.62 at http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml. The CDD collection of protein domain alignments includes curated CDDs²² and preprocessed domain families imported from SMART and PFAM, 6222 protein domain families altogether.²³ Upon import, the sequences from SMART/PFAM alignments with more than 75% identity with known structures were substituted by the most similar structure from the Protein Data Bank.²⁴ Those families containing short sequence repeats and having average alignment length of less than 50 residues were excluded from the test set.

Each CDD family was decomposed into a set of pairwise structure–structure alignments. Structural alignments within CDD families were computed by the VAST algorithm,²⁵ and were selected for analysis according to the following criteria: (a) the mutual overlap between the VAST alignment footprint and CDD footprint (the footprint for a given sequence was defined as a region between the first and the last residues aligned by VAST or CDD) was at least 80%; (b) X-ray resolution of both structures in a pair was better than 3.0 Å; (c) BLAST E-value calculated for VAST alignment was less than 0.01; (d) any discontinuous domain²⁶ inconsistently aligned between VAST and CDD was disregarded.

In addition to the requirements imposed on structural pairs, we selected protein families based on the following criteria: (a) the protein family should contain at least 10 structurally aligned protein pairs; (b) proteins from a given family should span a wide range of sequence similarity, that is, should cover a range of at least 30% in sequence identity between the most diverged and least diverged structural pair; (c) not more than two protein family alignments from the same domain cluster were retained in the final test set; the redundancy between protein families was checked by using the procedure implemented in the CDART algorithm.²⁷ Even though these protein families can belong to the same domain cluster, they come from different sources and have rather different alignments (Table I).
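The pair-level and family-level criteria above can be sketched as simple filters. This is only an illustration: the record layout (`StructurePair`) and the function names are hypothetical, the footprint overlap is assumed to be precomputed as a fraction, and criterion (c), the CDART-based redundancy check, is not modeled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StructurePair:
    """One structure-structure alignment (hypothetical record layout)."""
    footprint_overlap: float   # mutual VAST/CDD footprint overlap, as a fraction
    resolution_a: float        # X-ray resolution of the first structure, in angstroms
    resolution_b: float        # X-ray resolution of the second structure, in angstroms
    blast_evalue: float        # BLAST E-value computed for the VAST alignment
    percent_identity: float    # sequence identity of the aligned pair, in percent
    inconsistent_discontinuous_domain: bool = False


def pair_passes(pair: StructurePair) -> bool:
    """Apply pair criteria (a)-(d); 'better' resolution means a smaller value."""
    return (
        pair.footprint_overlap >= 0.80
        and pair.resolution_a < 3.0
        and pair.resolution_b < 3.0
        and pair.blast_evalue < 0.01
        and not pair.inconsistent_discontinuous_domain
    )


def family_passes(pairs: List[StructurePair]) -> bool:
    """Apply family criteria (a) and (b) to the pairs that survive filtering."""
    kept = [p for p in pairs if pair_passes(p)]
    if len(kept) < 10:
        return False
    identities = [p.percent_identity for p in kept]
    return max(identities) - min(identities) >= 30.0
```

Whether the 30% identity range is evaluated before or after pair filtering is not stated in the text; the sketch assumes it is evaluated on the retained pairs.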

TABLE I

The final test set comprised 81 CDD families covering a wide range of functional and structural classes. The list of test families together with their length, number of protein pairs, and the PDB code of the first structure is shown in Table I. The test set for loop analysis contained 59 families, excluding 22 families that had a high fraction of pairs with missing coordinates in loop regions (see the next section).

Measures of Structural and Sequence Similarity

To measure the quality of linear correlation between sequence and structural characteristics for homologous proteins from the same family, we first need to choose the most sensitive and reliable measures of sequence and structural similarity. Because most of the structural similarity measures (RMSD, AHM, LHM) are extensive and depend on the number of residues and protein size, the aforementioned structural measures should be divided by the radius of gyration (similar but not identical results were obtained with the normalization by the square root of the number of aligned residues). The radius of gyration for a protein pair was calculated for each of the two proteins in the pair based on the structurally aligned part and then was averaged. As a result, the normalized RMSD, AHM, and LHM quantities do not depend on the number of residues any more. Nonnormalized conventional measures of structural similarity yielded weaker sequence/structure correlation (not shown) so that in our further analysis we used only normalized structural similarity measures.
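The normalization described above can be sketched as follows. This is a minimal illustration that assumes the radius of gyration is computed from unweighted Cα coordinates of the structurally aligned residues; the function names are hypothetical.

```python
import math


def radius_of_gyration(coords):
    """Root-mean-square distance of the points from their centroid."""
    n = len(coords)
    centroid = tuple(sum(p[k] for p in coords) / n for k in range(3))
    mean_sq = sum(math.dist(p, centroid) ** 2 for p in coords) / n
    return math.sqrt(mean_sq)


def normalize_by_rg(value, aligned_coords_a, aligned_coords_b):
    """Divide a structural measure (RMSD, AHM, LHM) by the pair-averaged Rg."""
    rg_a = radius_of_gyration(aligned_coords_a)
    rg_b = radius_of_gyration(aligned_coords_b)
    return value / (0.5 * (rg_a + rg_b))
```

The result is dimensionless, so values can be compared across families with very different domain sizes.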

The sequence similarity was measured as the BLAST bitscore²⁸ divided by the alignment length (bitscore per residue). Structural similarity measures based on comparing the structures in the aligned regions comprised RMSD, fraction of conserved contacts (CC), and aligned Hausdorff measure (AHM), whereas the loop-based Hausdorff measure (LHM) quantified the difference in the loop regions. The fraction of conserved contacts was calculated as the number of identical residue contacts in both structures divided by the average number of contacts in both structures made by the aligned residues.²⁹ The contacts were defined between residues separated along the chain by at least five peptide bonds and having Cα atoms less than 8 Å apart.
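The fraction of conserved contacts can be illustrated with a short sketch. It assumes that the coordinates are given for aligned residues only, in alignment order, so that a contact can be identified by a pair of alignment columns; the function names and this representation are assumptions made for the example.

```python
import math
from itertools import combinations


def contacts(ca_coords, residue_numbers, min_separation=5, cutoff=8.0):
    """Contacts among aligned residues, keyed by pairs of alignment columns.

    Two residues are in contact if they are at least `min_separation`
    positions apart along the chain and their C-alpha atoms are closer
    than `cutoff` angstroms.
    """
    found = set()
    for k, l in combinations(range(len(ca_coords)), 2):
        separated = abs(residue_numbers[k] - residue_numbers[l]) >= min_separation
        if separated and math.dist(ca_coords[k], ca_coords[l]) < cutoff:
            found.add((k, l))
    return found


def conserved_contact_fraction(coords_a, numbers_a, coords_b, numbers_b):
    """Shared contacts divided by the average number of contacts per structure."""
    contacts_a = contacts(coords_a, numbers_a)
    contacts_b = contacts(coords_b, numbers_b)
    average = 0.5 * (len(contacts_a) + len(contacts_b))
    if average == 0:
        return 0.0  # convention chosen here for pairs without any contacts
    return len(contacts_a & contacts_b) / average
```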

The root-mean-squared deviation (RMSD) was calculated using the superposition algorithm due to McLachlan.³⁰ Another measure that quantified the structural difference of proteins between the aligned regions and between the loops was based on the mathematical concept of Hausdorff distance.¹⁸,³¹ Let A = {a₁, …, a_m} and B = {b₁, …, b_n} be finite point sets in a Euclidean space. The Hausdorff distance between the sets A and B is then defined by:

$$d_H(A,B) = \max\left\{ \max_{1 \le i \le m}\, \min_{1 \le j \le n} d(a_i, b_j),\; \max_{1 \le j \le n}\, \min_{1 \le i \le m} d(a_i, b_j) \right\} \tag{1}$$

Here, d(a_i, b_j) denotes the Euclidean distance between the points a_i and b_j. In other words, the Hausdorff distance between the sets A and B is the smallest distance such that every point a_i ∈ A is within this distance of some point b_j ∈ B, and vice versa. Hausdorff distance can be defined under the assumption that the structural alignment between two domains is known and the Cα atoms for both structures are in a common coordinate frame.
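As an illustration of this definition, the Hausdorff distance between two small point sets can be computed by brute force. The sketch assumes both sets are non-empty and already superimposed in a common coordinate frame; the function names are chosen for the example.

```python
import math


def directed_distance(source, target):
    """Largest distance from a point of `source` to its nearest point in `target`."""
    return max(min(math.dist(p, q) for q in target) for p in source)


def hausdorff_distance(set_a, set_b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    return max(directed_distance(set_a, set_b), directed_distance(set_b, set_a))
```

For A = {(0,0,0), (1,0,0)} and B = {(0,0,0), (4,0,0)}, every point of A lies within 1 of B, but the point (4,0,0) of B is 3 away from the nearest point of A, so the Hausdorff distance is 3.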

The Hausdorff measure for loops (LHM) was calculated as follows:

$$\mathrm{LHM} = \frac{1}{n_s - 1} \sum_{i=1}^{n_s - 1} h_i \tag{2}$$

Here “loop” is defined as a region between two consecutive aligned secondary structure elements and n_s is the number of aligned secondary structure elements; h_i = 0 if the ith loop regions do not have any unaligned residues; otherwise h_i = d_H(A_i, B_i), where A_i contains the set of Cα coordinates of nonaligned residues in the ith loop of the first structure in a pair, the last aligned residue from the preceding aligned region, and the first aligned residue from the following aligned region. Similarly, B_i is defined for the second structure in a pair. The sets (A_i, B_i) are defined to include two aligned residues so that the measure can be defined even if one of the sets of nonaligned residues is empty. In the calculation of LHM, those pairs where one or the other protein had more than 25% missing residues in nonaligned loops were excluded. In the case of AHM, instead of the coordinates for the Cα atoms in the loops, we use the coordinates for the Cα atoms in the aligned segments and average over the number of aligned segments.
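A loop-based measure of this kind can be sketched as below. The example assumes that LHM is the mean of the h_i values over the loops between consecutive aligned elements, and it uses a hypothetical input layout in which each loop is a tuple of (unaligned Cα coordinates in structure 1, its two flanking aligned Cα coordinates, unaligned Cα coordinates in structure 2, its two flanking aligned Cα coordinates).

```python
import math


def hausdorff_distance(set_a, set_b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    def directed(source, target):
        return max(min(math.dist(p, q) for q in target) for p in source)
    return max(directed(set_a, set_b), directed(set_b, set_a))


def loop_hausdorff_measure(loops):
    """Average per-loop Hausdorff distance over all loops of a structure pair."""
    total = 0.0
    for unaligned_a, flanks_a, unaligned_b, flanks_b in loops:
        if not unaligned_a and not unaligned_b:
            continue  # h_i = 0: neither structure has unaligned residues here
        # The flanking aligned residues keep both sets non-empty.
        points_a = list(unaligned_a) + list(flanks_a)
        points_b = list(unaligned_b) + list(flanks_b)
        total += hausdorff_distance(points_a, points_b)
    return total / len(loops)
```

The exclusion of pairs with more than 25% missing loop residues would be applied before calling this function and is not shown.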

Definitions of disulfide bonds were obtained from the PDB files of all protein structures for each family. Bonds formed outside of the structure–structure alignment footprint regions (see “Test set” section) were disregarded. The average number of disulfide bonds per family was calculated as the sum of the number of SS-bonds in each protein in a family divided by the number of proteins. The fraction of conserved disulfide bonds was calculated as a ratio between the number of identical SS-bonds in a protein pair and the average number of disulfide bonds within the footprint regions of two proteins.

Statistical Analysis

The statistical analyses described in this study used the Splus statistical package (version 6). To investigate the relationship between sequence and structural similarity we performed correlation and regression analyses. The Pearson linear correlation (ρ) and Spearman rank correlation coefficients were calculated, and the p-value under the null hypothesis that the correlation coefficient was equal to zero was estimated. Those families with p-values less than 0.01 were considered as having correlation coefficients significantly different from zero. To quantify how much the nonlinear terms improve the data fitting we included a quadratic term in the linear model and performed nonlinear regression analysis. The ratio of the squared linear correlation coefficient for the linear model (R_l²) to the squared multiple correlation coefficient for the nonlinear model (R_n²), referred to below as the r²-ratio (r²-ratio = R_l²/R_n²), indicates the relative improvement in the data fitting upon inclusion of the nonlinear term in the model. The higher this ratio is, the lower the contribution of nonlinear terms to the data fitting.
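The correlation coefficient and the ratio of linear to quadratic fit quality can be illustrated with a small pure-Python sketch. It is not the Splus procedure used in the study: it fits polynomials by ordinary least squares through the normal equations, and it omits the p-value calculation. All function names are chosen for the example.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        rest = sum(aug[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (aug[r][n] - rest) / aug[r][r]
    return solution


def r_squared(x, y, degree):
    """Coefficient of determination of a least-squares polynomial fit."""
    m = degree + 1
    normal = [[sum(xi ** (i + j) for xi in x) for j in range(m)] for i in range(m)]
    moments = [sum(yi * xi ** i for xi, yi in zip(x, y)) for i in range(m)]
    coef = solve(normal, moments)
    my = sum(y) / len(y)
    ss_res = sum(
        (yi - sum(c * xi ** k for k, c in enumerate(coef))) ** 2
        for xi, yi in zip(x, y)
    )
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def r2_ratio(x, y):
    """Linear R^2 divided by quadratic R^2; values near 1 mean little curvature."""
    return r_squared(x, y, 1) / r_squared(x, y, 2)
```

For perfectly linear data the ratio is 1; for data that follow a pure parabola symmetric about the mean of x, the linear fit explains nothing and the ratio is 0.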

The F-test was used to test the null hypothesis that all regression coefficients are equal, with the alternative hypothesis being that the regression coefficients are not all equal. Because the null hypothesis was rejected, we employed multiple comparison procedures. First we checked which regression coefficients were different from each other by using the Tukey-Kramer method.³² For the purpose of illustrating the Tukey-Kramer method, the approximate method proposed by Gabriel can be applied, which computes the comparison intervals for all regression coefficients.³² According to Gabriel’s method, two regression coefficients are considered significantly different if and only if their comparison intervals do not overlap.
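One common way to test the equality of several regression slopes is to compare the variation among the slopes with the pooled residual variation around the separate regression lines. The sketch below computes only the F statistic and its degrees of freedom under that textbook formulation; it is an assumption that this matches the exact procedure used in the study, and the p-value, the Tukey-Kramer comparisons, and Gabriel's comparison intervals are not implemented.

```python
def sums_of_squares(x, y):
    """Return (Sxx, Sxy, Syy) about the group means."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    syy = sum((b - my) ** 2 for b in y)
    return sxx, sxy, syy


def slope_equality_f(groups):
    """F statistic for H0: all groups share one regression slope.

    `groups` is a list of (x_values, y_values) tuples, one per family.
    Returns (F, df_numerator, df_denominator).
    """
    k = len(groups)
    n_total = sum(len(x) for x, _ in groups)
    stats = [sums_of_squares(x, y) for x, y in groups]

    # Residual variation around each group's own regression line.
    ss_within = sum(syy - sxy ** 2 / sxx for sxx, sxy, syy in stats)

    # Variation explained by separate slopes minus that of one common slope.
    explained_separate = sum(sxy ** 2 / sxx for sxx, sxy, _ in stats)
    pooled_sxy = sum(sxy for _, sxy, _ in stats)
    pooled_sxx = sum(sxx for sxx, _, _ in stats)
    ss_among_slopes = explained_separate - pooled_sxy ** 2 / pooled_sxx

    df_num, df_den = k - 1, n_total - 2 * k
    f_stat = (ss_among_slopes / df_num) / (ss_within / df_den)
    return f_stat, df_num, df_den
```

A large F relative to the F distribution with the returned degrees of freedom argues against a single common slope.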

RESULTS

The Quality of Sequence–Structure Correlation for Different Protein Families

Table II shows the accuracy of correlation obtained between the BLAST bitscore per residue and various measures of structural similarity (RMSD, CC, AHM, and LHM). As can be seen from this table, the linear correlation is strong for most of the families, and half of them have correlation coefficients better than 0.73–0.87, depending on the structural similarity measure used (Table II lists Pearson correlation coefficients; Spearman rank correlation coefficients give similar results). This result is consistent with the studies of Wood and Pearson,²⁰ who showed on a smaller test set of 36 protein families that half of them have correlation coefficients greater than 0.878. Comparing different measures of structural similarity, one can see that normalized AHM tends to yield a stronger correlation than other quantities, yielding 98% of families with statistically significant linear correlation coefficients (with p-value <0.01). In agreement with this observation, our previous studies showed that the AHM measure performs very well in distinguishing homologs from analogs.¹⁸ High accuracy of the AHM is due to the higher sensitivity of the Hausdorff measure to subtle dissimilarities between the aligned parts of protein structures. Based on this observation, we chose this quantity to characterize the structural change in the present analysis.

TABLE II

Figure 1(a–d) illustrates the high quality of linear correlation for four protein families: Picornavirus capsid protein (pfam00073), Pancreatic ribonuclease (cd00163), GLFV-dehydrogenase (pfam00208), and Alpha-amylase (smart00632), which all have Pearson linear correlation coefficients less than −0.87. As shown in Figure 1(e–f), not all families, however, exhibit such good correlation between sequence and structure changes. The Trypsin-like serine protease family (cd00190), for example, has a correlation coefficient of only −0.57 [Fig. 1(f)], while the Copper-binding proteins family (pfam00127) is more adequately described by the nonlinear regression model taking into account higher order quadratic terms (r²-ratio being equal to 0.88) [Fig. 1(e)]. In the overall test set, among those with statistically significant correlation (79 families), 17 families had an r²-ratio smaller than 0.9 indicating that, for these cases, adding the nonlinear term improves the performance of modeling by about 10%. It should be noted that alignments from different sources but belonging to the same protein family (see Methods, Table I) except for three cases exhibit consistent behavior with respect to the quality of linear correlation. Furthermore, random exclusion of duplicate families does not have any effect on the quality of linear correlation, nor on the results discussed below.

Fig. 1

Normalized AHM is plotted versus BLAST bitscore per residue for (a) Picornavirus capsid protein (pfam00073), (b) Pancreatic ribonuclease (cd00163), (c) GLFV-dehydrogenase (pfam00208), (d) Alpha-amylase (smart00632), (e) Copper binding proteins family …

Although the correlation between protein sequence and structure is found to be statistically significant for the great majority of test families, there is still a high degree of variability in the magnitudes of the correlation coefficients among the families. There seems to be no strong relationship between the domain length (i.e., the average length of structure–structure alignments in a family) and the quality of linear correlation (ρ = −0.30, p-value = 0.01). No connection between correlation coefficients and contact density (ρ = −0.23, p-value = 0.04) or contact order³³ (ρ = −0.27, p-value = 0.02) has been observed either.

One might hypothesize that changes in structure should not always be strongly coupled with changes in amino acid sequence, especially if protein stability is determined mainly by the set of strong interactions such as covalent disulfide bonds. Figures 2 and 3 show how the quality of linear correlation depends on the disulfide bond content in protein families. As can be seen from Figure 2, protein families having on average two or more disulfide bonds per family (Sample 1, 13 families) exhibit rather poor sequence–structure correlation, whereas proteins from the families with high correlation coefficients usually contain fewer than two disulfide bonds (Sample 2, 68 families). We should note that the difference between these two distributions is not caused by the difference in the family length (there is no significant correlation between the number of disulfide bonds per family and protein length).

Fig. 2

The histogram shows the Pearson correlation coefficients between AHM and BLAST bitscore per residue for protein families with fewer (a) and more (b) than two disulfide bonds per family.

Fig. 3

Pearson correlation coefficient plotted against the number of disulfide bonds per family for the overall test set (circles) and only for those families which have more than 50% conserved disulfide bonds (crosses).

To test the difference between two distributions of correlation coefficients (Sample 1 and Sample 2), we applied the Wilcoxon two-sample test, which showed that these two samples come from populations with different mean values (the null hypothesis was rejected with the p-value = 0.0016). We found that the majority of S—S bonds in Sample 1 were well conserved among different family representatives (more than 75% conserved S—S bonds) except for the three cases of Carboxylesterase (pfam00135, 72% conserved S—S bonds), Trypsin-like serine protease (smart00020, 71% conserved S—S bonds), and Papain family Cysteine protease (pfam00112, 63% conserved S—S bonds), whereas two of these families (pfam00135 and pfam00112) are also characterized by high sequence–structure correlation (ρ = −0.80, ρ = −0.84).

Figure 3 shows as well that the quality of sequence–structure correlation depends on the average number of disulfide bonds per family (the correlation coefficient is 0.44 with p-value of 0.001). Because not all disulfide bonds are conserved in protein families, we also calculated the fraction of conserved S—S bonds per family and showed in this figure those families that had the fraction of conserved S—S bonds higher than 0.5 (Fig. 3, crosses). A high fraction of conserved S—S bonds in a family points to the preservation of specific S—S bonds in evolution and can be used as a measure of reliability of their definition (correlation coefficient for data points shown by crosses is equal to 0.64 with p-value of 0.0007).

The Evolutionary Plasticity of Structure Estimated for Different Protein Families

As we showed in the previous section, for the majority of families, the sequence–structure dependence can be quite well described by the linear regression. The regression coefficients (the slope of the regression line) in these cases would estimate the relative structural to sequence change in the evolution of a particular protein family or, in other words, “the evolutionary plasticity of structure” (EPS). This measure is discussed below in more detail. To compare regression coefficients for different protein families, first we excluded families with poor correlation (ρ_RMSD > −0.8 or ρ_AHM > −0.8) and large contribution of nonlinear terms (r²-ratio < 0.9). This filtering procedure resulted in 43 families with high linear correlation (these families are marked by asterisks in Table I). Figure 4 depicts the histogram of regression coefficients for this set of 43 protein families. As can be seen from this figure, the EPS varies by about a factor of 3 among different protein families. Likewise, Wood and Pearson²⁰ reported a 3.9-fold change in their “structural mutation sensitivity” for a similar but smaller test set.

Fig. 4

The histogram shows linear regression coefficients for each family with high correlation (see the caption for Table I).

Although the regression coefficients vary between families, one needs to test whether this difference is statistically significant. To compare the slopes of the various families, we first tested the null hypothesis that all regression coefficients are equal (see Methods). This hypothesis is rejected with P ≪ 0.0001. To determine which families have different structural tolerances, we employed multiple comparison methods and calculated the comparison intervals (95% confidence) for the regression coefficients of every protein family (Fig. 5). The comparison intervals are constructed such that two regression coefficients are significantly different if and only if their intervals do not overlap.³² As can be seen from Figure 5, there are apparently two groups of protein families that have significantly different regression coefficients and nonoverlapping comparison intervals, while the rest of the protein families do not exhibit a significant difference in slopes between each other.

Fig. 5

The linear regression coefficients (b) are plotted together with their comparison intervals (see Methods) for each family with high correlation (see the caption for Table I). All families are ordered with respect to the increasing regression coefficients. …

The first group consists of several protein families having the steepest slopes (highest EPS) and positioned on the left side of the plot. These include GLFV-dehydrogenases (pfam00208, b = −0.27), Copper/zinc superoxide dismutase (pfam00080, b = −0.22), Protein tyrosine phosphatase (smart00194, b = −0.21), and Proteasome A-type and B-type (pfam00227, b = −0.21). The second group is formed by proteins with the smallest EPS, which are positioned on the right side of Figure 5; among them are Picornavirus capsid protein family (pfam00073, cd00205, b = −0.10), Beta/gamma-crystallins (smart00247, b = −0.11), IPT/TIG domain (pfam01833, b = −0.12), and Xylose isomerase (pfam00259, b = −0.12). Interestingly enough, some protein families characterized by the lowest EPS form large interaction interfaces with other proteins or cell components. For example, Picornavirus capsid proteins are packed in highly ordered icosahedral shells that are maintained through multiple interactions between the subunits, whereas crystallins, IPT/TIG and Xylose isomerase domains also participate in macromolecular interactions.

Overall, we found that EPS values for the majority of protein families do not differ significantly between each other because their comparison intervals (see Methods) overlap. Because our test protein families spanned a wide range of structural folds (Table I) and functions, the previous observation implies that EPS, in general, depends neither on the structural class nor on the protein fold type. For example, the Glycosyl hydrolase family (smart00633) has an EPS of −0.18, whereas the aldo/keto reductase/K+ channel beta subunit family has an EPS of about −0.12, although both protein families have the TIM barrel fold. The superfolds, the most populated structural topologies (TIM barrels, beta trefoils, four-helical bundles, and others), show EPS values comparable to those of other folds (not shown).

The Evolutionary Plasticity Is Different in Loop Regions Compared to the Protein Core Regions

The evolutionary relatedness between proteins can be successfully gauged from the comparison of their loop regions.¹⁸,³⁴ Table II shows that, within the families of homologous proteins, structural changes in loops are strongly coupled with the evolutionary distance which, in this case, was measured by the normalized BLAST bitscore for the aligned regions. The sequence–structure dependence in loop regions for 71% of protein families (the test set for the loop analysis, see Methods) can be well described by a linear model and, for 88% of the protein families, the linear correlation coefficients are found to be statistically significant. Among families with a particularly high sequence–LHM correlation are the families of Xylose isomerase, Class I Histocompatibility antigen, Protein tyrosine phosphatase, IG-like plexins, and others. For some families, for example, Ribonuclease A, the sequence–structure correlation for loops is even higher than the correlation observed for aligned core regions. The linear sequence–structure correlation suggests that loop regions are, in general, under constant evolutionary pressure, which preserves their overall structure, and they therefore change gradually as proteins diverge.

To compare the EPS of aligned core regions with the EPS of loop regions, we computed the ratio of their regression coefficients (b^core/b^loop). The test set depicted in Figure 6 comprises 16 protein families with a good linear correlation for both LHM and AHM (with the requirement that both correlation coefficients are less than −0.8 and the r²-ratio is greater than 0.9). Assuming equal plasticity of core regions and loops (the null hypothesis), we expect that, in half of the instances, b^core/b^loop ratios will fall below 1, and in half of the instances these ratios will be above 1 (8:8 ratio). However, we observed 15 cases where the b^core/b^loop ratio was less than 1. The probability of observing such a bias under the above assumption can be estimated from the binomial distribution as p(0.5, 0, 16) + p(0.5, 1, 16) = 0.00026. Thus, equal plasticity of core regions and loops is not likely to be compatible with our observations. This suggests that loop regions have higher evolutionary plasticity of structure compared to the protein core and, as can be seen from Figure 6, for the majority of families (12 families), the ratio of regression coefficients for the core and loop regions lies between 0.2 and 0.6.
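The binomial estimate quoted above can be checked with a few lines of arithmetic. The function name and its argument order (success probability, number of successes, number of trials) simply mirror the p(0.5, k, 16) notation in the text.

```python
from math import comb


def binomial_pmf(p, k, n):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1.0 - p) ** (n - k)


# Probability that at most 1 of 16 families has a core/loop ratio above 1
# if ratios above and below 1 were equally likely.
tail = binomial_pmf(0.5, 0, 16) + binomial_pmf(0.5, 1, 16)
print(f"{tail:.5f}")  # 0.00026
```

The sum equals 17/65536, which rounds to the 0.00026 reported in the text. This is a one-sided tail; the text does not state whether a two-sided test was intended.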

Fig. 6

The histogram of the ratio between regression coefficients obtained for aligned parts (AHM used as a measure of structural similarity) and regression coefficients obtained for loops (LHM used as a measure of structural similarity).

DISCUSSION AND CONCLUSION

In this article, we study the structural evolution of homologous proteins in terms of their sequence–structure dependence. We showed that the protein structural variability for a great majority of protein families is linearly coupled with the sequence variability, which suggests that, typically, protein structure gradually changes as proteins diverge during evolution. However, when the protein structural core is stabilized by strong interactions such as disulfide bonds, the correlation between structural and sequence divergence is much weaker if detectable at all. Protein families that have a large number of disulfide bonds (which are usually conserved) typically do not show a linear sequence–structure correlation, in contrast to families with fewer disulfide bonds. Apparently, during the evolution of these families, purifying selection preserves the disulfide contacts and has a much weaker effect in the rest of the protein molecule such that, in these cases, the structural variability cannot be explained predominantly by the changes in sequence.

Drawing an analogy with solid mechanics, the sequence–structure dependence curves can be viewed as stress–strain curves where the physical body undergoes geometrical deformation after applying a stress. In the case of protein evolution, amino acid substitutions introduce the stress on protein structure, and structure either adjusts to the change or breaks apart. The linear dependences of measures of structural similarity on sequence similarity observed for the majority of protein families in our test set allow us to compare “the evolutionary plasticity of structure” (EPS) between different families. The evolutionary plasticity of structure for a given family is defined, accordingly, as the degree of structural variation per unit of sequence variation. Low values of EPS (shallow slope of the regression line) correspond to the situation when protein structure is highly conserved within a family of homologs relative to sequence changes. This could be caused either by strong functional constraints imposed on the structure or by high structural stiffness, that is, the inability to accommodate large structural variations without breaking the molecule apart. High values of EPS (steep slope) correspond to the situation when large structural shifts (within a framework of a given protein fold) can occur upon minor sequence divergence as a result of relaxed functional constraints on the structure and/or high structural tolerance of a given fold.

The rigorous statistical analysis performed in this work suggests that, with several exceptions, the values of the EPS for protein structural cores do not differ significantly between protein families. Interestingly, despite the variability among protein families in functional constraints and types of structural folds, proteins from different families respond similarly to sequence drift in evolution. This observation is based on the evaluation of multiple comparison intervals for the EPS values rather than on direct comparison of sequence–structure correlation slopes, as has been done by others.²⁰ One could argue that this result is an artifact caused by possible flaws in the analysis, such as insufficient structural data and/or the way the sequence and structure similarity measures were derived. However, the observed high correlation between sequence and structural divergence within individual families suggests that the analysis described here is robust. Moreover, the observed EPS values were not found to be statistically different, even though the test set was designed (protein families with high linear correlation and a sufficient number of sequences) to reduce the uncertainty of the EPS estimates.
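The difference between comparing raw slopes and comparing intervals around them can be illustrated with a simplified sketch. It is not the multiple comparison procedure used in this work: as a stand-in, it builds Bonferroni-adjusted intervals from a normal approximation to the slope estimate, and treats two families as distinguishable only if their intervals do not overlap. All function names and the `alpha` default are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist


def slope_with_se(x, y):
    """Return the OLS slope of y on x and its standard error."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least three paired observations")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    if sxx == 0:
        raise ValueError("x values must not be constant")
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    syy = sum((b - mean_y) ** 2 for b in y)
    slope = sxy / sxx
    # Residual sum of squares; clamp tiny negative rounding errors to zero.
    sse = max(syy - slope * sxy, 0.0)
    se = sqrt(sse / (n - 2) / sxx)
    return slope, se


def comparison_intervals(families, alpha=0.05):
    """Bonferroni-adjusted intervals for the slope of each family.

    `families` maps a family name to an (x, y) pair of sequences.
    Assumption: a normal quantile is used instead of a t quantile,
    which is only reasonable for families with many data points.
    """
    k = len(families)
    z = NormalDist().inv_cdf(1 - alpha / (2 * k))
    intervals = {}
    for name, (x, y) in families.items():
        slope, se = slope_with_se(x, y)
        intervals[name] = (slope - z * se, slope + z * se)
    return intervals


def intervals_overlap(a, b):
    """True if closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]
```

Under this scheme, two families whose slopes differ numerically can still be statistically indistinguishable when their intervals overlap, which is the sense in which the EPS values above are described as not significantly different.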

It is commonly observed that the size of sequence space is much larger than the size of structure space, and the number of different structural folds is rather small, estimated to be several thousand.³⁵–⁴⁰ Moreover, certain protein topologies are realized in evolution much more often than others (so-called “superfolds”), and the existence of such inequality in fold frequencies is sometimes attributed to specific physicochemical or geometrical properties of superfolds. Our results demonstrate that the gradual change of structure follows the same pattern in different protein families, suggesting that the role of intrinsic characteristics of superfolds in evolution might be exaggerated. In this respect we argue that the differences between common and rare folds may arise in evolution semirandomly, that is, via self-enhancing stochastic fluctuations in the abundance of essentially equal folds.⁷ In any case, until the existence and significance of differences in “evolutionary plasticity of structure” between protein families are conclusively demonstrated, there are probably no grounds to use their inequality as a working hypothesis in studies of protein structural evolution.

Acknowledgments

We thank Stephen Bryant (NCBI), Eugene Koonin (NCBI), and Nick Grishin (University of Texas Southwestern Medical Center) for helpful discussions, and Lewis Geer for help with the CDART database.

Footnotes

Grant sponsor: the NIH Intramural Research Program

References

1. Kimura, M. The neutral theory of molecular evolution. Cambridge: Cambridge University Press; 1983.

2. Chothia, C; Lesk, AM. The relation between the divergence of sequence and structure in proteins. EMBO J. 1986;5:823–826.

3. Holm, L; Sander, C. Mapping the protein universe. Science. 1996;273:595–603.

4. Murzin, AG. How far divergent evolution goes in proteins. Curr Opin Struct Biol. 1998;8:380–387.

5. Russell, RB; Saqi, MA; Sayle, RA; Bates, PA; Sternberg, MJ. Recognition of analogous and homologous protein folds: analysis of sequence and structure conservation. J Mol Biol. 1997;269:423–439.

6. Matsuo, Y; Bryant, SH. Identification of homologous core structures. Proteins. 1999;35:70–79.

7. Koonin, EV; Wolf, YI; Karev, GP. The structure of the protein universe and genome evolution. Nature. 2002;420:218–223.

8. Dokholyan, NV; Shakhnovich, B; Shakhnovich, EI. Expanding protein universe and its origin from the biological Big Bang. Proc Natl Acad Sci USA. 2002;99:14132–14136.

9. Qian, J; Luscombe, NM; Gerstein, M. Protein family and fold occurrence in genomes: power-law behaviour and evolutionary model. J Mol Biol. 2001;313:673–681.

10. Deeds, EJ; Shakhnovich, B; Shakhnovich, EI. Proteomic traces of speciation. J Mol Biol. 2004;336:695–706.

11. Finkelstein, AV; Ptitsyn, OB. Why do globular proteins fit the limited set of folding patterns? Prog Biophys Mol Biol. 1987;50:171–190.

12. Ptitsyn, OB; Finkelstein, AV. Similarities of protein topologies: evolutionary divergence, functional convergence or principles of folding? Q Rev Biophys. 1980;13:339–386.

13. Murphy, ME; Lindley, PF; Adman, ET. Structural comparison of cupredoxin domains: domain recycling to construct proteins with novel functions. Protein Sci. 1997;6:761–770.

14. Ponting, CP; Russell, RB. Identification of distant homologues of fibroblast growth factors suggests a common ancestor for all beta-trefoil proteins. J Mol Biol. 2000;302:1041–1047.

15. Copley, RR; Bork, P. Homology among (betaalpha)(8) barrels: implications for the evolution of metabolic pathways. J Mol Biol. 2000;303:627–641.

16. Arcus, V. OB-fold domains: a snapshot of the evolution of sequence, structure and function. Curr Opin Struct Biol. 2002;12:794–801.

17. Kinch, LN; Grishin, NV. Evolution of protein structures and functions. Curr Opin Struct Biol. 2002;12:400–408.

18. Panchenko, AR; Madej, T. Analysis of protein homology by assessing the (dis)similarity in protein loop regions. Proteins. 2004;57:539–547.

19. Flores, TP; Orengo, CA; Moss, DS; Thornton, JM. Comparison of conformational characteristics in structurally similar protein pairs. Protein Sci. 1993;2:1811–1826.

20. Wood, TC; Pearson, WR. Evolution of protein sequences and structures. J Mol Biol. 1999;291:977–995.

21. Koehl, P; Levitt, M. Sequence variations within protein families are linearly related to structural variations. J Mol Biol. 2002;323:551–562.

22. Marchler-Bauer, A; Anderson, JB; DeWeese-Scott, C; Fedorova, ND; Geer, LY; He, S; Hurwitz, DI; Jackson, JD; Jacobs, AR; Lanczycki, CJ; Liebert, CA; Liu, C; Madej, T; Marchler, GH; Mazumder, R; Nikolskaya, AN; Panchenko, AR; Rao, BS; Shoemaker, BA; Simonyan, V; Song, JS; Thiessen, PA; Vasudevan, S; Wang, Y; Yamashita, RA; Yin, JJ; Bryant, SH. CDD: a curated Entrez database of conserved domain alignments. Nucleic Acids Res. 2003;31:383–387.

23. Marchler-Bauer, A; Panchenko, AR; Shoemaker, BA; Thiessen, PA; Geer, LY; Bryant, SH. CDD: a database of conserved domain alignments with links to domain three-dimensional structure. Nucleic Acids Res. 2002;30:281–283.

24. Berman, HM; Bhat, TN; Bourne, PE; Feng, Z; Gilliland, G; Weissig, H; Westbrook, J. The Protein Data Bank and the challenge of structural genomics. Nat Struct Biol. 2000;7(Suppl):957–959.

25. Gibrat, JF; Madej, T; Bryant, SH. Surprising similarities in structure comparison. Curr Opin Struct Biol. 1996;6:377–385.

26. Chen, J; Anderson, JB; DeWeese-Scott, C; Fedorova, ND; Geer, LY; He, S; Hurwitz, DI; Jackson, JD; Jacobs, AR; Lanczycki, CJ; Liebert, CA; Liu, C; Madej, T; Marchler-Bauer, A; Marchler, GH; Mazumder, R; Nikolskaya, AN; Rao, BS; Panchenko, AR; Shoemaker, BA; Simonyan, V; Song, JS; Thiessen, PA; Vasudevan, S; Wang, Y; Yamashita, RA; Yin, JJ; Bryant, SH. MMDB: Entrez’s 3D-structure database. Nucleic Acids Res. 2003;31:474–477.

27. Geer, LY; Domrachev, M; Lipman, DJ; Bryant, SH. CDART: protein homology by domain architecture. Genome Res. 2002;12:1619–1623.

28. Altschul, SF; Gish, W; Miller, W; Myers, EW; Lipman, DJ. Basic local alignment search tool. J Mol Biol. 1990;215:403–410.

29. Marchler-Bauer, A; Bryant, SH. Measures of threading specificity and accuracy. Proteins. 1997;Suppl 1:74–82.

30. McLachlan, AD. Gene duplications in the structural evolution of chymotrypsin. J Mol Biol. 1979;128:49–79.

31. Preparata, FP; Shamos, MI. Computational geometry, an introduction. New York: Springer-Verlag; 1985.

32. Sokal, RR; Rohlf, FJ. Biometry: the principles and practice of statistics in biological research. New York: Freeman & Company; 1995.

33. Plaxco, KW; Simons, KT; Baker, D. Contact order, transition state placement and the refolding rates of single domain proteins. J Mol Biol. 1998;277:985–994.

34. Panchenko, AR; Madej, T. Structural similarity of loops in protein families: toward the understanding of protein evolution. BMC Evol Biol. 2005;5:10–15.

35. Chothia, C. Proteins. One thousand families for the molecular biologist [news]. Nature. 1992;357:543–544.

36. Wolf, YI; Grishin, NV; Koonin, EV. Estimating the number of protein folds and families from complete genome data. J Mol Biol. 2000;299:897–905.

37. Coulson, AF; Moult, J. A unifold, mesofold, and superfold model of protein fold use. Proteins. 2002;46:61–71.

38. Liu, X; Fan, K; Wang, W. The number of protein folds and their distribution over families in nature. Proteins. 2004;54:491–499.

39. Grant, A; Lee, D; Orengo, C. Progress towards mapping the universe of protein folds. Genome Biol. 2004;5:107.

40. Zhang, C; DeLisi, C. Estimating the number of protein folds. J Mol Biol. 1998;284:1301–1305.

41. Murzin, AG; Brenner, SE; Hubbard, T; Chothia, C. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol. 1995;247:536–540.

42. Yang, AS; Honig, B. An integrated approach to the analysis and modeling of protein sequences and structures. II. On the relationship between sequence and structural similarity for proteins that are not obviously related in sequence. J Mol Biol. 2000;301:679–689.