Contact matrix
There are two distinct ways to assign hydrophobic contacts between residues. The first is based on the criterion of an atom-to-atom contact. Given a protein structure with n amino acids, the contact matrix, , is initially set to zero. If the distance between any two atoms of two hydrophobic residues (i,j) is less than the sum of their van der Waals radii plus 0.5 Å, the of the contact matrix is set to a value of 2.0. To emphasize long-range (i.e., non-local) interactions, all are set to 0.0 if abs(i-j) < 4, except those residues which are backbone connected. In this case, is set to 1.0. In this study, Ala, Val, Leu, Ile, Phe, Trp, Met, Pro, Gly, Tyr, and Cys are considered hydrophobic residues.
In the second approach, the non-polar ASA buried between residues is utilized in the contact matrix. We first define the to be the non-polar ASA of residue i buried by residue j. If and , which reflect the hydrophobic contacts between i and j, are both larger than a preset value (25.0 Å), is set to a value of . Otherwise, it is set to zero. As in the atom-to-atom contact criterion, long-range interactions are emphasized by setting all to 0.0 if abs(i-j) < 4, except backbone connected residues. For these, is set to 1.0. We found both matrices to be quite similar. However, in general, the second approach has a sightly better performance in cutting out hydrophobic folding units. There are two possible reasons for the superiority in the utilization of the non-polar ASA which is buried between the residues in the contact matrix. First, hydropobic contacts involving the hydrophobic atoms of hydrophilic residues are taken into account, for example the aliphatic side-chain of lysine. Second, whereas in a residue-by-residue approach a constant value of 2 is assigned to a hydrophobic-hydrophobic residue contact, in a surface area based approach, the hydrophobic contacts are weighted by the extent of the surface which is in contact. We have therefore adopted the latter, ASA criterion. Examples of both types of contact matrices of ``1hfh'' are given in Figure 1.
Reciprocal averaging
The solution of the reciprocal averaging procedure is shown to be equivalent to
an eigenvalue-eigenvector problem of the positive semi-definite symmetric
matrix,
where R is a diagonal matrix of the row totals of the contact matrix C (for details see Hill, 1973; Holm & Sander, 1994). That is, the first non-trivial eigenvector of the D matrix represents in its amplitude a new ordering, which reflects the grouping information. However, there is a trivial solution (1, 1, 1, ...) with a maximal eigenvalue of 1. To avoid getting this trivial solution, one subtracts the component of the trivial solution for every trial vector in an iterative routine. Starting with a random trial vector, the eigenvalue-eigenvector solution is considered solved if either the norm of the trial vector converges with changes of less than 0.00001, or the iteration cycle reaches a maximum of 500 times. Figure 2 gives the first non-trivial eigenvector of ``1hfh''. A new contact matrix, whose residue order is reassigned by the amplitude of the eigenvector, is illustrated in Figure 3.
What is a cut
Since hydrophobic clustering is implicitly expressed by the amplitudes of the solved eigenvector, a cut is specified by an index value within the range of the eigenvector, dividing the protein into two parts. The first part contains those residues whose eigenvector's amplitudes are smaller than the index value, and the second part consists of those whose eigenvector's amplitudes are larger than the index value. In Figure 2, with the cut index at 0.0, the first cut part has residues 1 - 63, and the second part, 64 - 120. Obviously, a cut along the eigenvector's amplitude will produce arbitrary cutting points in the sequence. For instance, if in Figure 2 we choose the cut index at -0.09, there are five cutting points in the sequence, and each of the two cut units consisting of three segments (for instance, 1 - 8; 15 - 28; 45 - 56, for the first part). With this definition, as many trial units as twice the size of the protein can be generated along the amplitude of the solved eigenvector.
(a) Reducing the number of segments
Since a cut by a one-dimensional search is not always perfect, it is necessary to resort to two additional processes, to improve it. First, we join two segments which are separated by a short piece, which has not been assigned to any of the hydrophobic folding units. If the size of any such remaining unassigned fragment of amino acids, either at the end of the chain, or between the hydrophobic folding units, is less than half of the minimum length of a segment (here we use 8 residues as the minimum segment length), the respective short segment is added to the trial cut unit. Hence, this procedure may fuse two cut segments into one. Second, to reduce the number of segments in the cut unit, we eliminate any selected segment whose length is less than the minimum length of a segment.
(b) Combinatorial assembly
After discarding short fragments in a trial generation of a hydrophobic folding unit, a risk still exists that unwanted segments may have been included. To prevent this situation, the unit is examined for all combinations of the segments. Consider a trial cut containing three segments. Six combinations of the three segments, (1), (2), (3), (1,2), (1,3) and (2,3) are generated, where 1,2 and 3 represent the first, second and third segments, respectively, and (1,2) implies that only the first two segments are considered to constitute the new trial cut. The combination achieving the best score in the combinatorial assembly is considered as the final cut folding unit for that cut index. If there are n segments in a unit, the total number of combinations is , (including the one containing all segments). This procedure ensures getting the best cutting of the hydrophobic units as suggested by the hydrophobic contact matrix.
Bidirectional cutting
Along the amplitude of the eigenvector, we generate a series of trial folding units as described above. Moving the cutting-index from negative toward positive values and considering only the first part of a cut, is termed forward cutting. Backward cutting is moving the cutting-index in the opposite direction, and considering only the second part of a cut as a trial cut. The cut producing a hydrophobic folding unit with a local optimal score in a series of cuts, is the folding unit which is excised. The algorithm iterates alternatively along the opposite amplitudes of the eigenvector. If the number of residues left in the short linkers is less than a preset minimum (30 residues) the cutting algorithm terminates. A local optimal score is defined as follows. We temporarily keep the cut that has the highest score and count how many cuts have been executed in the follow-up trial cuts. If the score of a follow-up trial cut is higher than the registered one, we replace it with the new one and reset the follow-up count to zero. A local optimal cut is finalized if either the follow-up count is over 30 or the cutting index reaches the end. ``1hfh'' has been cut into two hydrophobic folding units. The scores of both the forward and backward cutting are plotted in Figure 4.
Adding unselected pieces
Since the cutting suggested by the reordered contact matrix is not perfect, some residues which are left unassigned should preferably be added to already cut units. For this reason, we join the unassigned pieces to the previously cut hydrophobic folding units if it increases their score. We do this cleaning up of unassigned residues in two steps. First, we try to add, combinatorially, unassigned segments (with length > 8) to all cut units, as described above. Second, only one unassigned residue is sequentially added to all existing units. We iterate through this, ``residue-attaching'' process, until there is no further increase in the score by single residue additions for any of the already cut units.