The algorithm of SARF2

The goal is to detect 3D similarities of the backbone fragments, without topological restrictions.

The idea of the method is to find all the compatible (superimposable) secondary structure elements, and then to form the large ensembles of mutually compatible elements.


The algorithm was designed keeping in mind that the biologically meaningful similarities of protein structures consist mostly of secondary structure elements: a-helices and/or beta-strands. Instead of operating on a large number of atoms or residues we cosider only the protein's secondary structure elements (SSEs), which are represented as vectors. Our goal is to find large sets of the SSEs in two protein structures wich can be superimposed with a small r.m.s.d. We define that a pair of the SSEs from one protein and a pair of the SSEs from another protein are compatible if they satisfy certain distance and angle constrains. Usually it means that the significant portion of these compatible pairs of SSEs can be superimposed with relatively small r.m.s.d. After enumerating all the compatible pairs of the SSEs of two proteins, we find several larger ensembles of the mutually compatible SSEs. Finally, to refine the match, we go back from the secondary structure representation to the Ca-trace representation, including other Ca-atoms (or other fragments of the Ca-trace) which are consistent with the initial match. Thus, our algorithm consists of the four steps:
  1. Secondary structure assignment;
  2. Search for the compatible pairs of the SSEs;
  3. Search for large ensembles of the mutually compatible pairs of SSEs;
  4. Extension and refinement of the match.

Secondary structure assignment

The input for the algorithm are two PDB entries of the proteins to be compared. Only Ca-traces are used. To assign the secondary structure to the Ca-trace we slide prototypes of the a-helix and the b-strand along the protein backbone. The fragment of the Ca-trace is assigned a particular secondary structure if it superimposes with a small enough rmsd to one of the prototypes. Usually we use a Ca-trace fragment size of 5 residues, an rmsd limit for alpha-helix of 0.4Å, and an rmsd limit for beta-strand of 0.8Å. The typical helixand strand are both taken from the same PDB entry -- 1bp2, residues 104 - 109 and 80 - 85 correspondently, according to
Unger, 1989. The rmsd limits were chosen to make the number of the residues in a-helical conformation and b-strand conformation approximately equal. Our assignment, based only on the conformation of the trace, but not on the pattern of the hydrogen bonds, usually fits well with the helix assignments in PDB files. For b-strands, however, our assignment can vary somewhat because we do not require hydrogen bond formation for the b-strand, considering even a single extended strand as being in the b- conformation. The vectors representing each SSE are calculated as in Kraulis, 1991.

Search for the compatible pairs of the SSEs

For each pair {Si,Sj} of the SSEs from protein A, we calculate: The medium line is defined by the middle point of the shortest segment between axes of the SSEs and the vector (Si/|Si|+dij*Sj/|Sj|), dij = +/- 1. We chose the sign of dij to place projections of the vectors Si and Sj to the medium line on one side from the projection of the closest points. In the case when vector Si goes through the closest point, we devide it into two parts, separated by the closest point Cij. Totally we have 5 parameters to determine the compatibility of two pairs of SSEs: angle Gij, and two distances for each of the SSEs (Dimin, Dimax, Djmin, Djmax).

We consider pairs of the SSEs {SAi, SAj} from protein A and {SBk, SBl} from protein B as compatible, if :

These distance and angle constrains filter out differently arranged pairs of SSEs, leaving only superimposable SSEs. In contrast with the distance constrains, based on the distance between centers of the SSEs, our filtration procedure can detect difficult cases, when only relatively small segments of the SSEs are compatible, or when one of the SSEs is significantly larger than another.

Search for the largest ensemble of the compatible pairs of SSEs

This problem is equivalent to the maximum clique problem and known to be NP-complete. To avoid combinatorial explosion, we made a restriction on the maximum distances between SSE, to be incorporated into the ensemble. This restriction automatically exclude meaningless cases of the random matches of the distant SSEs. We incorporate a SSE into the ensemble, only if the distance between this and all the others SSEs from the ensemble is less than D0 (usually D0 = 25Å). The search for the largest ensemble is straightforward by recursive scheme. In the program we store in memory not only the largest ensemble, but several (up to 3000) different ensembles.

Extension and refinement of the match

After finding large ensembles of the mutually compatible SSEs, we need to find the correspondence between Ca-atoms from two proteins. For these purpose we apply an iterative procedure, beginning from the initial approximation assuming the correspondence between centers of the superimposable fragments of the compatible SSE. Thus we can calculate translation vector and rotation matrix which roughly match all the SSE from our ensemble. However, this superposition can be improved significantly by small additional rotation or translation. To obtain better superposition we first try to include as many as possible Ca-atoms from the SSEs. We allow deletions in SSEs and use dynamic programming technique to find the best correspondence. Subsequently, we try to collect other fragments of the Ca-trace, which may not be picked up at the previous stages of the algorithm and may not be in the helical or extended conformation. We unite those fragments into the ensemble unless the rmsd does not exceed the threshold value (usually 3.2Å). Finally, we recalculate the rotation matrix from this new Ca- correspondence and repeat the procedure several (usually 4 or 5) times.

Significance of the similarities. Similarity score.

We evaluated significance of the similarity from the statistical distribution of the similarity score. As if all of the matches are within small threshold of the rmsd ( 3.2Å) we could use a number of residues N in the match as an approximate measure of the similarity. In fact, we use a similarity score S = 4*N/(2+rmsd) to increase the score of matches with smaller rmsd. If rmsd = 2.0Å, S = N. Comparing a structure of protein A with all the others we compute a distribution of the similarity score S. From this distribution, for each protein B we can calculate the statistical significance of the similarity between proteins A and B in units of standard deviations Z(A,B). Although usually Z(A, B) and Z(B, A) are correlated, frequently it may happened that Z(A, B) is very big, while Z(B, A) is a small value. For example, if A is a big protein of 300 residues, it is easy to find 60 resudue match with many proteins. However, for a small protein B 60 residue match would be very rare, resulting in a lager Z-score. To determine the structural similarity between proteins A and B, we used the score S = Z(A, B) + Z(B, A), which gives a symmetrical value of the significance of similarity between the proteins.

References

Unger, R., Harel, D., Wherland, S. & Sussman, J.L. Proteins 5, 355-373 (1989).

Kraulis, P.J. J. App. Cryst. 24, 946-950 (1991).


SARF2 algorithm / Nick Alexandrov / nicka@ncifcrf.gov