STRUCTURE-BASED SEQUENCE ALIGNMENTS IN PASS2 AND ASSESSMENT OF ALIGNMENTS
The numbers of protein sequences and structures deposited in databanks (1,2) indicate an ever-increasing gap between protein sequence and structural information (3,4), which is widened further by genome sequencing projects. Homologous proteins share a high degree of sequence, structural and functional similarity (5–9). Homologous families can be grouped easily by simple sequence searches, whereas superfamily members, which adopt the same fold and perform similar biological roles (10–17), can often be identified by sensitive fold prediction algorithms followed by a careful alignment of sequences.
SCOP [Structural Classification of Proteins (17)] is a dictionary of protein structural entries organised at different hierarchical levels of structural and functional similarity. SCOP (1.53 release) records 26 219 protein domains, which are grouped into only 564 folds, suggesting a strong structural convergence of proteins. CAMPASS (18), the first version of a protein superfamily database, covers 69 superfamilies and records alignments of proteins obtained using COMPARER (19). The availability of such alignment databases over the World Wide Web makes it possible to study specific superfamilies and design experiments on them; the databases also permit systematic surveys and analyses of various structural properties, as well as fold prediction. In addition, the construction of three-dimensional models using homology modelling techniques is usually reliable where the sequence identity between the query and the structural homologues (templates) is ≥30%. Analyses of structural and sequence differences amongst known superfamily members can provide useful guidelines for modelling distantly related proteins. We report a semi-automated, updated version of the superfamily alignment database, which has been designed to be in direct correspondence with the SCOP database [1.53 release (17)].
PASS2 superfamily alignments are in concordance with SCOP definitions of superfamily classification and domain boundaries, unlike CAMPASS (18), where the grouping of protein domains could differ and results from DIAL (20) were employed to consider alternative domain boundaries. This choice was made to facilitate automation. Nearly 600 single-member superfamilies have been included in PASS2. The superposed set of protein co-ordinates can be viewed using molecular graphics software such as RASMOL (21). A rough structural phylogeny, based on root mean square deviations (r.m.s.d.) at structurally equivalent positions amongst superfamily members, is provided. Sequence search engines, both for text strings and for alignment using PSI-BLAST, have been introduced in PASS2 to provide user-friendly access to the database. It is also possible to obtain augmented superfamily alignments that include the query sequence or homologues from genome databases. PASS2 is connected with 72 genome databases of model organisms, enabling access to homologous gene products. Structure-based sequence alignments and superposed protein structure co-ordinates corresponding to superfamily alignments can be extracted, as in CAMPASS (18). In addition, PASS2 offers the possibility to download accessory structural files of individual superfamily members, such as solvent accessibility, hydrogen bonding and secondary structural data. Links to other databases, such as CATH (22,23), FSSP (24), PFAM (25), PALI (26), 3Dee (27,28), PROMOTIF (29), PROCHECK (30,31) and SYSTERS (32), both to the individual entries and to the main home pages, are provided from PASS2. Tables 1 and 2 provide a list of links and useful data downloadable by the user across the World Wide Web.
The superfamilies are named after their codes in the SCOP database: for example, 02.03.068 refers to the Biotin-carboxylase N-terminal domain-like superfamily. All the protein domains under a superfamily are considered, and one representative protein domain entry from each family, that with the best resolution and R-factor, is chosen for a preliminary alignment. NMR structures are considered equivalent to a 3.2 Å resolution crystal structure in the present context. Protein structural co-ordinates have been obtained from the Protein Data Bank (2), and protein domain co-ordinates of the desired chain and domain boundaries considered in the superfamily are extracted using the CHAINRESALL (R. Sowdhamini, unpublished results) program. ATM2SEQ (33) is used to obtain the corresponding amino acid sequences, and MALIGN (33) is used for a multiple-sequence alignment with a constant gap penalty of 40. MOTIFS (R. Sowdhamini, unpublished results) provides a percentage identity matrix, which is examined using PERL scripts to derive a non-redundant representative set of protein domains for the superfamily such that, as far as possible, no two proteins are >27% identical by the MALIGN alignment (see Fig. 1 for a flow chart).
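The selection of a non-redundant set can be illustrated with a minimal sketch. It is not the PERL scripts used for PASS2: the greedy strategy, the `Domain` class and the function names are assumptions made for illustration. Only the 27% identity cut-off and the treatment of NMR structures as 3.2 Å entries come from the text.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional

NMR_EQUIVALENT_RESOLUTION = 3.2  # Å, value stated in the text
IDENTITY_CUTOFF = 27.0           # percent, value stated in the text


@dataclass
class Domain:
    """Hypothetical record for one candidate protein domain."""
    name: str
    resolution: Optional[float]      # None is used here to mark an NMR structure
    r_factor: Optional[float] = None


def quality_key(domain: Domain):
    """Sort key: lower resolution value first, then lower R-factor."""
    resolution = (NMR_EQUIVALENT_RESOLUTION
                  if domain.resolution is None else domain.resolution)
    r_factor = float("inf") if domain.r_factor is None else domain.r_factor
    return (resolution, r_factor)


def select_non_redundant(domains: List[Domain],
                         identity: Dict[FrozenSet[str], float],
                         cutoff: float = IDENTITY_CUTOFF) -> List[Domain]:
    """Greedy filter (an assumed strategy): visit domains from best to worst
    quality and keep one only if it is not more than `cutoff` percent
    identical to any domain already kept."""
    kept: List[Domain] = []
    for candidate in sorted(domains, key=quality_key):
        if all(identity[frozenset((candidate.name, k.name))] <= cutoff
               for k in kept):
            kept.append(candidate)
    return kept


# Toy data: names and identities are invented for the example.
domains = [Domain("domB", 2.5, 0.20), Domain("domA", 1.8, 0.18),
           Domain("domC", None)]
identity = {frozenset(("domA", "domB")): 35.0,
            frozenset(("domA", "domC")): 20.0,
            frozenset(("domB", "domC")): 15.0}
representatives = [d.name for d in select_non_redundant(domains, identity)]
```

In this toy case `domB` is dropped because it is 35% identical to the better-resolved `domA`, while the NMR entry `domC` is retained. A greedy pass does not guarantee the largest possible non-redundant set, which is in keeping with the "as far as possible" wording above.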
The non-redundant set of superfamily representatives was chosen for a rigorous structure-based sequence alignment using COMPARER (34). The method requires pairs of superposed sets of co-ordinates of the proteins to be aligned. Superposition is achieved by the choice of ‘initial equivalences’, which serve as seeds for pairwise rigid-body superposition using PMNFC, a modified form of MNYFIT (35). In order to construct PASS2, we initially tested the quality of alignments obtained by employing the non-gap positions of the MALIGN-derived alignment as initial equivalences, and found the resulting alignments to be reliable and reasonably correct for all except one out of 10 randomly selected superfamilies. Alignment accuracies were manually examined for the absence of gaps in the middle of secondary structures and for the conservation of core secondary structures. The superposed structures corresponding to the alignment were examined on the graphics to ensure the absence of insertions or deletions in the middle of α-helices and β-strands. As far as possible, we have also ensured that the functionally important residues in the members of superfamilies with highly similar functions are topologically equivalent. Initial equivalences were chosen from the initial MALIGN results. Pairs of superposed co-ordinates and the derived equivalences are employed to extend alignments guided by the similarity in the structural environment of individual amino acids. Structural environments of amino acids are described by their backbone conformation (secondary structure and cis-peptide bond), pattern of exposure to solvent, and patterns of hydrogen bonding and disulphide bond connectivity. These are stored in accessory structural data files for all the superfamily members. The final COMPARER-derived alignments are annotated for the structural information using JOY4.0 (7,9). The final alignments will be assessed for unusual average r.m.s.d. for individual members and for the deletion of conserved secondary structures (R. Sowdhamini, unpublished results). Problematic alignments are being examined by resorting to a different structure comparison program, STAMP4.0 (36), which does not require initial equivalences, before alignment through COMPARER. Where STAMP is not appropriate, the simulated annealing option in COMPARER is being employed and the COMPARER runs are repeated (see Fig. 2 for a flow chart).
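The use of non-gap alignment positions as initial equivalences can be sketched as follows. This is an illustration only and not part of COMPARER, PMNFC or MALIGN: the function name, the use of `-` as the gap character and the zero-based residue indices are assumptions.

```python
from typing import List, Tuple


def initial_equivalences(aligned_a: str, aligned_b: str,
                         gap: str = "-") -> List[Tuple[int, int]]:
    """Return (index_in_a, index_in_b) pairs for every alignment column in
    which both sequences carry a residue. Indices count residues in the
    ungapped sequences, starting at 0."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned rows must have the same length")
    pairs: List[Tuple[int, int]] = []
    i = j = 0  # running residue counters for sequences a and b
    for res_a, res_b in zip(aligned_a, aligned_b):
        has_a = res_a != gap
        has_b = res_b != gap
        if has_a and has_b:
            pairs.append((i, j))
        if has_a:
            i += 1
        if has_b:
            j += 1
    return pairs


# Toy pairwise alignment (invented sequences).
seeds = initial_equivalences("AC-DE", "A-GDE")
```

Here `seeds` holds the three columns in which both rows have a residue; the two gapped columns are skipped but still advance the residue counter of the sequence that is present. Such pairs would then serve as the seed set for a rigid-body superposition.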
Pairwise percentage identities of the final structure-based sequence alignments are presented in the form of a symmetric matrix. MNYFIT (35) is employed to obtain a rigid-body superposition of the superfamily members, without an update of equivalences, where the initial equivalences are chosen as the non-gap positions of the final alignment. The r.m.s.d. values of the structurally equivalent regions (the non-gap positions of the final alignment) are employed to construct a phylogeny of the superfamily members. Such r.m.s.d. values, though not an accurate representation of the structural relationships between representative members of a superfamily, are nevertheless useful in providing a quantitative estimate of the evolutionary divergence of the various homologous families under the query superfamily.
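The two quantities described above can be computed with a short sketch. It makes two assumptions not stated in the text: percentage identity is normalised by the number of columns in which both sequences have a residue, and the r.m.s.d. is taken over equivalent atoms (for example Cα atoms) of co-ordinates that have already been superposed. The fitting step performed by MNYFIT is not reproduced, and the function names are illustrative.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def percent_identity(row_a: str, row_b: str, gap: str = "-") -> float:
    """Identity over columns where both rows have a residue (assumed
    normalisation; other conventions exist)."""
    paired = [(a, b) for a, b in zip(row_a, row_b) if a != gap and b != gap]
    if not paired:
        return 0.0
    matches = sum(1 for a, b in paired if a == b)
    return 100.0 * matches / len(paired)


def identity_matrix(rows: Sequence[str]) -> List[List[float]]:
    """Symmetric matrix of pairwise percentage identities."""
    n = len(rows)
    matrix = [[100.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = percent_identity(rows[i], rows[j])
    return matrix


def rmsd(coords_a: Sequence[Point], coords_b: Sequence[Point]) -> float:
    """r.m.s.d. between equivalent, already superposed positions."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty lists of equal length")
    total = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))


# Toy alignment and co-ordinates (invented values).
matrix = identity_matrix(["ACDE", "AC-E", "GCDE"])
deviation = rmsd([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
                 [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)])
```

A matrix of pairwise r.m.s.d. values built with `rmsd` could then be passed to any distance-based clustering method to draw the tree; the choice of tree-building method is not specified in the text.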
Association of genome sequences from around 60 sources with superfamilies in PASS2 was performed using PSI-BLAST searches (37), which are sensitive in identifying distant relatives and convenient for automatic searches (38). The searches were performed with 10 iterations and a liberal E-value of 0.01, using each of the representative superfamily members as a query against the genome databases. The genome sequences, which are either homologues or additional superfamily members, are aligned with the original structure-based alignment and re-annotated using JOY. Where possible, links to such structure-annotated alignments with genome sequence homologues of superfamily members are provided. Figure 3 shows one such structure-based sequence alignment.
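A search with these settings could be set up roughly as below. This is a sketch under assumptions: it uses the `psiblast` executable from the current NCBI BLAST+ suite, which may differ from the program version used for PASS2, and the file and database names are hypothetical. The text does not say whether the E-value of 0.01 was the reporting threshold or the profile-inclusion threshold; the sketch applies it to `-evalue` (reporting), and `-inclusion_ethresh` would be the alternative.

```python
from typing import List


def psiblast_command(query_fasta: str, database: str, out_path: str,
                     iterations: int = 10,
                     evalue: float = 0.01) -> List[str]:
    """Build the argument list for one PSI-BLAST run (BLAST+ syntax).
    Defaults follow the values quoted in the text."""
    return [
        "psiblast",
        "-query", query_fasta,            # one representative superfamily member
        "-db", database,                  # a formatted genome protein database
        "-num_iterations", str(iterations),
        "-evalue", str(evalue),           # assumed to be the reporting threshold
        "-out", out_path,
    ]


# Hypothetical file names, for illustration only.
command = psiblast_command("representative.fasta", "genome_proteins",
                           "representative_hits.txt")
```

The list can be handed to `subprocess.run(command, check=True)` on a machine where BLAST+ is installed and the database has been formatted; one such run would be needed for each representative member and genome database.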
PASS2 is a compendium of structure-based sequence alignments of distantly related proteins grouped at the superfamily level in direct correspondence with SCOP definitions. In addition, PASS2 acts as a ‘junction’ point from which links of representative superfamily members to genome, sequence and structural databases can be obtained. Structural phylogenies of superfamily members provide a crude but quantitative estimate of evolutionary relationships where sequence-based comparison breaks down due to poor sequence identity. Structural phylogenies have been reported earlier, for example, for the distantly related hemoglobin family (39) and for the phosphate-binding superfamilies of the triose-phosphate isomerase TIM-barrel fold (40). Both the CAMPASS (19) and PASS2 databases are unique in annotating the structural environment of individual residues on the sequence alignment using JOY (7,9). In addition, PASS2 makes the structural files of individual proteins downloadable across the Web and provides links to other superfamily members in genome databases.
Statistical analyses of SCOP (41) show that a vast majority of the protein domains under each of the hierarchical structural classifications, with respect to class, fold and superfamily, are single members. PASS2 deliberately includes single-member superfamilies in order to ensure the representation of such examples in fold libraries and in profiles generated for fold prediction. We are currently employing superfamily alignments in PASS2 to analyse the retention of structural features and the deviation of structural parameters, which will be useful in modelling distantly related proteins. We are also performing sequence searches in genome databases with structural templates of superfamily alignments as additional constraints.
We are grateful to Professor Sir Tom Blundell for the first version of the database, and to his group for useful discussions. R.S. is a recipient of a Wellcome Trust Senior Research Fellowship, and V.M. is supported by the Wellcome Trust.