PASS2: a semi-automated database of Protein Alignments Organised as Structural Superfamilies

Nucleic Acids Res. 2002 January 1; 30(1): 284–288.

PMCID: PMC99156

V. Mallika, Anirban Bhaduri, and R. Sowdhamini^a

National Centre for Biological Sciences, UAS-GKVK Campus, Bangalore 560 065, India

^aTo whom correspondence should be addressed. Tel: +91 80 3636421; Fax: +91 80 3636662; Email: mini@ncbs.res.in

Received August 13, 2001; Revised October 30, 2001; Accepted October 30, 2001.

Abstract

PASS2 is a nearly automated version of CAMPASS and contains sequence alignments of proteins grouped at the level of superfamilies. This database has been created to correspond directly with the SCOP database (1.53 release) and currently consists of 110 multi-member superfamilies and 613 superfamilies corresponding to single members. In multi-member superfamilies, protein chains with no more than 25% sequence identity have been considered for the alignment, and hence the database aims to address sequence alignments which represent 26 219 protein domains under the SCOP 1.53 release. Structure-based sequence alignments have been obtained by COMPARER; the initial equivalences are provided automatically from a MALIGN alignment and subsequently augmented using STAMP4.0. The final sequence alignments have been annotated for structural features using JOY4.0. Several links are provided to other related databases and to genome sequence relatives. The availability of reliable sequence alignments of distantly related proteins, despite poor sequence identity, and the inclusion of single-member superfamilies permit better sampling of structures in libraries for fold recognition of new sequences and a better understanding of protein structure–function relationships of individual superfamilies. The database can be queried by keywords and also by sequence search, interfaced by PSI-BLAST methods. Structure-annotated sequence alignments and several structural accessory files can be retrieved for all the superfamilies, including the user-input sequence. The database can be accessed from http://www.ncbs.res.in/%7Efaculty/mini/campass/pass.html.

INTRODUCTION

The number of protein sequences and structures deposited in databanks (1,2) indicates an ever-increasing gap between protein sequence and structural information (3,4), which is further amplified by genome sequencing projects. Homologous proteins share a high degree of sequence, structural and functional similarity (5–9). Homologous families can be easily grouped by simple sequence searches, whereas superfamily members, adopting the same fold and performing similar biological roles (10–17), can often be identified only by sensitive fold prediction algorithms followed by a careful alignment of sequences.

SCOP [Structural Classification of Proteins (17)] is a dictionary of protein structural entries organised at different hierarchies of structural and functional similarities. SCOP (1.53 release) records 26 219 protein domains which are grouped into merely 564 folds, suggesting a strong structural convergence of proteins. CAMPASS (18) forms the first version of a protein superfamily database, corresponding to 69 superfamilies, which records the alignments of proteins aligned using COMPARER (19). Availability of such alignment databases over the World Wide Web offers the possibility to study and design experiments on specific superfamilies; they also permit systematic survey and analysis of various structural properties and support fold prediction. In addition, the construction of three-dimensional models using homology modelling techniques is usually reliable where the sequence identity between the query and the structural homologues (templates) is ≥30%. Analyses of structural and sequence differences amongst known superfamily members can hopefully provide useful guidelines for modelling distantly related proteins. We report the semi-automated updated version of the superfamily alignment database, which has been designed to be in direct correspondence with the SCOP database [1.53 release (17)].

FEATURES IN PASS2

PASS2 superfamily alignments are in concordance with SCOP definitions of superfamily classification and domain boundaries, as opposed to CAMPASS (18), where there could be differences in grouping protein domains and results from DIAL (20) were employed to consider alternate domain boundaries. This was decided to facilitate the automation processes. Nearly 600 single-member superfamilies have been included in PASS2. The superposed set of protein co-ordinates can be viewed using graphics software such as RASMOL (21). A rough structural phylogeny, corresponding to root mean square deviations (r.m.s.d.) at structurally equivalent positions amongst superfamily members, is provided. Search engines, based both on text strings and on sequence alignment using PSI-BLAST, have been introduced in PASS2 to provide user-friendly access to the database. It is also possible to obtain augmented superfamily alignments including the query sequence or homologues from genome databases. PASS2 is connected with 72 genome databases of model organisms, enabling access to homologous gene products. Structure-based sequence alignments and superposed protein structure co-ordinates corresponding to superfamily alignments are extractable as in CAMPASS (18). In addition, PASS2 offers the possibility to download accessory structural files of individual superfamily members, such as solvent accessibility, hydrogen bonding and secondary structural data. Links to other databases, such as CATH (22,23), FSSP (24), PFAM (25), PALI (26), 3Dee (27,28), PROMOTIF (29), PROCHECK (30,31) and SYSTERS (32), both to the individual entries and to the main home pages, are provided from PASS2. Tables 1 and 2 provide a list of links and useful data downloadable by the user across the World Wide Web.

Table 1. Links in superfamily alignment database: main links for each superfamily

Table 2. Links in superfamily alignment database: links for all the members within a superfamily

SELECTION OF SUPERFAMILIES AND CHOICE OF SUPERFAMILY MEMBERS

The superfamilies are named after their codes in the SCOP database: for example, 02.03.068 refers to the Biotin-carboxylase N-terminal domain-like superfamily. All the protein domains under a superfamily are considered and one representative protein domain entry, of the best resolution and R-factor from each family, is chosen for a preliminary alignment. NMR structures are considered equivalent to a 3.2 Å resolution crystal structure in the present context. Protein structural co-ordinates have been obtained from the Protein Data Bank (2), and protein domain co-ordinates of the desired chain and domain boundaries considered in the superfamily are extracted using the CHAINRESALL program (R. Sowdhamini, unpublished results). ATM2SEQ (33) is used to obtain the corresponding amino acid sequences and MALIGN (33) to produce a multiple-sequence alignment using a constant gap penalty of 40. MOTIFS (R. Sowdhamini, unpublished results) provides a percentage identity matrix which is examined, using PERL scripts, to derive a non-redundant representative set of protein domains for the superfamily such that, as far as possible, no two proteins are >27% identical by the MALIGN alignment (see Fig. 1 for a flow chart).
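The selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the PERL scripts used for PASS2: the `Domain` record, the function names and the greedy best-first filtering strategy are all hypothetical, and the ranking of NMR entries without an R-factor behind crystal structures of equal resolution is a guess. Only the 3.2 Å NMR equivalence and the 27% identity cut-off come from the text.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional, Tuple

NMR_EQUIVALENT_RESOLUTION = 3.2  # from the text: NMR entries rank as a 3.2 A crystal structure
IDENTITY_CUTOFF = 27.0           # from the text: no two members >27% identical


@dataclass(frozen=True)
class Domain:
    """Hypothetical record for one protein domain entry."""
    name: str
    family: str
    resolution: Optional[float]  # None for NMR structures
    r_factor: Optional[float]    # None when not reported


def quality_key(d: Domain) -> Tuple[float, float]:
    """Lower is better: resolution first, then R-factor (assumed tie-breaker)."""
    res = d.resolution if d.resolution is not None else NMR_EQUIVALENT_RESOLUTION
    r = d.r_factor if d.r_factor is not None else float("inf")
    return (res, r)


def best_per_family(domains: List[Domain]) -> List[Domain]:
    """Keep the best-quality entry of each family, returned best-first."""
    best: Dict[str, Domain] = {}
    for d in domains:
        if d.family not in best or quality_key(d) < quality_key(best[d.family]):
            best[d.family] = d
    return sorted(best.values(), key=quality_key)


def non_redundant(candidates: List[Domain],
                  identity: Dict[FrozenSet[str], float],
                  cutoff: float = IDENTITY_CUTOFF) -> List[Domain]:
    """Greedy filter (an assumption): walk candidates best-first and keep a
    domain only if it is at most `cutoff` % identical to every domain kept so far."""
    kept: List[Domain] = []
    for d in candidates:
        if all(identity[frozenset((d.name, k.name))] <= cutoff for k in kept):
            kept.append(d)
    return kept
```

Here `identity` stands in for the percentage identity matrix, keyed by unordered pairs of domain names. A greedy pass is order-dependent, so a different traversal order could retain a different representative set.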

Figure 1. Steps involved in the choice of representative members of a superfamily alignment (see text for details). Representative members usually have the best resolution and R-factor and are non-redundant, such that no two members have >27% sequence ...

AUTOMATION OF STRUCTURE-BASED SEQUENCE ALIGNMENTS IN PASS2 AND ASSESSMENT OF ALIGNMENTS

The non-redundant set of superfamily representatives was chosen for a rigorous structure-based sequence alignment using COMPARER (34). The method requires pairs of superposed sets of co-ordinates of the proteins to be aligned. Superposition is achieved by the choice of ‘initial equivalences’, which serve as seeds for pairwise rigid-body superposition using PMNFC, a modified form of MNYFIT (35). In order to construct PASS2, we initially tested the quality of alignments obtained by employing non-gap positions of the MALIGN-derived alignment as initial equivalences and found that the resulting alignments were reliable and reasonably correct for all except one of 10 randomly selected superfamilies. Alignment accuracy was examined manually for the absence of gaps in the middle of secondary structures and for the conservation of core secondary structures. The superposed structures corresponding to the alignment were examined on graphics to ensure the absence of insertions or deletions in the middle of α-helices and β-strands. As far as possible we have also ensured that the functionally important residues in the members of superfamilies with highly similar functions are topologically equivalent. Initial equivalences were chosen from the initial MALIGN results. Pairs of superposed co-ordinates and the derived equivalences are employed to extend alignments guided by the similarity in the structural environment of individual amino acids. Structural environments of amino acids are described by their backbone conformation (secondary structure and cis-peptide bond), pattern of exposure to solvent and patterns of hydrogen bonding and disulphide bond connectivity. These are stored in accessory structural data files for all the superfamily members. The final COMPARER-derived alignments are annotated for the structural information using JOY4.0 (7,9). The final alignments will be assessed for unusual average r.m.s.d. for individual members and for deletions of conserved secondary structures (R. Sowdhamini, unpublished results). Problematic alignments are being examined by resorting to a different structure comparison program, STAMP4.0 (36), which does not require initial equivalences, before alignment through COMPARER. Where STAMP is not appropriate, the simulated annealing option in COMPARER is being employed and the COMPARER runs re-performed (see Fig. 2 for a flow chart).
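Two of the steps above lend themselves to a short illustration: taking the non-gap positions of a pairwise alignment as initial equivalences, and flagging gaps that fall in the middle of a helix or strand. The Python sketch below is hypothetical. The function names, the `-` gap character and the one-letter secondary structure labels (`H` for helix, `E` for strand) are assumptions, and it does not reproduce the behaviour of COMPARER, MALIGN or PMNFC.

```python
from typing import List, Tuple

GAP = "-"  # assumed gap character


def initial_equivalences(row_a: str, row_b: str) -> List[Tuple[int, int]]:
    """Return 0-based residue index pairs for columns where neither row has a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    pairs: List[Tuple[int, int]] = []
    i = j = 0  # running residue indices in protein A and protein B
    for ca, cb in zip(row_a, row_b):
        if ca != GAP and cb != GAP:
            pairs.append((i, j))
        if ca != GAP:
            i += 1
        if cb != GAP:
            j += 1
    return pairs


def gaps_inside_elements(row: str, ss: str) -> List[int]:
    """Return alignment columns where `row` has a gap flanked on both sides by
    residues carrying the same helix ('H') or strand ('E') label.

    `ss` holds one label per residue of the ungapped sequence in `row`.
    """
    if len(ss) != sum(c != GAP for c in row):
        raise ValueError("ss must have one label per residue")
    flagged: List[int] = []
    seen = 0  # residues encountered so far
    for col, c in enumerate(row):
        if c != GAP:
            seen += 1
        elif 0 < seen < len(ss) and ss[seen - 1] == ss[seen] and ss[seen] in "HE":
            flagged.append(col)
    return flagged
```

A non-empty result from `gaps_inside_elements` would mark an alignment for the kind of manual inspection described above. Two distinct elements of the same type that directly abut are treated as one element in this sketch.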

Figure 2. Steps involved in the structure-based sequence alignment of superfamily representatives. COMPARER (19) is primarily employed for the alignment where initial equivalences are obtained from MALIGN (33). Where the resulting alignment has problems due to ...

STRUCTURAL PHYLOGENY AND GENOME SEARCHES

Pairwise percentage identities of the final structure-based sequence alignments are presented in the form of a symmetric matrix. MNYFIT (35) is employed to obtain a rigid-body superposition of the superfamily members, without an update of equivalences, where the initial equivalences are chosen as non-gap positions corresponding to the final alignment. The r.m.s.d. values of the structurally equivalent regions, i.e. the non-gap positions of the final alignment, are employed to construct a phylogeny of the superfamily members. Such r.m.s.d. values, though not an accurate representation of the structural relationships between representative members of a superfamily, are nevertheless useful in providing a quantitative estimate of the evolutionary divergence of the various homologous families under the query superfamily.
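As a rough illustration of the two quantities used here, the following Python sketch computes a symmetric percentage identity matrix from aligned rows and the r.m.s.d. over equivalent positions of two structures that have already been superposed. The function names are hypothetical, the choice of denominator for percentage identity (columns where both rows have a residue) is an assumption, and the superposition itself is not performed, so this is not a substitute for MNYFIT.

```python
import math
from typing import Dict, Sequence, Tuple

Coord = Tuple[float, float, float]


def percent_identity(row_a: str, row_b: str) -> float:
    """Identity over columns where both aligned rows have a residue (assumed convention)."""
    aligned = [(a, b) for a, b in zip(row_a, row_b) if a != "-" and b != "-"]
    if not aligned:
        return 0.0
    same = sum(a == b for a, b in aligned)
    return 100.0 * same / len(aligned)


def identity_matrix(rows: Dict[str, str]) -> Dict[Tuple[str, str], float]:
    """Symmetric matrix of pairwise percentage identities, keyed by name pairs."""
    return {(p, q): percent_identity(rows[p], rows[q]) for p in rows for q in rows}


def rmsd(coords_a: Sequence[Coord], coords_b: Sequence[Coord]) -> float:
    """Root mean square deviation between equivalent atoms of two
    already-superposed structures (one coordinate triple per equivalent position)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = sum((xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
                for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))
```

A matrix of pairwise `rmsd` values could then be passed to any distance-based clustering method to draw a tree; the text does not state which method was used for PASS2.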

Association of genome sequences from around 60 sources with superfamilies in PASS2 was performed using PSI-BLAST search (37), which is sensitive in identifying distant relatives and convenient for automatic searches (38). Such searches were performed with 10 iterations and a liberal E-value of 0.01, using each of the representative superfamily members as a query against the genome databases. The genome sequences, which are either homologues or additional superfamily members, are aligned with the original structure-based alignment and re-annotated using JOY. Where possible, links to such structure-annotated alignments with genome sequence homologues of superfamily members are provided. Figure 3 shows one such structure-based sequence alignment.
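For readers who wish to run a comparable search today, the stated parameters could be passed to a PSI-BLAST executable roughly as sketched below. This assumes the NCBI BLAST+ `psiblast` program, which is probably not the executable used for PASS2. The file and database names are placeholders, and mapping the quoted E-value of 0.01 onto the `-evalue` option is an assumption, since the text does not say whether it was the reporting or the profile-inclusion threshold.

```python
from typing import List


def psiblast_command(query_fasta: str, database: str, out_path: str,
                     iterations: int = 10, evalue: float = 0.01) -> List[str]:
    """Build an argument list for BLAST+ psiblast (assumed tool).

    Defaults follow the text: 10 iterations and an E-value of 0.01.
    """
    return [
        "psiblast",
        "-query", query_fasta,          # one representative superfamily member
        "-db", database,                # a formatted genome protein database
        "-num_iterations", str(iterations),
        "-evalue", str(evalue),
        "-out", out_path,
    ]
```

The list can be handed to `subprocess.run(cmd, check=True)` once BLAST+ is installed and the database has been formatted. Repeating the call for each representative member mirrors the per-member queries described above.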

Figure 3. Representative structure-based sequence alignment along with the related genome sequences. This example is for the N-terminal domain of the biotin carboxylase-like superfamily; genome sequences have been identified from Zea mays using PSI-BLAST (37) (1mmc ...

SCOPE OF PASS2

PASS2 is a compendium of structure-based sequence alignments of distantly related proteins grouped at the superfamily level in direct correspondence with SCOP definitions. In addition, PASS2 acts as a ‘junction’ point to obtain links of representative superfamily members to genome, sequence and structural databases. Structural phylogenies of superfamily members provide a crude but quantitative estimate of evolutionary relationships where sequence-based comparison breaks down due to poor sequence identity. Structural phylogenies have been reported earlier, for example, for the distantly related hemoglobin family (39) and for the phosphate-binding superfamilies of the triose-phosphate isomerase TIM-barrel fold (40). Both the CAMPASS (18) and PASS2 databases are unique in annotating the structural environment of individual residues on the sequence alignment using JOY (7,9). In addition, PASS2 makes these structural files of individual proteins downloadable across the Web and provides links to other superfamily members in genome databases.

Statistical analyses of SCOP (41) show that a vast majority of the protein domains under each of the hierarchical structural classifications, with respect to class, fold and superfamily, are single members. PASS2 deliberately includes single-member superfamilies in order to ensure the representation of such examples in fold libraries and in profiles generated for fold prediction. We are currently employing superfamily alignments in PASS2 to analyse the retention of structural features and the deviation of structural parameters, which will be useful in modelling distantly related proteins. We are also performing sequence searches in genome databases with structural templates of superfamily alignments as additional constraints.

ACKNOWLEDGEMENTS

We are grateful to Professor Sir Tom Blundell for the first version of the database, and to his group for useful discussions. R.S. is a recipient of a Wellcome Trust Senior Research Fellowship, and V.M. is supported by the Wellcome Trust.

REFERENCES

1. Bairoch,A. and Apweiler,R. (2000) The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res., 28, 45–48.

2. Bernstein,F.C., Koetzle,T.F., Williams,G.J.B., Meyer,E.F., Brice,M.D., Rodgers,J.R., Kennard,O., Shimanouchi,T. and Tasumi,M. (1977) The Protein Data Bank: a computer-based archival file for macromolecular structures. J. Mol. Biol., 112, 535–542.

3. Gonnet,G.H., Cohen,M.A. and Benner,S.A. (1992) Exhaustive matching of the entire protein sequence database. Science, 256, 1443–1445.

4. Sander,C. and Schneider,R. (1993) The HSSP database of protein structure-sequence alignments. Nucleic Acids Res., 21, 3105–3109.

5. Rossmann,M.G. and Argos,P. (1977) The taxonomy of protein structure. J. Mol. Biol., 109, 99–129.

6. Richardson,J.S. (1981) The anatomy and taxonomy of protein structure. Adv. Prot. Chem., 34, 167–339.

7. Overington,J.P., Johnson,M.S., Sali,A. and Blundell,T.L. (1990) Tertiary structural constraints on protein evolutionary diversity: templates, key residues and structure prediction. Proc. R. Soc. (London), B241, 132–145.

8. Overington,J.P., Zhu,Z.Y., Sali,A., Johnson,M.S., Sowdhamini,R., Louie,G.V. and Blundell,T.L. (1993) Molecular recognition in protein families: a database of aligned three-dimensional structures of related proteins. Biochem. Soc. Trans., 21, 597–604.

9. Mizuguchi,K., Deane,C.M., Blundell,T.L., Johnson,M.S. and Overington,J.P. (1998) JOY: protein sequence-structure representation and analysis. Bioinformatics, 14, 617–623.

10. Blundell,T.L., Bedarkar,S., Rinderknecht,E. and Humbel,R.E. (1978) Insulin-like growth factor 1. A model for tertiary structure accounting for immunoreactivity and receptor binding. Proc. Natl Acad. Sci. USA, 75, 180–184.

11. Lesk,A.M. and Chothia,C. (1980) How different amino acid sequences determine similar protein structures: the structure and evolutionary dynamics of the globins. J. Mol. Biol., 136, 225–270.

12. Chothia,C. (1984) Principles that determine the structures of proteins. Annu. Rev. Biochem., 53, 537–572.

13. Murthy,M.R.N. (1984) A fast method of comparing protein structure. FEBS Lett., 168, 97–102.

14. Holm,L., Ouzounis,C., Sander,C., Tuparev,G. and Vriend,G. (1992) A database of protein-structure families with common folding motifs. Protein Sci., 1, 1691–1698.

15. Russell,R.B. and Barton,G.J. (1994) Structural features can be unconserved in proteins with similar folds. An analysis of side-chain to side-chain contacts, secondary structure and accessibility. J. Mol. Biol., 244, 332–350.

16. Orengo,C.A., Jones,D.T. and Thornton,J.M. (1994) Protein superfamilies and domain superfolds. Nature, 372, 631–634.

17. Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247, 536–540.

18. Sowdhamini,R., Burke,D.F., Huang,J.F., Mizuguchi,K., Nagarajaram,H.A., Srinivasan,N., Steward,R.E. and Blundell,T.L. (1998) CAMPASS: a database of structurally aligned protein superfamilies. Structure, 6, 1087–1094.

19. Sali,A. and Blundell,T.L. (1990) Definition of general topological equivalence in protein structures. A procedure involving comparison of properties and relationships through simulated annealing and dynamic programming. J. Mol. Biol., 212, 403–428.

20. Sowdhamini,R. and Blundell,T.L. (1995) Automatic identification and analysis of domains in proteins of known crystal structure. Protein Sci., 4, 506–520.

21. Sayle,R.A. and Milner-White,E.J. (1995) RASMOL: biomolecular graphics for all. Trends Biochem. Sci., 20, 374.

22. Orengo,C.A., Michie,A.D., Jones,S., Jones,D.T., Swindells,M.B. and Thornton,J.M. (1997) CATH—a hierarchic classification of protein domain structures. Structure, 5, 1093–1108.

23. Pearl,F.M.G., Lee,D., Bray,J.E., Sillitoe,I., Todd,A.E., Harrison,A.P., Thornton,J.M. and Orengo,C.A. (2000) Assigning genomic sequences to CATH. Nucleic Acids Res., 28, 277–282.

24. Holm,L. and Sander,C. (1996) Mapping the protein universe. Science, 273, 595–602.

25. Sonnhammer,E.L., Eddy,S.R. and Durbin,R. (1997) Pfam: a comprehensive database of protein domain families based on seed alignments. Proteins, 28, 405–420.

26. Balaji,S., Sujatha,S., Kumar,S.S. and Srinivasan,N. (2001) PALI—a database of Phylogeny and ALIgnment of homologous protein structures. Nucleic Acids Res., 29, 61–65.

27. Dengler,U., Siddiqui,A.S. and Barton,G.J. (2001) Protein structural domains: analysis of the 3Dee domains database. Proteins, 42, 332–344.

28. Siddiqui,A.S., Dengler,U. and Barton,G.J. (2001) 3Dee: a database of protein structural domains. Bioinformatics, 17, 200–201.

29. Hutchinson,E.G. and Thornton,J.M. (1996) PROMOTIF—a program to identify and analyze structural motifs in proteins. Protein Sci., 5, 212–220.

30. Morris,A.L., MacArthur,M.W., Hutchinson,E.G. and Thornton,J.M. (1992) Stereochemical quality of protein structure coordinates. Proteins, 12, 345–364.

31. Laskowski,R.A., MacArthur,M.W., Moss,D.S. and Thornton,J.M. (1993) PROCHECK: a program to check the stereochemical quality of protein structures. J. Appl. Crystallogr., 26, 283–291.

32. Krause,A., Stoye,J. and Vingron,M. (2000) The SYSTERS protein sequence cluster set. Nucleic Acids Res., 28, 270–272.

33. Johnson,M.S., Overington,J.P. and Blundell,T.L. (1993) Alignment and searching for common protein folds using a data bank of structural templates. J. Mol. Biol., 231, 735–752.

34. Sali,A. and Blundell,T.L. (1990) Definition of general topological equivalence in protein structures—a procedure involving comparison of properties and relationships through simulated annealing and dynamic programming. J. Mol. Biol., 212, 403–428.

35. Sutcliffe,M.J., Haneef,I., Carney,D. and Blundell,T.L. (1987) Knowledge based modelling of homologous proteins, Part I: Three-dimensional frameworks derived from the simultaneous superposition of multiple structures. Protein Eng., 1, 377–384.

36. Russell,R.B. and Barton,G.J. (1992) Multiple protein sequence alignment from tertiary structure comparison: assignment of global and residue confidence levels. Proteins, 14, 309–323.

37. Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.

38. Park,J., Teichmann,S.A., Hubbard,T. and Chothia,C. (1997) Intermediate sequences increase the detection of homology between sequences. J. Mol. Biol., 273, 349–354.

39. Johnson,M.S., Sutcliffe,M.J. and Blundell,T.L. (1990) Molecular anatomy: phyletic relationships derived from three-dimensional protein structures. J. Mol. Evol., 30, 43–59.

40. Bork,P., Gellerich,J., Groth,H., Hooft,R. and Martin,F. (1995) Divergent evolution of a β/α-barrel subclass: detection of numerous phosphate-binding sites by motif search. Protein Sci., 4, 268–274.

41. Brenner,S.E., Chothia,C. and Hubbard,T.J. (1997) Population statistics of protein structures: lessons from structural classifications. Curr. Opin. Struct. Biol., 7, 369–376.