CBB seminar
April 21, 11am, Bldg. 38A, 8th floor conference room
Yo Matsuo (NCBI/NLM/NIH)
Discrimination of remote homologs from analogs by using homologous core substructures
Yo Matsuo and Stephen H. Bryant (NCBI/NLM/NIH)
The three-dimensional structure of a protein is well conserved during evolution. Similarity in structure can therefore often suggest remote homology between proteins, even when their sequences show no significant similarity. However, not all similar structures are necessarily homologous. In the present work, we ask whether it is possible to distinguish remote homologs from analogs using a database of structure alignments.
A set of 1125 domains was selected that had little or no sequence similarity with each other. Their structure neighbors (structurally similar domains) were detected from within the same set. Of the neighbors, 2632 were homologs and 8493 were analogs. As reported by other groups, these homologs and analogs were not well distinguished by percent sequence identity, RMSD, or structure alignment length. However, we found that analogs did not overlap very much with the Homologous Core Substructure (HCS).
The HCS of each domain was defined as the subset of its residues that is structurally conserved in 80% or more of its homologs. The fraction of the HCS shared by each neighbor (the HCS overlap score) was then calculated. When the overlap score of a homolog was calculated, that homolog was first removed from the set used to define the HCS. Using an overlap score of 0.88 as a threshold, 75.1% (about three-fourths) of the homologs were correctly identified, with a false-positive rate of 12.4% among analogs. To attain the same true-positive rate, the other three measures (percentage of identical residues, RMSD, and structure-structure alignment length) gave false-positive rates of 45.1%, 69.8%, and 40.5%, respectively. Examples of pairs of remote homologs suggested by the HCS included the N-terminal FAD-binding domain of NADH peroxidase and the ADP-binding domain of trimethylamine dehydrogenase.
The HCS describes structural conservation in the same way that a sequence motif describes sequence conservation. The HCS may be used to identify homologs at greater evolutionary distances. In the seminar, I will also discuss some details of the non-redundant PDB data set and of the cross-reference between MMDB and SCOP domains used in the present work.
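The HCS definition and overlap score described in the abstract can be illustrated with a minimal sketch. It assumes that each structure alignment is represented as the set of residue indices of the query domain that are aligned to the other domain. The function names, the data layout, and the choice to count a score equal to the threshold as a predicted homolog are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter


def homologous_core(aligned_by_homolog, min_fraction=0.8):
    """Residues of the query domain aligned in >= min_fraction of its homologs.

    aligned_by_homolog maps a homolog identifier to the set of query-domain
    residue indices that are structurally aligned to that homolog
    (hypothetical data layout).
    """
    n = len(aligned_by_homolog)
    if n == 0:
        return set()
    counts = Counter()
    for residues in aligned_by_homolog.values():
        counts.update(set(residues))
    return {res for res, c in counts.items() if c / n >= min_fraction}


def hcs_overlap(neighbor_id, neighbor_aligned, aligned_by_homolog,
                min_fraction=0.8):
    """Fraction of the HCS covered by a neighbor's alignment.

    If the neighbor is itself a homolog, it is left out of the set used to
    define the HCS, as described in the abstract.
    """
    others = {h: r for h, r in aligned_by_homolog.items() if h != neighbor_id}
    core = homologous_core(others, min_fraction)
    if not core:
        return 0.0  # assumption: an empty core gives a score of zero
    return len(core & set(neighbor_aligned)) / len(core)


def is_predicted_homolog(score, threshold=0.88):
    """Apply the 0.88 threshold; treating ties as homologs is an assumption."""
    return score >= threshold
```

With this layout, a neighbor that covers only part of the core receives a low score and is predicted to be an analog, while a homolog is always scored against a core built from the remaining homologs.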