Research Abstracts from the
DOE Genome Contractor-Grantee Workshop IX

January 27-31, 2002 Oakland, CA

 


Bioinformatics Abstracts


74. Understanding Protein Interactions

Xiaoqun Joyce Duan, Ioannis Xenarios, and David Eisenberg

UCLA-DOE Laboratory of Structural Biology and Molecular Medicine, University of California, Los Angeles, P. O. Box 951570, Los Angeles, California 90095-1570

joyce@mbi.ucla.edu

Networks of protein interactions control the lives of cells. One research interest in our lab focuses on understanding protein interactions and protein function using bioinformatic approaches. We have summarized studies of interacting proteins from the scientific literature in a database, the Database of Interacting Proteins (http://dip.doe-mbi.ucla.edu). DIP is designed to capture the layered information about protein interactions, which can be divided into physical interactions and biological interactions. Biological protein interactions differ from the more general set of physical interactions in their prerequisite for specific protein states and in the resultant transitions in the protein states of one or both of the interacting proteins. DIP contains information on physical interactions, including the identities of the interacting proteins, their interacting regions, the binding affinity, and the experimental methods. LiveDIP, an extension of DIP, contains data on biological interactions, which are described in terms of protein states and state transitions. This data scheme provides a more complete picture of protein interactions inside cells. We have developed advanced search tools, such as Pathfinder and Batch searches, to assemble pathways from the currently available knowledge of protein interactions collated in LiveDIP. JDIP2D has also been developed to interactively explore interaction networks; it provides customized graph rendering, annotation, local storage, and printing of protein interaction networks. These data and tools have offered some insights into protein interaction networks. Analysis of all the interactions in DIP indicates that many proteins form a single connected network of interactions, accompanied by several smaller networks. An example of the pathway analysis tools applied to the pheromone response pathway in yeast suggests that the pathway functions in the context of a complex protein-protein interaction network and that both positive and negative regulation are important in modulating signal intensity. Integrating gene expression data with this interaction network suggests some regulatory mechanisms for signaling processes. Computational methods have also been developed to evaluate the overall quality of large-scale yeast two-hybrid experiments using gene-expression data. Future directions include expanding the database, providing additional tools for analyzing the validity of interactions, developing computational methods for predicting protein-protein interactions, and studying cell signaling on the genome scale.
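
To give a concrete flavor of the pathway-assembly queries described above, the following Python sketch stores a toy interaction network as an adjacency list and enumerates simple paths between two proteins by breadth-first search. It is an illustration only: the protein names, the data structure, and the search strategy are assumptions of this sketch, not the actual DIP/LiveDIP schema or the Pathfinder implementation.

    from collections import deque

    # Toy undirected protein-protein interaction network (illustrative entries only).
    interactions = {
        "STE2": {"GPA1"},
        "GPA1": {"STE2", "STE4"},
        "STE4": {"GPA1", "STE5", "STE20"},
        "STE5": {"STE4", "STE11", "STE7", "FUS3"},
        "STE20": {"STE4", "STE11"},
        "STE11": {"STE5", "STE20", "STE7"},
        "STE7": {"STE5", "STE11", "FUS3"},
        "FUS3": {"STE5", "STE7", "STE12"},
        "STE12": {"FUS3"},
    }

    def find_paths(network, source, target, max_len=7):
        """Enumerate simple interaction paths (up to max_len proteins) from source to target."""
        paths, queue = [], deque([[source]])
        while queue:
            path = queue.popleft()
            if path[-1] == target:
                paths.append(path)
                continue
            if len(path) >= max_len:
                continue
            for neighbor in network.get(path[-1], ()):
                if neighbor not in path:        # keep paths simple (no revisits)
                    queue.append(path + [neighbor])
        return paths

    for p in find_paths(interactions, "STE2", "STE12"):
        print(" -> ".join(p))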

References

  1. Xenarios I, Fernandez E, Salwinski L, Duan XJ, Thompson MJ, Marcotte EM, Eisenberg D, DIP: the Database of Interacting Proteins: 2001 update. Nucleic Acids Res 2001 29(1): 239-241.
  2. Duan XJ, Xenarios I, and Eisenberg D, Describing Biological Protein Interactions in Terms of Protein States and State Transitions: the LiveDIP Database. Manuscript in submission.

75. Automatic Discovery of Sub-Molecular Sequence Domains in Multi-Aligned Sequences: A Dynamic Programming Algorithm for Multiple Alignment Segmentation

Eric Poe Xing1, Denise M. Wolf1, Inna Dubchak1, Sylvia Spengler1, Manfred Zorn1, Ilya Muchnik2, and Casimir Kulikowski2

1Center for Bioinformatics and Computational Genomics, NERSC, Lawrence Berkeley National Laboratory, Berkeley, CA 94720 U.S.A.
2Department of Computer Science and DIMACS, Rutgers University, Piscataway, NJ 08855 U.S.A.

mdzorn@lbl.gov

Automatic identification of sub-structures in multi-aligned sequences is of great importance for effective and objective structural/functional domain annotation, phylogenetic tree construction, and other molecular analyses. We present a segmentation algorithm that optimally partitions a given multi-alignment into a set of potentially biologically significant blocks, or segments. The algorithm applies dynamic programming and progressive optimization to the statistical profile of a multi-alignment in order to optimally demarcate relatively homogeneous sub-regions. Using this algorithm, a large multi-alignment of eukaryotic 16S rRNA was analyzed. Three types of sequence patterns were identified automatically and efficiently: shared conserved domains, shared variable motifs, and rare signature sequences. The results were consistent with the patterns identified through independent phylogenetic and structural approaches. This algorithm facilitates the automation of sequence-based molecular structural and evolutionary analyses through statistical modeling and high-performance computation.
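
The abstract does not specify the algorithm's scoring function, so the sketch below is only an assumption-laden illustration of the general idea: given a per-column conservation profile of a multi-alignment, dynamic programming finds the segmentation that minimizes within-segment variance plus a fixed per-segment penalty, yielding boundaries between relatively homogeneous blocks.

    import numpy as np

    def segment_profile(profile, penalty=0.1):
        """Optimal 1-D segmentation of a per-column conservation profile.

        Dynamic programming minimizes the total within-segment sum of squared
        deviations plus a fixed penalty per segment; returns (start, end) column
        ranges with end exclusive.
        """
        x = np.asarray(profile, dtype=float)
        n = len(x)
        # Prefix sums give each candidate segment's cost in O(1).
        s1 = np.concatenate(([0.0], np.cumsum(x)))
        s2 = np.concatenate(([0.0], np.cumsum(x * x)))

        def sse(i, j):                     # columns i .. j-1
            m = j - i
            return s2[j] - s2[i] - (s1[j] - s1[i]) ** 2 / m

        best = np.full(n + 1, np.inf)
        back = np.zeros(n + 1, dtype=int)
        best[0] = 0.0
        for j in range(1, n + 1):
            for i in range(j):
                cost = best[i] + sse(i, j) + penalty
                if cost < best[j]:
                    best[j], back[j] = cost, i
        # Trace back the optimal breakpoints.
        bounds, j = [], n
        while j > 0:
            bounds.append((back[j], j))
            j = back[j]
        return bounds[::-1]

    # Example: a conserved block flanked by two variable regions.
    profile = [0.2, 0.3, 0.25, 0.9, 0.95, 0.92, 0.88, 0.3, 0.2, 0.35]
    print(segment_profile(profile))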


76. THE RDP-II (Ribosomal Database Project)

James R. Cole, Timothy G. Lilburn, Paul R. Saxman, Bonnie L. Maidak, Charles T. Parker, Sunandana Chandra, Ryan J. Farris, George M. Garrity, Thomas M. Schmidt, and James M. Tiedje

Center for Microbial Ecology, Michigan State University, East Lansing, MI 48824

colej@msu.edu

The Ribosomal Database Project - II (RDP-II) provides data, tools, and services related to ribosomal RNA sequences to the research community. Through its website (http://rdp.cme.msu.edu/html/), RDP-II offers aligned and annotated rRNA sequence data, analysis services, and phylogenetic inferences (trees) derived from these data. RDP-II release 8.1 (May 21, 2001) contains 16,277 prokaryotic, 5,201 eukaryotic, and 1,503 mitochondrial small subunit rRNA sequences in aligned and annotated format. Annotation goals include up-to-date name, strain, and culture deposit information; sequence length and quality information; and references. To provide a phylogenetic context for the data, RDP-II makes available over 100 trees that span the phylogenetic breadth of life. Web-based research tools are provided for comparing a user-submitted sequence to the RDP-II database (Sequence Match), aligning a user sequence against the nearest RDP sequence (Sequence Align), examining probe and primer specificity (Probe Match), testing for chimeric sequences (Chimera Check), generating a similarity matrix (Distance Matrix), and analyzing T-RFLP data (T-RFLP and TAP-TRFLP); a Java-based phylogenetic tree browser (Sub Trees) is also available. Release 8.1 debuted an updated sequence search and selection tool (Hierarchy Browser) and a new phylogenetic tree building and visualization tool built around the PHYLIP phylogeny inference package (Phylip Interface). In addition, release 8.1 includes an interactive tutorial to guide users through the basics of rRNA sequence analysis; this tutorial is suitable both for researchers new to rRNA-based phylogenetic analysis and as a teaching module for upper-level undergraduate and graduate classes. An ongoing effort at the RDP-II is the improvement of alignments in view of recent research on the ribosome and recent improvements in secondary-structure-based alignment algorithms. RDP-II is also working on methods to incorporate higher-level taxonomic information in its data. We expect these efforts to result in more accurate and timely data and to increase the utility of RDP-II for the research community. The RDP-II email address for questions or comments is rdpstaff@msu.edu.
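
As a rough, hedged illustration of the kind of computation behind a distance-matrix service (the actual RDP-II tool likely applies evolutionary-model corrections not shown here), the fragment below computes uncorrected pairwise p-distances from a toy set of aligned rRNA fragments, skipping gapped columns. The sequences are invented for the example.

    def p_distance(a, b):
        """Uncorrected pairwise distance between two aligned sequences,
        counting only columns where both sequences have a base (no gaps)."""
        compared = mismatches = 0
        for x, y in zip(a.upper(), b.upper()):
            if x in "ACGU" and y in "ACGU":
                compared += 1
                mismatches += x != y
        return mismatches / compared if compared else float("nan")

    # Toy aligned rRNA fragments (illustrative only).
    aln = {
        "seqA": "GGAUC-ACGGUUA",
        "seqB": "GGAUCAACGGUCA",
        "seqC": "GGGUC-ACGAUUA",
    }
    names = sorted(aln)
    for i, n1 in enumerate(names):
        for n2 in names[i + 1:]:
            print(n1, n2, round(p_distance(aln[n1], aln[n2]), 3))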


77. A Random Walk Down the Genomes: a Case Study of DNA Evolution in VALIS

Yi Zhou1, Archisman Rudra2, Salvatore Paxia2, and Bud Mishra2

1Department of Biology, New York University
2New York University Courant Bioinformatics Group

yz237@nyu.edu

Modern biology is driven by large-scale processing of heterogeneous data, which may come from diverse sources: anything from a GenBank sequence to the result of a microarray experiment. The interfaces for accessing these different sources vary so widely that a biologist needs to be an expert in very different areas of computer science: databases, networking, languages, etc. Furthermore, the algorithms used to extract biologically significant information tend to be developed in an ad hoc manner. This leads to very little code sharing between data analysis algorithms, with a concomitant increase in code complexity.

Instead of developing each tool ab initio, our bioinformatics system VALIS defines low-level building blocks and uniform APIs that let one use these building blocks from high-level scripting languages. This enables biologists to write very simple scripts to perform fairly involved bioinformatics processing in a flexible fashion.

As an example, we use the VALIS system to investigate the consequences of various cellular events on genomic DNA sequence evolution. How genomes evolve is a very important problem in biology; answering it will lead to a better understanding of the mechanisms of cancer development and to more accurate analyses of phylogenetic data.

We approach the study of sequence evolution by looking at statistical properties of the DNA sequences. In particular, we measure the long-range correlation properties of DNA sequences. Our approach is to estimate a few of these statistical parameters in the hope of distinguishing between different models of DNA evolution in coding and non-coding regions.

In order to study the scale-invariant long-range correlation of DNA sequences, we view the DNA sequences as being generated from a random walk model. We first map the whole genomic DNA sequence following the purine-pyrimidine binary rule: purines (A/G) become +1 and pyrimidines (C/T) become -1. This creates a ‘DNA walk’ along the genome: the ‘DNA walker’ moves either up or down at every base pair according to the binary map of the DNA sequence. If there is no long-range correlation, the walk is a realization of a Brownian motion. Otherwise, we observe a ‘walker’ with long-term memory and thus a fractional Brownian motion. These two processes can be characterized by different values of the Hurst exponent (H): H = 0.5 for Brownian motion and H > 0.5 for fractional Brownian motion, i.e., higher H values suggest the presence of stronger long-range correlation. We use several different methods to estimate H, for example R/S analysis and detrended fluctuation analysis (DFA).
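
The following Python sketch implements the DNA-walk mapping and one of the estimators named above, rescaled-range (R/S) analysis. It is a minimal illustration rather than the authors' VALIS implementation; the window sizes and the random control sequence are arbitrary choices of the sketch.

    import numpy as np

    def dna_walk(seq):
        """Map purines (A/G) to +1 and pyrimidines (C/T/U) to -1."""
        return np.array([1 if b in "AGag" else -1 for b in seq if b.upper() in "ACGTU"])

    def hurst_rs(steps, window_sizes=(16, 32, 64, 128, 256)):
        """Estimate the Hurst exponent of a +/-1 step series by rescaled-range (R/S) analysis."""
        log_n, log_rs = [], []
        for n in window_sizes:
            rs_values = []
            for start in range(0, len(steps) - n + 1, n):
                block = steps[start:start + n]
                dev = np.cumsum(block - block.mean())   # mean-adjusted cumulative walk
                r = dev.max() - dev.min()               # range of the walk in this block
                s = block.std()                         # standard deviation of the steps
                if s > 0:
                    rs_values.append(r / s)
            if rs_values:
                log_n.append(np.log(n))
                log_rs.append(np.log(np.mean(rs_values)))
        # The slope of log(R/S) versus log(n) estimates H.
        return np.polyfit(log_n, log_rs, 1)[0]

    # Uncorrelated control: an i.i.d. random sequence should give H near 0.5
    # (R/S on short windows is known to bias the estimate slightly upward).
    rng = np.random.default_rng(0)
    random_seq = "".join(rng.choice(list("ACGT"), size=20000))
    print("H (random sequence) ~", round(hurst_rs(dna_walk(random_seq)), 2))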

We have analyzed various genomes using VALIS: bacterial, invertebrate, and vertebrate. We observe a consistent difference in H between the coding and the non-coding regions: the H values tend to be higher in the non-coding regions than in the coding regions. Thus, the DNA walk down bacterial coding-region sequences behaves as a Brownian motion (H ~ 0.5), while it acts as a fractional Brownian motion in the non-coding regions (H > 0.5). For other organisms, such as yeast, the difference persists: yeast has H ~ 0.54 in the coding regions versus H ~ 0.61 in the non-coding regions. The higher H values in non-coding regions indicate that the sequences in the non-coding regions possess much stronger long-range correlation than those in the coding regions. In addition, the H values in both types of regions increase with the evolutionary position of the corresponding organism. This suggests that some cellular events tend to make DNA sequences more correlated as evolution proceeds.

Based on our observations, we hypothesize that the differences in the strength of long-range correlation in DNA sequences are caused by the counteraction of two sets of biological events. One set includes insertion and deletion events caused by DNA polymerase stuttering and by transposons, which tend to increase long-range correlation. The other set includes natural selection and DNA repair mechanisms, which tend to eliminate the long-range correlation caused by the former events. The coding regions, however, are under higher natural selection pressure and possess the transcription-coupled DNA repair mechanism that is unique to them; the stronger correlation-eliminating forces in the coding regions can therefore explain why the long-range correlation observed there is weaker than in the non-coding regions. In turn, the greater flexibility offered by the larger genome sizes of higher organisms allows long-range correlation to increase along the evolutionary tree.

To test our hypothesis, we designed a ‘Genome Grammar’. This is a stochastic grammar with primitives for many kinds of mathematical probability distributions; we can even generate a sequence with the same probability distribution as measured from biological data. Furthermore, there are tools that let one apply hypothesized processes to sequences obtained from the grammar. This enables biologists to apply any model and conduct evolutionary experiments ‘in silico’.
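
As a minimal stand-in for this idea (the actual Genome Grammar primitives are not described in detail here), the sketch below estimates first-order Markov transition probabilities from a template sequence, generates a synthetic sequence with matching dinucleotide statistics, and applies a toy tandem-duplication "event" of the kind hypothesized to increase long-range correlation. All parameters and sequences are illustrative assumptions of the sketch.

    import random
    from collections import defaultdict

    def markov_model(seq):
        """First-order Markov transition probabilities estimated from a DNA sequence."""
        counts = defaultdict(lambda: defaultdict(int))
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
        return {a: {b: c / sum(nxt.values()) for b, c in nxt.items()}
                for a, nxt in counts.items()}

    def generate(model, length, start="A", seed=1):
        """Generate a sequence whose dinucleotide statistics match the model."""
        rng = random.Random(seed)
        out = [start]
        for _ in range(length - 1):
            probs = model[out[-1]]
            out.append(rng.choices(list(probs), weights=probs.values())[0])
        return "".join(out)

    def duplicate_segment(seq, max_len=50, seed=2):
        """Toy 'evolutionary event': tandemly duplicate a random segment."""
        rng = random.Random(seed)
        i = rng.randrange(len(seq) - max_len)
        j = i + rng.randrange(2, max_len)
        return seq[:j] + seq[i:j] + seq[j:]

    template = "ACGTGGCATTACGGATCCGTA" * 200
    model = markov_model(template)
    synthetic = generate(model, 5000)
    evolved = duplicate_segment(synthetic)
    print(len(synthetic), len(evolved))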

Our observations also have potential significance for biotechnology applications. Taking advantage of the highly efficient statistical algorithms in VALIS, the discovery of statistical differences between DNA coding and non-coding regions may lead to in vitro biochemical technologies that can efficiently detect coding and non-coding regions without the effort of DNA sequencing.


78. A Graph Data Model to Unify Biological Data

Frank Olken

Lawrence Berkeley National Laboratory

Olken@lbl.gov

Federated (or mediated) database systems typically require that we describe each participating database’s schema in a common data model, e.g., relational. Such a common data model facilitates the construction of queries which span the various databases (and types of data).

In this work, we suggest that a graph data model, i.e., labeled graphs, either directed or undirected, could better serve as this common data model for biological data. We note the ubiquity of graphs in biological datasets: taxonomies, phylogenetic trees, metabolic networks, signaling networks, genetic regulatory networks, chemical structure graphs, contact graphs, partial orders in genetic mapping, overlap graphs in physical mapping and shotgun sequence assembly, DNA sequences as linear graphs, etc. Finally, we review graph data modeling and query language efforts in the database community and point out open problems in graph data modeling, query language design, and query complexity for data management in biology.
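
A minimal sketch of what such a labeled-graph data model might look like in code is given below; the class, attribute names, and example data are assumptions chosen only to show how taxonomic and metabolic facts fit the same structure, not a proposal for an actual graph query language.

    from collections import defaultdict

    class LabeledGraph:
        """A minimal directed, labeled graph: nodes and edges both carry attribute dicts."""
        def __init__(self):
            self.nodes = {}                    # node id -> attribute dict
            self.edges = defaultdict(list)     # source id -> [(target id, attribute dict)]

        def add_node(self, node_id, **attrs):
            self.nodes[node_id] = attrs

        def add_edge(self, src, dst, **attrs):
            self.edges[src].append((dst, attrs))

        def neighbors(self, src, edge_type=None):
            """Targets of edges leaving src, optionally restricted to one edge label."""
            return [dst for dst, attrs in self.edges[src]
                    if edge_type is None or attrs.get("type") == edge_type]

    # The same model holds very different kinds of biological data.
    g = LabeledGraph()
    g.add_node("E. coli", kind="taxon")
    g.add_node("Gammaproteobacteria", kind="taxon")
    g.add_node("glucose", kind="metabolite")
    g.add_node("glucose-6-phosphate", kind="metabolite")
    g.add_edge("E. coli", "Gammaproteobacteria", type="is_a")            # taxonomy
    g.add_edge("glucose", "glucose-6-phosphate", type="converted_to")    # metabolic step
    print(g.neighbors("glucose", edge_type="converted_to"))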


79. Protein Data Bank: Unifying the Archive

Gary Gilliland and The PDB Team

Research Collaboratory for Structural Bioinformatics Department of Chemistry and Chemical Biology, Rutgers, The State University of New Jersey, 610 Taylor Road, Piscataway, NJ 08854-8087; National Institute of Standards and Technology, Biotechnology Division and Informatics, Data Center, 100 Bureau Drive, Gaithersburg, MD 20899-8314; and San Diego Supercomputer Center, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0537

gary.gilliland@nist.gov

The Protein Data Bank (PDB; http://www.rcsb.org/pdb/) is the single worldwide archive of structural data of biological macromolecules. All data in the archive have been validated, and a uniform archive has been released for the community. A collection of mmCIF data files for the PDB archive has been made available at ftp://beta.rcsb.org/pub/pdb/uniformity/data/mmCIF/.

A utility application that converts the mmCIF data files to the PDB format has also been released to provide support for existing software.

The Protein Data Bank is operated by the Research Collaboratory for Structural Bioinformatics (RCSB) and is supported by funds from the National Science Foundation, the Department of Energy, and two units of the National Institutes of Health: the National Institute of General Medical Sciences and the National Library of Medicine.


80. Protein Structure Predictions by PROSPECT

Dong Xu, Dongsup Kim, Christal Secrest, Victor Olman, and Ying Xu

Protein Informatics Group, Life Sciences Division, Oak Ridge National Laboratory

xyn@ornl.gov

Protein threading represents one of the key computational techniques for protein fold recognition and protein backbone structure prediction. We have previously developed a software package, PROSPECT, that uses protein threading as its core technology. PROSPECT employs a divide-and-conquer algorithm for finding the globally optimal alignment between a query sequence and a template structure. Significant improvements and additions have been made to the PROSPECT system in the past year, which can be summarized as follows.

  1. We have developed a capability for assessing the reliability of PROSPECT’s fold recognition predictions. It is well known that there is not yet a theoretically sound way to normalize threading scores across queries and templates with different sequence lengths, amino acid compositions, and geometric and physical features, which makes threading scores difficult to assess. By threading each template structure in our template database (over 2000 structures) against all sequences in the FSSP database, we obtain a threading score distribution for that template against all FSSP proteins. We have trained a neural network to map each query-template threading score, along with the template’s score distribution and various other compositional, geometric, and physical parameters of the template-query pair, to a real value in [0,1] that reflects the percentage of structurally alignable positions between the query and the template (1 indicating that the two structures are 100% alignable, 0 indicating no significant structural alignment); a minimal sketch of this kind of mapping follows this list. This mapping provides a highly useful measure of confidence in PROSPECT’s predictions.
  2. We have improved PROSPECT’s threading energy function by including family-specific profiles and profile-profile alignments, and by re-parameterizing our current energy terms through better statistical treatment of the structure database information and the use of a significantly larger data set. This has resulted in a 10+% increase in PROSPECT’s prediction accuracy on a large test set, compared to the previous version of PROSPECT.
  3. We have developed a capability for automatically decomposing a solved protein structure into protein domains. This is an essential step in automatically updating our structure template database. This unique capability solves the domain decomposition problem as a maximum flow problem and could potentially be used by large protein-structure repositories such as the PDB for automatic updates of protein domain definitions.
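
The sketch below illustrates the kind of regression described in item 1, using synthetic features and scikit-learn's MLPRegressor; the real feature set, network architecture, and training data used for PROSPECT are not reproduced here, so every variable in the example is an assumption of the sketch.

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)
    n = 500

    # Synthetic stand-ins for per-pair features: raw threading score, z-score of
    # that score against the template's score distribution, query/template length
    # ratio, and a composition-difference measure.
    raw_score = rng.normal(0.0, 1.0, n)
    z_score   = raw_score + rng.normal(0.0, 0.3, n)
    len_ratio = rng.uniform(0.5, 2.0, n)
    comp_diff = rng.uniform(0.0, 1.0, n)
    X = np.column_stack([raw_score, z_score, len_ratio, comp_diff])

    # Synthetic target: fraction of structurally alignable positions in [0, 1].
    frac_alignable = 1.0 / (1.0 + np.exp(-(1.5 * z_score - comp_diff)))

    model = MLPRegressor(hidden_layer_sizes=(10,), max_iter=2000, random_state=0)
    model.fit(X[:400], frac_alignable[:400])
    pred = np.clip(model.predict(X[400:]), 0.0, 1.0)   # keep outputs in [0, 1]
    print("mean absolute error:", round(np.abs(pred - frac_alignable[400:]).mean(), 3))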

We have recently applied PROSPECT to a number of hypothetical proteins of Shewanella oneidensis MR-1 that were identified as related to metal reduction through microarray gene expression experiments and data analysis by J. Zhou’s lab at ORNL. Detailed structures of four such proteins will be presented.

References

  1. Y. Xu and D. Xu, “Protein Threading using PROSPECT: design and evaluation”, Proteins: Structure, Function, Genetics, 40:343-354, 2000.
  2. D. Xu, K. Baburaj, C. B. Peterson, and Y. Xu, “A Model for the Three Dimensional Structure of Vitronectin: Predictions for the Multi-Domain Protein from Threading and Docking”, Proteins: Structure, Function, Genetics, 44:312-320, 2001.
  3. D. Xu, O. Crawford, P. LoCascio, and Y. Xu, “Application of PROSPECT in CASP4: characterizing protein structures with new folds”, Proteins: Structure, Function, Genetics special issue on CASP4 (by invitation), 2001 (in press).
  4. Y. Xu, D. Xu and V. Olman, “A practical method for interpretation of threading scores: an application of neural networks”, Statistica Sinica Special issue on Bioinformatics, 2001 (in press).
  5. D. Xu and Y. Xu, “Computational Studies of Protein Structure and Function using Threading Program PROSPECT”, In Protein Structure Prediction: Bioinformatic Approach (eds by Igor Tsigelny), International University Line (IUL) Publishers (by invitation), 2001 (in press).
  6. Y. Xu, D. Xu, and H. N. Gabow, “Protein Domain Decomposition using a Graph-Theoretic Approach”, Bioinformatics, 16 (12), 1091 - 1104, 2000.

81. Protein Fold-Recognition Using HMMs and Secondary Structure Prediction

Kevin Karplus

University of California, Santa Cruz

karplus@soe.ucsc.edu

The protein-folding problem, in its purest form, is too difficult for us to solve in the next several years, but we need structure predictions now. One solution is to recognize the similarity between a target protein and one of the thousands of proteins whose structures have been determined experimentally. For very similar proteins, the relationships are easy to find, and good models can be built by copying the backbone (and even some sidechains) from the homologous protein of known structure. For less similar proteins (in the “twilight zone”), the fold-recognition problem is more challenging, but it is often possible to find useful similarities.

Using evolutionary information helps enormously in recognizing remote relationships, and one convenient way to summarize a family of homologs is with a hidden Markov model (HMM). Homologs can be found and an HMM built by an iterated search, starting from a single target sequence. The resulting target HMM can be used to score the sequences of all proteins of known structure.

Similarly, homologs can be found and HMMs built for template proteins of known structure and used to score the target sequence. Combining both target-model and template-library results reduces the false positive rate.
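
A full profile HMM is beyond a short example, but the fragment below conveys the flavor of profile-based scoring under strong simplifying assumptions: it builds per-column log-odds scores from a toy ungapped alignment of homologs and scores candidate sequences against them. It models no insert or delete states and is not the HMM machinery actually used in this work; the alignment and sequences are invented.

    import math
    from collections import Counter

    AMINO = "ACDEFGHIKLMNPQRSTVWY"

    def build_profile(alignment, pseudocount=1.0):
        """Per-column log-odds scores from an ungapped multiple alignment
        (a crude stand-in for a profile HMM; no insert/delete states)."""
        length = len(alignment[0])
        background = 1.0 / len(AMINO)
        profile = []
        for col in range(length):
            counts = Counter(seq[col] for seq in alignment)
            total = len(alignment) + pseudocount * len(AMINO)
            profile.append({aa: math.log(((counts[aa] + pseudocount) / total) / background)
                            for aa in AMINO})
        return profile

    def score(profile, seq):
        """Sum of per-position log-odds scores for an ungapped candidate sequence."""
        return sum(col[aa] for col, aa in zip(profile, seq))

    homologs = ["MKVLAT", "MKILAT", "MRVLAS", "MKVLGT"]      # toy target family
    profile = build_profile(homologs)
    for candidate in ["MKVLAT", "MKILGS", "GGGGGG"]:
        print(candidate, round(score(profile, candidate), 2))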

Some further improvements can be made by predicting local structural properties of the target sequence (such as secondary structure or solvent accessibility) and adding these predictions to the HMM used to score the template sequences.

Fold-recognition techniques based on these HMMs have performed quite well in blind prediction experiments (CASP2, CASP3, and CASP4) and are doing better than threading techniques based on pairwise potentials.


82. Protein Engineering in Structural Genomics

Patrice Koehl and Michael Levitt

Department of Structural Biology, Fairchild Bldg, Stanford University, Stanford, CA 94305-5126

koehl@csb.stanford.edu

The emphasis of our project is placed on the design of novel proteins that may serve as catalysts needed for bioremediation. Our approach can be divided into two steps: identify target protein structures that would provide the desired functions, and search for sequences that make these protein structures both stable and unique. Our efforts have focused on the latter, namely on automated protein sequence design. We have made significant progress in characterizing the sequence space compatible with a protein structure, and have shown that this information can prove valuable for protein structure prediction:

(1) Measuring the size of the sequence space compatible with a protein structure

It is well known that certain structures are more commonly observed among proteins than others. Highly designable structures are more likely to have been found through the process of evolution, since they are more robust to random mutations. We have developed a new approach to explore and quantify the sequence space associated with a given protein structure. We have shown that our measure of the protein sequence space compatible with a given fold correlates with the usage of the fold observed among naturally occurring sequences. Our results also suggest that the designability of a protein (i.e. the number of sequences possessing the structure of interest as their non-degenerate energy ground state) can be derived from the knowledge of its topology alone. As a consequence, we anticipate that our method for sequence space exploration will prove useful for identifying highly designable folds, which will represent attractive targets for protein design.                                                 
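
The abstract does not give the measure itself, so the following is only an illustrative stand-in for "measuring the size of a sequence space": under the simplifying assumption that positions are independent, the effective number of sequences compatible with a design profile can be taken as the exponential of the total per-position Shannon entropy. The profile values below are invented for the example.

    import math

    def effective_sequence_count(position_probs):
        """Effective number of sequences compatible with a design profile,
        computed as exp(sum of per-position Shannon entropies), assuming
        independence across positions."""
        total_entropy = 0.0
        for probs in position_probs:
            total_entropy += -sum(p * math.log(p) for p in probs.values() if p > 0)
        return math.exp(total_entropy)

    # Toy design profile for a 4-residue stretch: buried positions are more
    # restricted (lower entropy) than exposed ones.
    profile = [
        {"L": 0.7, "I": 0.2, "V": 0.1},                    # buried, tightly constrained
        {"A": 0.5, "G": 0.3, "S": 0.2},                    # semi-constrained
        {"D": 0.25, "E": 0.25, "K": 0.25, "N": 0.25},      # exposed, nearly free
        {"L": 0.9, "M": 0.1},                              # buried, highly constrained
    ]
    print("effective number of sequences:", round(effective_sequence_count(profile), 1))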

(2) Application of protein sequence design to protein structure prediction

The goal of the inverse protein folding problem is to identify amino acid sequences that stabilize a given target protein conformation. Methods that attempt to solve this problem have proved useful for protein sequence design. We have shown that the same methods can provide valuable information for protein fold recognition and for ab initio protein structure prediction. We also derived a new measure of the compatibility of a test sequence with a target model structure, based on computational protein design. The protein structure is used as input to design a family of low free energy sequences, and these sequences are compared to the test sequence, using a metric in sequence space based on nearest neighbor connectivity. We have found that this new measure is powerful enough to recognize near-native protein structures among non-native models.


83. Classifying G-Protein Coupled Receptors with Support Vector Machines

Rachel Karchin1, Kevin Karplus2, and David Haussler3

1University of California, Santa Cruz, Computer Science
2University of California, Santa Cruz, Computer Engineering
3Howard Hughes Medical Institute

rachelk@soe.ucsc.edu

The enormous amount of protein sequence data uncovered by genome research has increased the demand for computer software that can automate the recognition of new proteins. We discuss the relative merits of various automated methods for recognizing G-protein coupled receptors (GPCRs), a superfamily of cell membrane proteins. GPCRs are found in a wide range of organisms and are central to a cellular signaling network that regulates many basic physiological processes. They are the focus of a significant amount of current pharmaceutical research because they play a key role in many diseases. However, their tertiary structures remain largely unsolved. The methods described in this paper use only primary sequence information to make their predictions. We compare a simple nearest neighbor approach (BLAST), methods based on multiple alignments generated by a statistical profile hidden Markov model, and methods, including support vector machines, that transform protein sequences into fixed-length feature vectors. The last is the most computationally expensive method, but our experiments show that, for those interested in annotation-quality classification, the results are worth the effort. In two-fold cross-validation experiments testing recognition of GPCR subfamilies that bind a specific ligand (such as a histamine molecule), the errors per sequence at the minimum error point (MEP) were 13.7% for multi-class SVMs, 17.1% for our SVMtree method of hierarchical multi-class SVM classification, 25.5% for BLAST, 30% for profile HMMs, and 49% for classification based on nearest neighbor feature vector (kernNN). The percentage of true positives recognized before the first false positive was 65% for both SVM methods, 13% for BLAST, 5% for profile HMMs and 4% for kernNN.
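
The sketch below mirrors only the outline of such an experiment, under stated assumptions: toy sequences for two hypothetical subfamilies, amino-acid-composition feature vectors, an RBF-kernel SVM, and two-fold cross-validation via scikit-learn. The actual feature vectors, kernels, and GPCR data used in the study are not reproduced.

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    AMINO = "ACDEFGHIKLMNPQRSTVWY"

    def composition_vector(seq):
        """Fixed-length feature vector: amino-acid composition of the sequence."""
        seq = seq.upper()
        return np.array([seq.count(aa) / len(seq) for aa in AMINO])

    # Toy sequences for two hypothetical subfamilies (real GPCR data not included).
    rng = np.random.default_rng(0)
    def toy_seq(bias, n=80):
        weights = np.ones(20)
        weights[AMINO.index(bias)] = 6.0        # subfamily-specific compositional bias
        return "".join(rng.choice(list(AMINO), size=n, p=weights / weights.sum()))

    sequences = [toy_seq("H") for _ in range(30)] + [toy_seq("W") for _ in range(30)]
    labels = [0] * 30 + [1] * 30

    X = np.array([composition_vector(s) for s in sequences])
    clf = SVC(kernel="rbf", gamma="scale", C=1.0)
    print("2-fold CV accuracy:", cross_val_score(clf, X, labels, cv=2).mean())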

We have set up a web server for GPCR subfamily classification based on hierarchical multi-class SVMs at http://www.soe.ucsc.edu/research/compbio/gpcr-subclass. By scanning predicted peptides found in the human genome with the SVMtree server, we have identified a large number of genes that may encode GPCRs.

A list of our predictions for human GPCRs is available at http://www.soe.ucsc.edu/research/compbio/gpcr_hg/class_results. We also provide suggested subfamily classification for 18 sequences previously identified as unclassified Class A (rhodopsin-like) GPCRs in GPCRDB, available at http://www.soe.ucsc.edu/research/compbio/gpcr/classA_unclassified/.


84. Protein Structure Determination Through Combining Protein Threading and Sparse NMR Data

Ying Xu, Dong Xu, Dongsup Kim, and Oakley Crawford

Protein Informatics Group, Life Sciences Division, Oak Ridge National Laboratory

xyn@ornl.gov

Protein structural information derived from protein threading and from NMR experiments can complement each other; fully utilizing the information from the two sources could lead to solutions of protein structures that neither one alone can solve. When applicable, protein threading can provide protein backbone structures with reasonable accuracy (e.g., 4-6 angstroms), and it is estimated that threading methods could potentially be applicable to 60-70% of all soluble proteins. NMR experiments typically apply to small proteins (< 30 kDa); as target proteins become larger, the fraction of assignable NMR peaks drops significantly, which leaves an insufficient number of NMR restraints for accurate structure solution. We have developed a computational capability for protein structure solution that combines sparse NMR data and protein threading. It consists of two components: (a) protein fold recognition and backbone prediction by NMR data-constrained threading, and (b) NMR structure calculation by molecular dynamics and energy minimization, using the threaded structure as the starting point. We have demonstrated that a small number of NMR restraints can significantly improve the prediction accuracy of threading, and that when starting from a predicted backbone structure (with accuracy of about 4 angstroms), NMR structure calculation requires only a small fraction of the NMR restraints typically needed to reach a given level of accuracy. To make this computational capability practically useful, a capability for assigning (sparse) backbone NMR peaks is essential. We have therefore developed a computational framework for assigning backbone NMR peaks, which models peak assignment as a constrained bipartite matching problem. While a rigorous solution is highly challenging (we proved the problem is NP-hard), we have developed a rigorous and reasonably efficient algorithm by taking advantage of the discerning power of our assignment function. This framework is the first rigorous formulation capable of incorporating all the relevant information involved in peak assignment. Our preliminary assignment results are highly encouraging, and collaborations with NMR labs are currently under way to solve a number of large proteins using this computational framework.
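
The constrained matching used in this work incorporates sequential-connectivity constraints that a short example cannot capture; the sketch below shows only the unconstrained core, assigning observed peaks to residues by minimizing a weighted chemical-shift disagreement with the Hungarian algorithm (scipy's linear_sum_assignment). All shift values and weights are invented for illustration.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    # Predicted backbone 15N/1H shifts for residues (rows) and observed peaks (rows).
    # Values are illustrative only.
    predicted = np.array([[118.2, 8.1], [121.5, 7.9], [115.0, 8.6], [124.3, 8.3]])
    observed  = np.array([[121.7, 7.8], [118.0, 8.2], [124.0, 8.4], [115.2, 8.5]])

    # Cost matrix: weighted chemical-shift distance between every residue/peak pair.
    weights = np.array([1.0, 5.0])      # 1H shifts get more weight (narrower range)
    diff = (predicted[:, None, :] - observed[None, :, :]) * weights
    cost = np.sqrt((diff ** 2).sum(axis=2))

    residues, peaks = linear_sum_assignment(cost)   # minimum-cost bipartite matching
    for r, p in zip(residues, peaks):
        print(f"residue {r} <- peak {p}  (cost {cost[r, p]:.2f})")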

References:

  1. Y. Xu, D. Xu, D. Kim, V. Olman, J. Razumovskaya, and T. Jiang, “Automated Assignment of Backbone NMR Peaks using Constrained Bipartite Matching”, IEEE Computing in Science and Engineering special issue on bioinformatics, 2001 (in press).
  2. Y. Xu and D. Xu, “Protein Structure Prediction by Protein Threading and Partial Experimental Data”, in Current Topics in Computational Molecular Biology (Eds, Jiang, Xu, Zhang), MIT Press, 2001 (in press).
  3. Y. Xu, D. Xu, O. Crawford, J. R. Einstein, “A computational method for NMR-constrained protein threading”, Journal of Computational Biology, 7:449 - 467, 2000.
  4. G. Lin, Z. Chen, T. Jiang, J. Wen, J. Xu, Y. Xu, “Approximation Algorithms for NMR Spectral Peak Assignment”, 2001 (submitted).

85. GAP: Genomics Annotation Platform

Konstantin M. Skorodumov1, Evgeny Raush1, Maxim Totrov1, Ruben Abagyan2, and Matthieu Schapira1

1Molsoft LLC
2The Scripps Research Institute

matthieu@molsoft.com

The challenge of functional genomics—assigning functions to sequenced genes—is critical for the rapid evolution of modern medicine. Computational approaches can dramatically accelerate annotation efforts by producing predicted functions that can then be rapidly confirmed. Molsoft is building a genomics annotation platform, GAP, based on in-house computational biology and intranet software. Both comparative genomics and structural genomics tools are being implemented and will allow rapid identification of predicted protein functions and protein-protein interactions. The system relies on a relational database infrastructure and provides an online graphical user interface.


86. Genome to Proteome and Back Again: ProteomeWeb

Carol S. Giometti1, Sandra L. Tollaksen1, Gyorgy Babnigg1, Tripti Khare1, Claudia I. Reich2, Gary J. Olsen2, John R. Yates III3, Jizhong Zhou4, Ken Nealson5, and Derek Lovley5

1Argonne National Laboratory, Argonne, IL
2University of Illinois, Urbana-Champaign, IL
3The Scripps Institute, La Jolla, CA
4Oak Ridge National Laboratory, Oak Ridge, TN
5University of Massachusetts, Amherst, MA

csgiometti@anl.gov

Complete genome sequences give rise to open reading frame (ORF) databases that can be translated into the theoretical amino acid sequences of all proteins predicted to be encoded. In the context of proteome analysis, these ORF databases are the foundation of protein identifications. Proteins from complex mixtures are digested using site-specific proteases and the masses of the peptides are compared with the hypothetical peptide masses from the protein sequences predicted by the ORFs. As part of the Department of Energy Microbial Genome Program, we are using two-dimensional gel electrophoresis coupled with tandem mass spectrometry to identify and quantify the proteins expressed by a variety of energy- and bioremediation-related microbes, including Methanococcus jannaschii, Shewanella oneidensis, and Geobacter sulfurreducens. Sufficient data has been assimilated to allow comparisons among microbial proteomes in the context of constitutive protein expression and the modulation of protein expression in response to specific environmental conditions. Data acquisition and management are handled by using the Oracle relational database software together with a World Wide Web interface. A public web site (ProteomeWeb; http://proteomes.pex.anl.gov) with customized tools is available to enable users to query the protein identifications and experimental data. This web site includes links to genome and protein sequence databases as well as metabolic pathway databases, providing an integrated environment for the interpretation of the proteome results. The proteome studies are adding value to existing genome sequence information, providing data on the relative abundance of different ORF products, the conditions of their expression, and post-translational processing. However, mechanisms for genome databases to access and utilize these proteome data still need to be developed.
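
As a hedged illustration of the peptide-mass comparison described above (the production pipeline uses tandem mass spectrometry and dedicated search software not reproduced here), the sketch below performs an in silico tryptic digest of a hypothetical ORF translation and computes monoisotopic peptide masses from standard residue masses.

    # Monoisotopic residue masses (Da); peptide mass = sum of residues + one water.
    RESIDUE_MASS = {
        "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
        "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
        "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
        "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
    }
    WATER = 18.01056

    def tryptic_digest(protein):
        """In silico trypsin digest: cleave after K or R unless followed by P."""
        peptides, start = [], 0
        for i, aa in enumerate(protein):
            if aa in "KR" and (i + 1 == len(protein) or protein[i + 1] != "P"):
                peptides.append(protein[start:i + 1])
                start = i + 1
        if start < len(protein):
            peptides.append(protein[start:])
        return peptides

    def peptide_mass(peptide):
        return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER

    orf_product = "MKWVTFISLLLLFSSAYSRGVFRRDTHK"       # illustrative ORF translation
    for pep in tryptic_digest(orf_product):
        print(f"{pep:<20} {peptide_mass(pep):9.4f}")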


87. Computational Experiments on RNA Phylogeny

Frank Olken1, James R. Cole2, Gary J. Olsen3, Craig A. Stewart4, David Hart4, Donald K. Berry4, and Sylvia J. Spengler1

1Lawrence Berkeley National Laboratory
2Michigan State University
3University of Illinois, Urbana Champaign
4Indiana University

olken@lbl.gov

This work describes computational experiments aimed at automated construction of high-quality phylogenetic trees from RNA sequences taken from the Ribosomal Database Project (RDP). Due to computational constraints and limitations of earlier multiple sequence alignment codes, current production operations at the Ribosomal Database Project use hand-tuned multiple sequence alignments and trees constructed with the WEIGHBOR neighbor-joining code (by W. J. Bruno, N. D. Socci, and A. L. Halpern). Here we describe efforts to construct maximum likelihood phylogenetic trees from the RDP data set, using RNACAD (by M. Brown) to construct the multiple sequence alignments and a parallel version of the fastDNAml code (by G. Olsen; parallelization by C. Stewart, D. Hart, and D. Berry) to construct the maximum likelihood phylogenetic trees.


88. Identifying Transcription Factor Binding Sites by Cross-Species Comparison

Lee Ann McCue, William Thompson, C. Steven Carmack, and Charles E. Lawrence

The Wadsworth Center, New York State Department of Health, Albany, NY

mccue@wadsworth.org

We have developed a phylogenetic footprinting method with the goal of identifying complete sets of transcription factor (TF) binding sites in bacterial genomes. This method employs an extended Gibbs sampling algorithm to identify sites by cross-species comparison. The Escherichia coli genome sequence and a database of experimentally verified regulatory sites were used to test this method. Using these data and the genome sequence data from nine additional gamma-proteobacterial species, we have evaluated our ability to predict TF binding sites and addressed the questions of which species are most useful and how many genomes are sufficient for comparison with respect to phylogenetic footprinting. In a study set of 166 E. coli genes with experimentally identified TF binding sites upstream of the ORF, we found that orthologous promoter data from just 3 additional species were sufficient for ~80% of predicted sites to correspond to experimentally reported sites. The species characteristics that most influenced our results were phylogenetic distance, genome size, and natural habitat. We performed simulations using randomized data to determine the critical values for statistical significance of our predictions (p = 0.05), and we found that the inclusion of a very closely related species (Salmonella typhi) was beneficial despite substantially increasing the critical value. We are applying this technology to the genomes of microbes that are of environmental interest and for which there is little knowledge of transcription regulation. Preliminary results for Synechocystis PCC6803 will be presented.
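
The extended Gibbs sampler used in this work adds cross-species weighting and other refinements that a short sketch cannot capture; the fragment below is only the classical site-sampler core, with one motif occurrence assumed per sequence and invented promoter sequences, to show the style of inference: repeatedly hold one sequence out, build a motif model from the remaining sites, and resample the held-out site in proportion to its likelihood ratio.

    import random

    BASES = "ACGT"

    def gibbs_motif_sampler(seqs, width, iterations=2000, pseudo=0.5, seed=0):
        """Classical Gibbs site sampler: one motif occurrence per sequence."""
        rng = random.Random(seed)
        positions = [rng.randrange(len(s) - width + 1) for s in seqs]

        def profile(excluded):
            """Position weight matrix from the current sites, leaving one sequence out."""
            counts = [{b: pseudo for b in BASES} for _ in range(width)]
            for k, s in enumerate(seqs):
                if k == excluded:
                    continue
                for j in range(width):
                    counts[j][s[positions[k] + j]] += 1
            return [{b: c[b] / sum(c.values()) for b in BASES} for c in counts]

        for _ in range(iterations):
            i = rng.randrange(len(seqs))          # hold one sequence out
            pwm = profile(i)
            s = seqs[i]
            # Background frequencies estimated from the held-out sequence itself.
            bg = {b: (s.count(b) + pseudo) / (len(s) + 4 * pseudo) for b in BASES}
            weights = []
            for start in range(len(s) - width + 1):
                w = 1.0
                for j in range(width):
                    w *= pwm[j][s[start + j]] / bg[s[start + j]]
                weights.append(w)
            positions[i] = rng.choices(range(len(weights)), weights=weights)[0]
        return positions

    # Toy promoters, each containing the planted site "TTGACA" (illustrative only).
    promoters = [
        "ACGTACTTGACAGGCTAGCTA",
        "GGCTTGACATTACGATCGAGG",
        "TATATAGCGCTTGACAGCATC",
        "CCGATTGACAGTACGTACGTA",
    ]
    sites = gibbs_motif_sampler(promoters, width=6)
    for s, p in zip(promoters, sites):
        print(p, s[p:p + 6])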


89. VISTA: Integrated Tool for Comparative Genomics

I. Dubchak1, Lior Pachter2, A. Poliakov1, I. Ovcharenko1, and E. Rubin1

1Lawrence Berkeley National Laboratory
2University of California Berkeley

ildubchak@lbl.gov

One of the more powerful approaches available for identifying functional regions in genomic DNA (both genes and surrounding regulatory elements) is comparative sequence analysis. Such comparisons have proved especially efficient in finding and analyzing conserved non-coding elements that potentially play a role in gene regulation. The deluge of genomic sequence rapidly appearing in databases is creating the need for faster and more robust programs for analyzing the data. For example, the practice of aligning single genes only a few kilobases long has been replaced by the need to align hundreds of kilobases of BACs or even entire genomes. The algorithmic challenges posed by these large datasets have been accompanied by user interface challenges, such as how to visualize information related to enormous datasets and how to enable users to interact with the data and the processing programs.

We have developed an integrated tool incorporating a novel alignment program for large DNA sequences, AVID (Bray and Pachter, in preparation), and an associated visualization tool, VISTA (Mayor et al., 2000), which together serve as a platform for large-scale comparative analysis of genomic sequences. The visual output is clean and simple, allowing the user to easily identify conserved regions. Similarity scores are displayed for the entire sequence, helping in the identification of shorter conserved regions or regions with gaps.
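
The quantity such plots are built from can be illustrated with a toy pairwise alignment: percent identity in a sliding window, with windows above a cutoff flagged as candidate conserved regions. The window length, step, cutoff, and sequences below are assumptions of the sketch, not VISTA's actual parameters or output.

    def windowed_identity(aln1, aln2, window=20, step=5):
        """Percent identity of a pairwise alignment in sliding windows."""
        assert len(aln1) == len(aln2)
        scores = []
        for start in range(0, len(aln1) - window + 1, step):
            a, b = aln1[start:start + window], aln2[start:start + window]
            matches = sum(x == y and x != "-" for x, y in zip(a, b))
            scores.append((start, 100.0 * matches / window))
        return scores

    def conserved_regions(scores, window=20, cutoff=75.0):
        """Windows exceeding the conservation cutoff (candidate conserved elements)."""
        return [(start, start + window) for start, pct in scores if pct >= cutoff]

    # Toy human/mouse alignment fragment (illustrative sequences of equal length).
    human = "ACGTTGCA--GGCTAACGTTAGCCATTGCAACGGTTAACCGGTAGCTAGCTAACGGTACCTA"
    mouse = "ACGTTGCATTGGCTAACGTTAGCCATTGCATCGGTTGACCGTTAGCTGGCTAACGGTACCTA"
    scores = windowed_identity(human, mouse)
    print(conserved_regions(scores))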

There are various modifications of VISTA for solving particular biological problems: cVISTA (complementary VISTA) is used to examine differences between recently diverged species, such as mouse and rat or human and chimpanzee; multiVISTA allows several related alignments to be visualized on the same scale. When orthologous sequences from three species are available, we can also apply a statistical method for calculating cutoffs to define noncoding sequences that are conserved because of functional constraints (Dubchak et al., 2000). VISTA has been implemented as a platform-independent stand-alone software package written in Java and as a Web server located at http://www-gsd.lbl.gov/vista. It is used extensively in the Genome Science Department for mouse-human comparative studies and has become the main comparative sequence analysis tool of several large sequence generation centers.

  1. Mayor C., Brudno M., Schwartz J. R., Poliakov A., Rubin E. M., Frazer K. A., Pachter L. S. and Dubchak I. (2000) VISTA: Visualizing global DNA sequence alignments of arbitrary length. Bioinformatics, 16, 1046-1047.
  2. Dubchak I., Brudno M., Loots G. G., Mayor C., Pachter L., Rubin E. M. and Frazer K. A. (2000) Active conservation of noncoding sequences revealed by three-way species comparisons. Genome Research, 10.

90. Beyond Terascale Biological Computing: GIST and Genomes To Life

Philip LoCascio, Doug Hyatt, Frank Larimer, Manesh Shah, Inna Vokler, and Ed Uberbacher

Computational Biology Program, Life Sciences Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee

http://compbio.ornl.gov/

locasciop@ornl.gov

High performance computing played a critical part in the successful completion of the working draft of the human genome, and continues to be necessary to handle the ever-increasing flood of new biological data. We have successfully met the first rounds of such computing for biology with the Genomic Integrated Supercomputing Toolkit (GIST). GIST provides a transparent, fault-tolerant interface for the research community to an ever-increasing suite of accelerated massively parallel biological applications. It also demonstrates key concepts that can be extended for large-scale computation for the Genomes to Life (GTL) program. GIST and associated data sets are accessible via the WWW interfaces of the Genome Analysis Toolkit, the Genome Channel, and other ORNL tools, and, optionally, via a command-line interface for biologists developing new applications.

With the advent of the DOE Genomes To Life Program, we are actively developing new technologies that will help biologists link large-scale, experimentally derived biological data with increasingly sophisticated computational analyses. Our basic approach is to support a mode of operation in which computational biology is concerned with transactional tool usage upon data and with the construction of “recursive” pipelines of analyses, which ultimately execute on supercomputer resources (via GIST). The central theme is to organize the libraries of biological tools around the available biological data types, and to use the existing methodology of context-dependent XML schemas to classify both the data types and the biological tools. In this way, it becomes possible for users to (i) automatically detect which tools and which data can be combined in a valid operation, (ii) configure linked sets of analysis steps without detailed knowledge of computing or tools, and (iii) record transactions in an RDBMS to look for dependencies and redundancies. Furthermore, existing user interfaces to biological tools in any browser format can be instantly coupled with the appropriate tool-data combinations and linked transparently to high-performance operations.
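
A minimal sketch of points (i) and (ii) follows: a registry of tools keyed by input and output data type, from which a valid linear pipeline can be assembled automatically. The tool names, data-type labels, and greedy chaining strategy are assumptions of the sketch, not the GIST implementation or its XML schemas.

    from dataclasses import dataclass

    @dataclass
    class Tool:
        name: str
        input_type: str
        output_type: str

    # Toy registry keyed by data type; names and types are illustrative only.
    REGISTRY = [
        Tool("repeat_masker", "FASTA", "MaskedFASTA"),
        Tool("gene_finder", "MaskedFASTA", "GeneModels"),
        Tool("protein_translator", "GeneModels", "ProteinFASTA"),
        Tool("domain_search", "ProteinFASTA", "DomainHits"),
    ]

    def tools_accepting(data_type):
        """Which registered tools can validly be applied to data of this type?"""
        return [t for t in REGISTRY if t.input_type == data_type]

    def build_pipeline(start_type, goal_type):
        """Greedily chain tools from one data type to another (assumes a linear path)."""
        pipeline, current = [], start_type
        while current != goal_type:
            candidates = tools_accepting(current)
            if not candidates:
                raise ValueError(f"no tool accepts data of type {current}")
            tool = candidates[0]
            pipeline.append(tool.name)
            current = tool.output_type
        return pipeline

    print(build_pipeline("FASTA", "DomainHits"))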

Where available, existing biomedical community templates are being used as the basis for standards that could be established across the GTL enterprise. With new data types rapidly becoming available, a strategy that reuses existing tools and interfaces, supports new tools, and makes it possible to combine tools to perform novel analyses will be the most effective way to manage software complexity and development cost. Additionally, where data types are not “naturally” aligned, it will be possible to create “filters” for the most common types of conversions, e.g., FASTA → Masked FASTA.

This intermediate software layer can serve as an effective conduit between users, with their novel and complex biological data sets, and the emerging beyond-terascale computing infrastructure. Using this approach, it should be possible to create friendly interfaces to new classes of algorithms that are built of both new and old components, with the flexibility necessary to tackle biological problems of increasing complexity. Libraries of software for tasks such as metabolic pathway reconstruction, gene regulatory network modeling, and cell modeling can be constructed, supported, and utilized by the community if organized and accessed in this manner.


91. A Computational Pipeline for Genome-Scale Analysis of Protein Structures and Functions

Serguei Passovets, Manesh Shah, Li Wang, Dong Xu, and Ying Xu

Life Sciences Division, Oak Ridge National Laboratory

xyn@ornl.gov

We expect that over 1000 genomes will be sequenced within the next 5-10 years. A significant percentage of the genes identified computationally in these genomes will not have known functions detectable by sequence-based homology search tools such as PSI-BLAST. In our recent experience in CASP4, we found that threading-based protein fold recognition tools, such as PROSPECT, can clearly detect more remote homologs than sequence-based methods can. We have recently developed a computational pipeline for automated protein structure prediction. The main components of the pipeline are: (a) a toolkit consisting of essential protein analysis tools, (b) a client/server system which provides access to the tools, (c) a pipeline manager which coordinates the processing tasks for a given analysis request, and (d) a web interface for query submission.

The pipeline operations can be divided into three distinct phases: 1) protein triage, 2) threading-based structure prediction, and 3) sequence-based function determination. The protein triage phase uses ProDom (for domain parsing), SOSUI (for classification as a globular or membrane protein), SignalP (for identifying signal peptide cleavage sites), and PSI-BLAST (for sequence homology searches against PDB, Swiss-Prot, and other databases). The structure prediction phase uses SSP (a secondary structure prediction tool developed by our group), PROSPECT (for protein fold recognition), MODELLER (for atomic model construction), and WHATIF (for structure quality assessment). The sequence-based function determination phase (not yet implemented) will use the protein family classification tools Pfam, Motif, and PRINTS. The pipeline manager invokes different tools depending on the user input and the logic of the prediction process, and controls the data and analysis flow of the pipeline. XML technology is used for data exchange between the web interface, the pipeline manager, and the tools.
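
As a small, hedged illustration of the XML-based data exchange mentioned above, the sketch below builds an analysis-request message with Python's xml.etree.ElementTree and shows how a pipeline manager might parse it to dispatch the next tool. The element names, attributes, and phase-to-tool mapping are invented for the example, not the pipeline's actual message format.

    import xml.etree.ElementTree as ET

    # A hypothetical analysis request as it might be passed from the web interface
    # to the pipeline manager (element and attribute names are made up for the sketch).
    request = ET.Element("analysisRequest", id="Q0001")
    ET.SubElement(request, "sequence", name="hypothetical_protein_1").text = (
        "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
    ET.SubElement(request, "phase").text = "triage"
    message = ET.tostring(request, encoding="unicode")
    print(message)

    # The pipeline manager parses the message and decides which tool to invoke next.
    parsed = ET.fromstring(message)
    phase = parsed.findtext("phase")
    seq = parsed.find("sequence").text
    tool = {"triage": "PSI-BLAST", "structure": "PROSPECT"}.get(phase, "unknown")
    print(f"dispatching {len(seq)}-residue query {parsed.get('id')} to {tool}")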

Initial applications of the pipeline will be to proteins of the cyanobacterial genomes currently being annotated at ORNL.

Reference:

  1. D. Xu, O. Crawford, P. LoCascio, and Y. Xu, “Application of PROSPECT in CASP4: characterizing protein structures with new folds”, Proteins: Structure, Function, Genetics special issue on CASP4 (by invitation), 2001 (in press).

92. WIT3 – A New Generation of Integrated Systems for High-Throughput Genetic Sequence Analysis and Metabolic Reconstructions

N. Maltsev, G. X. Yu, E. Marland, S. Bhatnagar, R. Lusk, and E. Selkov

Mathematics and Computer Science Division, Argonne National Laboratory

maltsev@mcs.anl.gov

During the past decade, the scientific community has witnessed an unprecedented accumulation of gene sequence data and of data related to the physiology and biochemistry of organisms. The availability of such information allows the focus of investigation to shift from individual genes and proteins to the functionality of biological systems as a whole, moving to the next level of understanding of life processes. The development of integrated computational environments that provide access to genomic data, to information describing metabolic and regulatory networks, and to computational tools for navigation, comparison, cross-correlation, and analysis of these data is therefore critical for the further advancement of biological science.

During the past several years, the Computational Biology group at Argonne has designed and implemented a family of systems for genetic sequence analysis and metabolic reconstruction of newly sequenced genomes. The WIT2 system developed at Argonne National Laboratory (http://wit.mcs.anl.gov/WIT2) is an interactive, integrated information system that is used by the scientific community worldwide for comparative analysis of genomes and for metabolic reconstructions from sequence data. However, advances in software development and in computational capabilities now offer the opportunity to significantly enhance the capability of the WIT2 system to perform high-throughput analysis of microbial genomes at a rate compatible with the increased rate of genome sequence completion.

The Computational Biology group at ANL is now developing the next generation of such systems, WIT3. WIT3 currently contains analyses of 74 completely sequenced microbial genomes and has the following new features that improve performance and enhance genome analysis:

a)  WIT3 is based on a relational database (Oracle). This architecture increases the overall performance and robustness of the system and simplifies the addition of new genomes and the update procedures.

b)  A combination of SQL and Perl-based search engines allows fast execution of complex queries.

c)  Representation of the data in a structured XML format simplifies analysis, visualization, and exchange of the data.

We have also developed a number of new tools and algorithms, including clustering algorithms, for automated analysis of the genomes. These clustering algorithms allow the results produced by a variety of comprehensive bioinformatics tools (e.g., InterPro, Blocks, CATH) to be integrated, increasing the sensitivity and reliability of genetic sequence analysis. We are developing tools to support comparative analysis of metabolic and regulatory networks, as well as user-driven and automated metabolic reconstructions from sequence data. During the past months we have developed a new user interface that allows extensive visualization of the sequence data (e.g., domain organization, genetic maps, conserved chromosomal gene clusters, phylogenetic trees). New tools for visualization and navigation of metabolic networks are also being developed.

Data processing is done using the scalable supercomputing resources available at MCS.


93. Comparative and Collaborative Bioinformatics Systems to Promote Mammalian Phenotype Analysis and the Elucidation of Regulatory Networks

Erich Baker1,5,6,7, Doug Hyatt1,5, Barbara Jackson1,2,6,7, Gwo-Liang Chen1,6,7, Denise Schmoyer1,2,6, Yesim Aydin-Son1,5, David McWilliams1,5, Fred Baes1,6,7, Stefan Kirov1,5, Michael Galloway1,6, Michael Leuze2, Line Pouchard2, Brynn Jones1, Ed Michaud1,5, Bem Culiat1,5, Gene Rinchik1,5,7, Dabney Johnson1,5,7, Ed Uberbacher1,5, Darla Miller1, 7, Frank Larimer1,5, Jay Snoddy1,4,5,6,7, ORNL Life Sciences Division, and the Tennessee Mouse Genome Consortium

1Life Sciences Division, Oak Ridge National Laboratory
2Computer Science and Mathematics Division, Oak Ridge National Laboratory
3University of Tennessee Health Science Center
4University of Tennessee Center of Excellence in Genomics and Bioinformatics
5UT-ORNL Graduate School in Genome Science and Technology
6Tennessee Mouse Genome Consortium—Bioinformatics Core
7Tennessee Mouse Genome Consortium—Neurophenotyping project, PI: Dan Goldowitz UTHSC

fwl@ornl.gov

Phenotypes are complex, highly diverse across species, and variable within populations of a species. We will soon have complete lists of the genes in the mouse, human, and a few other metazoan species, but we still do not understand how these genes act together to create phenotypes from genotypes. Gene regulatory networks (GRNs) are thought to be a major component in “reading out” genotypes and creating phenotypic complexity, diversity, and variability. Analysis of the networks that create phenotypes will require new data and approaches. Collaborations under way at Oak Ridge National Laboratory (ORNL) and the Tennessee Mouse Genome Consortium (TMGC) (www.tnmouse.org) are good examples of what is needed. These studies will also require developing new aspects of collaborative and comparative bioinformatics. Our group, working with others, will need to create a Semantic Web for phenotype and GRN analysis. This semantic web should, like the current web, allow scientists to directly share and analyze data that is placed on the web; more importantly, data for a semantic web is organized so that computers can also directly analyze large data sets placed on the web. This computer-based analysis should create useful inferences and assist users in datamining.

Information systems developed to directly support the sociological aspects of web-based collaborations include a TMGC member database (ver. 1.0), TMGC protocol-expertise database (ver. 0.1), and a collaborative TMGC Web Content Management System (ver. 1.0). Other bioinformation systems discussed below are needed to support four different, but interrelated, research approaches. The first two approaches include alternative ways to find and analyze new allelic variants that have a phenotypic consequence. The third approach will start from the molecular phenotypes of gene expression (e.g. RNA expression arrays or proteomes). The final interrelated approach will start from a comparative analysis of genome sequences in the chordates to lay the groundwork to understand the evolution of genomes and gene regulatory networks.

Within a species, genotypic variation—either created experimentally or occurring naturally—may cause system perturbations that result in observed variation in phenotypes. Our group is developing bioinformation systems to assist TMGC researchers in screening ENU-mutagenized mice for phenotypes of interest and, ultimately, in using these mutant mice to help pinpoint the individual genes and gene products that are involved in the complex networks that create the phenotypic traits of interest. MuTrack ver 1.0 has been completed and is routinely used in the TMGC (www.tnmouse.org/mutrack/index.php). The Web-based user interface allows researchers to enter, share, and analyze data about mice shipped among several institutions and labs. New systems, Elector and GeneKeyDb, are being designed to help find and analyze candidate genes in targeted chromosomal regions. In addition, another system, MuGnoSys ver 0.5, allows researchers to develop complex and detailed knowledge about the different mutant mouse strains produced by TMGC or ORNL. These systems are being used to screen for neurological phenotypes but were developed to be scalable and adaptable for use in other rodent phenotype analyses, including detailed phenotypic analysis in mutants created by targeted mutagenesis or in the mutants found in the second approach discussed below.

ORNL is creating the Cryopreserved Mutant Mouse Bank (CMMB) of ENU-mutagenized mice that can be used to pursue a “gene-first” screening approach. This second approach is an alternative to the first, “phenotype-first”, screening approach. If bioinformatics and technology can help automate this screening, then large numbers of genes and SNPs can be screened at once, and this can become an efficient method for creating altered genes for further study. We have made progress in this bioinformatics automation. Additional comparative bioinformatics can also help prioritize testing of the discovered SNPs. If the screening is done on genomic sequence, for example, bioinformatics may suggest whether a SNP could be in a possible transcription factor binding site (see analysis below) or whether it is a SNP in a conserved protein-coding region that is likely to disrupt protein function.

Different alleles of a gene may perturb molecular aspects of these networked biosystems. Homeostatic regulation within these networks may tend to compensate for these perturbations and create different “molecular phenotypes” by affecting the expression of yet other genes in the network. RNA expression analysis (or proteome analysis) will help find these molecular phenotypes. Bioinformatics analysis of this molecular phenotype may help elucidate both additional participants in a network and the regulation of those networks. This analysis may also elucidate both the shared and different aspects of these networks in different cell types. GIMS ver. 1.0 is a system under development for RNA expression data that can serve as a platform for further data mining and analysis developments. Combined work with MuTrack and GIMS is planned that would help track and analyze molecular phenotype data of interest. Complementary bioinformatics expertise and molecular phenotype data available in TMGC will be applied collaboratively to quantitative analysis of molecular phenotypic traits and variations. ORNL-based experimentalists are also studying the RNA expression of a set of genes in a keratinocyte model system.

Although different species exhibit a remarkable phenotypic diversity, recent genome analysis in bilaterally symmetrical animals suggests a remarkable similarity in the genes that are used in important processes. Much biological diversity and complexity in the multicellular organisms may, in fact, be due to the evolution of gene regulatory networks (GRNs). Comparative sequence analysis and studying the evolution of GRNs may help highlight both the conservation of underlying networks and their important differences. To begin this analysis, we are using several different sets of genes as test cases to develop an integrated and automated pipeline to analyze orthologs and paralogs in chordates. This may detect conserved genome sequences and sequences that are candidates for cis-regulatory sites for gene transcription. These cis-acting control regions are key integrators in gene regulatory networks. To narrow down regions for detailed study, we are developing tools to find and analyze large Conserved Genome sequence Blocks (CGSBs) at high throughput. Other analyses will use these CGSBs to assist in better gene modeling and to suggest transcription factor binding sites. Several bioinformatics tools are being developed: these include a new phylogenetic footprinting tool, GeneTreeConstructor, Lenny alignment tools, and existing tools to be added to an integrated analysis pipeline. CGSBdb, GeneKeydb, and GeneEvodb are data resources that are under design and construction. We are initially developing these resources to analyze the gene sets that are being explored in the keratinocyte RNA expression system in the hopes that we may be able to correlate the results from the computational and the experimental approaches.
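As a rough illustration of the CGSB detection step, the following Python sketch scans a pairwise alignment with a sliding window and merges high-identity windows into candidate blocks; the window length, identity cutoff, and function name are illustrative assumptions, not the TMGC/ORNL implementation.

    def conserved_blocks(aln_a, aln_b, window=100, min_identity=0.75):
        """Merge overlapping high-identity windows of a pairwise alignment
        into candidate conserved genome sequence blocks (CGSBs).
        Window size and identity cutoff are illustrative assumptions."""
        assert len(aln_a) == len(aln_b)
        blocks, start = [], None
        for i in range(len(aln_a) - window + 1):
            pairs = zip(aln_a[i:i + window], aln_b[i:i + window])
            matches = sum(1 for a, b in pairs if a == b and a != '-')
            if matches / window >= min_identity:
                if start is None:
                    start = i                              # open a new block
            elif start is not None:
                blocks.append((start, i + window - 1))     # close the block
                start = None
        if start is not None:
            blocks.append((start, len(aln_a) - 1))
        return blocks

Blocks reported this way could then be post-processed, for example to set aside blocks that overlap known exons before proposing candidate cis-regulatory regions.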


94. The ORNL Genome Analysis Toolkit, Pipeline and DAS Server

Manesh Shah, Doug Hyatt, Frank Larimer, Philip LoCascio, Inna Vokler, and Edward C. Uberbacher

Oak Ridge National Laboratory

ube@ornl.gov

The Genome Analysis Toolkit (GAT) and Pipeline provide Internet configurable genome sequence analysis and annotation capabilities for both microbial and eukaryotic organisms. The GAT has been a major analysis engine for a large number of “microbe month” microbial genomes sequenced at the JGI, as well as for human, mouse, Phanerochaete (white rot fungus), and other microbes and eukaryotes. It provides a capability to present and update the JGI genome web pages and views of these many genomes in the ORNL Genome Channel. Usage of GAT outside ORNL has increased 20 fold in the last 12 months, such that the combined tools in the toolkit process upwards of 200 million bases per month (not counting ORNL), with major remote users including organizations such as the JGI, WUSTL, JAX, Accelrys, and Gene Logic.

We have continued to enhance and extend the capabilities of the Genome Analysis Toolkit with improvements to the individual tools incorporated in the toolkit, a redesign of the client-server system that provides access to the toolkit services, and significant refinements to the Web interfaces. In addition to interactive analysis, we have developed analysis pipelines to support comprehensive analysis of DNA sequences in “batch mode” for all supported genomes. These can be accessed at http://compbio.ornl.gov/tools/pipeline. A Java interface has been implemented for interactive, graphical visualization of the pipeline results. Data are also available via the newly implemented DAS (Distributed Annotation System) server at ORNL (http://genome.ornl.gov/das), which allows remote users to compare ORNL-generated annotation with that developed at other sites.
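For readers unfamiliar with DAS, a features request in the DAS/1 protocol takes the form /das/<source>/features?segment=<reference>:<start>,<stop> and returns DASGFF XML. The Python sketch below issues such a request; the data-source name used here is a hypothetical placeholder, since the abstract does not list the ORNL source names.

    import urllib.request
    import xml.dom.minidom

    # Hypothetical DAS/1 features request; "human_example" is a placeholder,
    # not an actual ORNL data-source name.
    url = ("http://genome.ornl.gov/das/human_example/features"
           "?segment=chr22:1000000,1100000")

    with urllib.request.urlopen(url) as response:
        dasgff = xml.dom.minidom.parseString(response.read())

    # DAS/1 returns DASGFF XML; print each feature's label and coordinates.
    for feat in dasgff.getElementsByTagName("FEATURE"):
        start = feat.getElementsByTagName("START")[0].firstChild.data
        end = feat.getElementsByTagName("END")[0].firstChild.data
        print(feat.getAttribute("label"), start, end)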

The toolkit now includes a wide variety of genome analysis tools. Gene finding systems include new versions of GrailEXP (version 3.3) (http://compbio.ornl.gov/grailexp) for human, mouse, and Phanerochaete, Genscan for eukaryotic genomes, and Generation (http://compbio.ornl.gov/generation) and Glimmer for microbial gene prediction. The Grail suite of tools includes identification of CpG islands, polyA sites, simple and complex repetitive elements, and BAC ends. Also included are the NCBI STS E-PCR, Pfam, RepeatMasker, and tRNAscan-SE systems. Sequence similarity and protein domain search tools include NCBI BLAST, Baylor Beauty post-processing, and the InterProScan analysis system.

The client-server system has been redesigned to facilitate handling of increased load on the system. The client-server protocol now incorporates issuance of a ticket (request ID), which eliminates the need for maintaining persistent client-server connections and allows the client to retrieve the results later. The server distributes incoming requests intelligently across the available pool of compute server machines, based on the current load on each server. Highly compute-intensive service requests (such as BLAST, Pfam, and RepeatMasker) are transparently redirected to the GIST (Genomic Integrated Supercomputing Toolkit) server running on ORNL’s IBM RS/6000 SP supercomputer.
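The ticket-based protocol amounts to a submit-then-poll interaction, sketched below in Python; the endpoint paths, parameter names, and the "PENDING" status convention are hypothetical placeholders, since the abstract does not specify the actual request format.

    import time
    import urllib.parse
    import urllib.request

    BASE = "http://compbio.ornl.gov/tools/pipeline"   # endpoint paths below are hypothetical

    def submit(sequence, tool="grailexp"):
        """Submit an analysis request; the server replies with a ticket (request ID)."""
        data = urllib.parse.urlencode({"seq": sequence, "tool": tool}).encode()
        with urllib.request.urlopen(BASE + "/submit", data) as r:
            return r.read().decode().strip()

    def fetch(ticket, poll_seconds=30):
        """Poll with the ticket until the analysis finishes.  No persistent
        connection is held between submission and retrieval."""
        while True:
            with urllib.request.urlopen(BASE + "/result?ticket=" + ticket) as r:
                body = r.read().decode()
            if body != "PENDING":                     # placeholder status convention
                return body
            time.sleep(poll_seconds)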


95. GrailEXP: Gene Recognition Using Neural Networks and Similarity Search

Doug Hyatt, Frank Larimer, Philip LoCascio, Victor Olman, Manesh Shah, Ying Xu, and Edward C. Uberbacher

Oak Ridge National Laboratory

ube@ornl.gov

GrailEXP 3.3 (October, 2001) represents the latest technology in genefinding. Many improvements have been made to GrailEXP over the past year, including the addition of alternative splicing recognition, the creation of systems for more eukaryotic organisms, the incorporation of GrailEXP into ORNL's high performance computing framework, and the provision of the executables to the academic/nonprofit community. Many future improvements are also under development, including a detailed protein homology search, a prototype genefinding system based on comparison between human and Fugu rubripes, an interactive Java interface for the GrailEXP suite, the addition of more organisms to the system, and the extension of the gene finding code to recognize more forms of alternatively spliced genes.

Each of the three codes in GrailEXP has been significantly enhanced since the last contractors’ meeting. More Grail modules have been added to Perceval, the exon prediction code, including one for Phanerochaete chrysosporium, the white rot fungus genome sequenced by JGI. The accuracy of the alignments produced by Galahad, the EST alignment code, has dramatically improved; Galahad aligned known CDS entries in a test set against the genomic sequence with 99% accuracy. Finally, many significant additions have been made to Gawain, the gene modeling algorithm, including the recognition of alternatively spliced genes, “gluing” code to merge gene models that have been accidentally broken, and flexible rule sets for intron sizes and other parameters in different eukaryotic organisms. Under development are a protein homology code based on BLASTX, a Java interface for the entire suite of tools, a more streamlined method for the automated training of new organisms, and a module based on aligning genomic sequences from different organisms (such as human, mouse, and Fugu).

The public interface to GrailEXP has proven very successful. Many users are now regularly utilizing gene predictions from GrailEXP via one of three methods. The most popular remains the Genome Channel (http://compbio.ornl.gov/channel/), which contains the results of running GrailEXP on the entire human and mouse genomes. In addition, the GrailEXP analysis page (http://compbio.ornl.gov/grailexp/) receives over 3000 requests per month from universities, nonprofit institutions, and companies around the world and processes an average of 63 million sequence bases per month, not counting requests from ORNL. Finally, the executables have been deployed at over 125 academic/nonprofit institutions. A publication for the latest version of GrailEXP is in preparation for 2002.


96. The Genome Channel: A Foundation for Genomes to Life and Comparative Genomics

Miriam Land, Frank Larimer, Jay Snoddy, Denise Schmoyer, Doug Hyatt, Manesh Shah, Inna Vokler, Philip LoCascio, Gwo-Liang Chen, Loren Hauser, and Ed Uberbacher

Computational Biology Program, Life Sciences Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee

http://compbio.ornl.gov/

ube@ornl.gov

Representing DOE’s unique perspective on eukaryotic and microbial genomes, the Genome Channel has made major improvements to its usability, organism coverage, and supported types of analysis. The Channel and related web sites receive an average of 25,000 web hits per day, with usage comparing favorably with other major genome web resources. The Genome Channel contains over 20 unique microbial genomes not available on other sites and presents more types of human and mouse gene modeling than any other site. Integration of these many organisms in a single framework is making it possible to develop new tools and views that will help GTL researchers use comparative genomics to identify regulatory motifs and other conserved non-coding genome elements.

The Genome Channel has improved navigation tools, which provide faster and easier access to the data. A new text-based home page has links for quick access to human and mouse data and an ever-increasing list of both finished and draft microbial genomes. Estimates of eukaryotic gene content in the Channel include five types: gene models based on GrailEXP, Genscan, GenBank annotation, U. Penn DOTS EST assemblies, and modeling based on GenBank RefSeq mRNAs. Three of the methods include prediction of gene variants, and these can be compared along with the evidence supporting each model. tRNA gene predictions have been added to both the text and graphical display. Microbial gene modeling is based on three methods (Generation, Glimmer, and Critica) plus an additional combined ORNL gene call based on these three methods. These suites of methods and organisms make the views provided for both eukaryotic and microbial genomes richer, more accurate, and more comprehensive than in other available resources. Many of the web pages now include links to ‘lists’ of data which can be downloaded and used with other tools, for example, lists of contigs, genes, or repeats on a chromosome or across an entire organism.
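One simple way to combine calls from several microbial gene finders is a coordinate-overlap vote, sketched below in Python; this is purely illustrative and is not the ORNL combining rule, which the abstract does not describe.

    def combined_calls(generation, glimmer, critica, min_votes=2):
        """Each argument is a list of (start, end, strand) gene calls from one
        method.  Keep a call if at least min_votes methods predict a gene
        overlapping more than half of it on the same strand.  Illustrative only."""
        def overlaps(a, b):
            return (a[2] == b[2] and
                    min(a[1], b[1]) - max(a[0], b[0]) > 0.5 * (a[1] - a[0]))

        methods = [generation, glimmer, critica]
        combined = []
        for i, calls in enumerate(methods):
            for call in calls:
                votes = 1 + sum(any(overlaps(call, other) for other in methods[j])
                                for j in range(len(methods)) if j != i)
                if votes >= min_votes and not any(overlaps(call, kept) for kept in combined):
                    combined.append(call)
        return combined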

Annotation for JGI-sequenced draft microbial genomes is accessible with links to several different summaries of the data, which can be downloaded and put into other tools. These include Btab output for the predicted genes, Pfam summaries, and COG analysis. BLAST, Pfam, and COG analyses can all be refreshed to get the latest interpretation of the data.

For GTL to be successful with Goal 2 (Gene Regulatory Networks), foundational data must be available that describes gene regulatory regions and potential transcription factor binding sites in genomes of interest. The most powerful paradigm for obtaining such information is comparative genomics, since regulatory signals are often conserved among genomes and also repeated within each genome proximal to genes of related function. Such inter- and intra-genome comparisons provide the basis for estimating regulatory signals and the role of genes in cell processes. The Channel and its underlying data warehousing structure and update processes are a natural foundation for facilitating comparison of genomes, including detailed comparison and analysis of regulatory units and structures in both microbes and eukaryotes. Steps are being taken to develop the Channel as a comprehensive resource for (1) facilitating and viewing large-scale genome alignment; (2) clustering and comparing gene/protein orthologs and paralogs and identifying conserved blocks and potential regulatory motifs in their upstream regions; (3) recognizing operon structure and comparing regulatory signals in operons of related species; and (4) identifying shared regulatory elements in regulons in genomes of interest to the GTL program.


97. Automated Visualization of Large Scale Bacterial Transcriptional Regulatory Pathways

Carla Pinon, Amit Puniyani, Peter Karp, and Harley McAdams

Stanford University and SRI

amit8@stanford.edu

The objective of the Stanford Microbial Cell Project is to identify the complete transcriptional regulatory network of the aquatic bacterium Caulobacter crescentus. We are faced with the problem of comprehending the structure and function of these complex, highly interconnected networks. The EcoCyc system developed at SRI provides tools for visualizing bacterial metabolic pathways in a visually appealing manner. We are building on the software underlying EcoCyc (called PathwayTools) to add the capability to display and explore bacterial regulatory networks. By dynamically mapping gene expression levels determined by time sequences of RNA microarray assays onto the regulatory links, we can show the flow of control through the network as the cell cycle progresses or as the cell responds to changes in its environment. In this work we are taking advantage of methods and ideas previously used to illustrate social networks and the Internet.


98. Integrating Computational and Human-Curated Annotations for the Mouse Genome

Carol J. Bult and the Mouse Genome Informatics Group

The Jackson Laboratory, Bar Harbor, ME, USA 04609

cjb@informatics.jax.org

Computationally derived genome annotations are critical entry points into genome biology; however, these also need to be integrated with existing biological knowledgebases to enable the research community to use the sequence information effectively in their research programs. The mission of the Mouse Genome Informatics (MGI) group (http://www.informatics.jax.org) is to evaluate computationally predicted gene models and integrate them with such information as genome location, alleles and phenotypes, homology, and gene expression patterns. The MGI group relies heavily on the expertise of domain specialists to analyze and interpret biological data from diverse sources as part of an overall strategy to create and maintain a highly integrated and well-curated genome biology database for the laboratory mouse1. A major challenge facing MGI, and all model organism databases, is how best to incorporate information from large and constantly changing genomic sequence data streams with curation processes that rely heavily on human reasoning and interpretation.

Within the MGI group we are meeting the large-scale genome annotation challenge by developing an annotation infrastructure that combines both automated and human-centric curation processes. I will present an overview of this infrastructure and discuss design and implementation issues relative to our current work on integrating and updating the annotations of the mouse full-length cDNA sequence data generated by the RIKEN group in Japan2. I will also present examples of our curatorial strategy for analyzing the finished BAC sequences for the C57BL/6J strain of mouse that are available from GenBank and integrating these data with the wealth of genetic and biological knowledge about the laboratory mouse that is already available from MGI as well as other informatics resources.

  1. Bult et al., 2000. Proceedings of the IEEE International Symposium on Bio-Informatics and Biomedical Engineering: 29-32.
  2. The RIKEN Genome Exploration Research Group Phase II Team and the FANTOM Consortium. 2001. Nature 409: 685-690.

99. Comparative Sequence-Based Approach to High-Throughput Discovery of Functional Regulatory Elements

Gabriela G. Loots, Ivan Ovcharenko, Inna Dubchak, and Edward M. Rubin

LBNL/DOE/Genome Sciences, 1 Cyclotron Road, MS 84-171

ggloots@lbl.gov

A major challenge of the post-human-sequencing era is identifying and decoding the regulatory networks underlying gene expression embedded in the large sea of noncoding DNA. While computational tools aimed at predicting regulatory elements identify large numbers of false positives, traditional experimental techniques, though more accurate, are slow and labor-intensive. We have developed a computational tool, RegSeq: Regulatory Sequence Analyzer, for high-throughput discovery of cis-regulatory elements that combines transcription factor binding site prediction (TRANSFAC) and the analysis of inter-species sequence conservation (global alignments). This process reduces the number of predicted transcription factor binding sites by several orders of magnitude, eliminating the majority of false positive hits. To illustrate the ability of RegSeq to identify true transcription factor binding sites, we analyzed several experimentally characterized AP-1, NFAT, and GATA-3 binding sites in the well-annotated 1 Mb cytokine gene cluster (Hs5q31; Mm11). The exploitation of the orthologous human-mouse data set resulted in the elimination of 95% of the 38,000 binding sites predicted upon analysis of the human sequence alone, while it identified 87% of the experimentally verified binding sites in this region. Since this region harbors a cluster of cytokine genes regulated by the GATA-3 transcription factor, we used RegSeq to analyze the distribution of GATA-3 sites across the promoter regions of the 18 identified genes present on 1 Mb of orthologous human (5q31) and mouse (ch11) sequence. By searching the promoter sequences (2 kb upstream of the 5' UTR) for the presence of GATA-3 binding sites using the TRANSFAC database, we observed that GATA-3 binding sites were abundantly and randomly distributed throughout the promoters of all 18 genes. We failed to observe a bias in the distribution of GATA-3 sites favoring the genes known to be regulated by this transcription factor. Subsequent alignment of the human and mouse orthologous promoter regions revealed the presence of evolutionarily conserved GATA-3 sites only in the promoters controlling cytokine gene expression, the majority of which have previously been characterized as being GATA-3 responsive. By combining sequence motif recognition provided by transcription factor database searches with multiple sequence alignment of orthologous regions, we have developed a high-throughput strategy for filtering and prioritizing putative DNA binding sites involved in regulatory functions. The evolutionarily conserved transcription factor binding sites discovered by our study in the interleukin promoters serve as a genomically informed starting place for globally investigating the detailed regulation of this set of biomedically important genes.
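The filtering principle, keeping a predicted site only if it lies in a region conserved between the aligned human and mouse sequences, can be sketched as follows in Python; the toy WGATAR consensus, window size, and identity cutoff are illustrative assumptions rather than the TRANSFAC matrices or the RegSeq parameters.

    import re

    def conserved(pos, aln_human, aln_mouse, window=20, min_identity=0.8):
        """True if the alignment columns around pos exceed the identity cutoff.
        Window and cutoff are illustrative, not RegSeq settings."""
        lo = max(0, pos - window // 2)
        hi = min(len(aln_human), pos + window // 2)
        matches = sum(a == b and a != '-'
                      for a, b in zip(aln_human[lo:hi], aln_mouse[lo:hi]))
        return matches / (hi - lo) >= min_identity

    def filtered_sites(aln_human, aln_mouse, motif=r"[AT]GATA[AG]"):
        """Predict GATA-like sites on the human sequence (toy WGATAR consensus,
        not a TRANSFAC weight matrix) and keep only the conserved ones."""
        human_no_gaps = aln_human.replace('-', '')
        # Map positions in the gap-free sequence back to alignment columns.
        col_of = [i for i, c in enumerate(aln_human) if c != '-']
        return [col_of[m.start()] for m in re.finditer(motif, human_no_gaps)
                if conserved(col_of[m.start()], aln_human, aln_mouse)]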


100. Managing Targets and Reactions in a Finishing Database

Mark Mundt, Judith Cohn, Mira Dimitrijevic-Bussod, Marie-Claude Krawczyk, Roxanne Tapia, Al Williams, Larry Deaven, and Norman Doggett

DOE Joint Genome Institute and Center for Human Genome Studies, Los Alamos National Laboratory

mom@lanl.gov

The Los Alamos Center for Human Genome Studies is finishing human chromosome 16 employing a whole chromosome strategy. Using an Oracle database, single strand and low quality targets are maintained uniquely for the collection of overlapping projects on our minimal tiling path. While gaps remain in a project, a coordinateless definition of a target enables constant tracking of status to successful completion as well as eliminating the creation of duplicate reactions in different projects. One interesting case occurs when a new reaction is started within the boundaries of a target that must be split to record this history. Many SNPs and even multiple base pair differences are being documented in this manner of combining data in assemblies. The most important feature of this system, however, is that it provides an automated approach to supplying input/instruction lists for finishing robotics and operations employing the Q-Bot (cherry picking of subclones), Packard (cherry picking of DNAs), MerMades (96-channel oligonucleotide synthesizers), and 3700s (capillary sequencers) that keep our finishing efforts progressing. To date, we have tracked thousands of targets which have been successfully addressed.


101. Encoding Sequence Quality in BLAST Output by Color Coding

Sam Pitluck, Paul F. Predki, and Trevor L. Hawkins

U.S. DOE Joint Genome Institute, Walnut Creek, CA 94598

s_pitluck@lbl.gov

Tremendous amounts of draft sequence have been released into the public domain in recent years. While most of this sequence is of high quality, significant amounts of low quality sequence are still present. Because of this, draft sequence is of highest utility when the user is able to assess its quality. Although basecalling and assembly programs (such as Phred and Phrap) produce quality scores for each base, this information is typically lost by the time it reaches end users. We have added functionality to the BLAST program by color coding the output according to the quality scores. In another implementation, quality scores are directly displayed in the BLAST output. In either case, interpretation of BLASTing against draft sequence is significantly simplified. Our public web servers now support color-coded BLAST.
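A minimal illustration of the idea in Python: map each base's Phred quality score to a display color and emit HTML-colored sequence text. The score bins and colors are arbitrary choices for this sketch, not the JGI scheme.

    def color_for(qual):
        """Bin a Phred quality score into a display color (arbitrary bins/colors)."""
        if qual >= 40:
            return "black"   # high-confidence base
        if qual >= 20:
            return "blue"    # acceptable
        return "red"         # low quality; interpret matches here with caution

    def colorize(sequence, quals):
        """Wrap each base of a BLAST-reported segment in an HTML span colored
        by its quality score so low-quality regions stand out in the report."""
        spans = ('<span style="color:%s">%s</span>' % (color_for(q), base)
                 for base, q in zip(sequence, quals))
        return "".join(spans)

    # Example: colorize("ACGTACGT", [45, 42, 18, 9, 35, 40, 22, 50])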


102. Whole Genome Assembly with JAZZ

Jarrod Chapman, Nicholas Putnam, and Dan Rokhsar

U.S. DOE Joint Genome Institute, Walnut Creek, CA, 94598

dsrokhsar@lbl.gov

We present a new graphical algorithm for whole genome assembly that self-consistently treats paired-end constraints. The algorithm was designed to be scalable for application to gigabase-scale genomes and has been parallelized using MPI. To aid in the development and validation of the assembler and its outputs, we have also developed a suite of graphical tools for examining and manipulating large assemblies. For typical 6X projects, large scaffolds are recovered, with sequence accuracy better than one part in 10,000. We describe the basic algorithm, demonstrate the visualization tools, and present results for a 30 Mb fungal genome, a variety of microbes of varying sizes and depths sequenced at the JGI, and an initial assembly of the early whole genome shotgun sequencing of mouse.


103. Assembly and Exploration of the Public Human Genome Working Draft

Terrence S. Furey1, Jim Kent2, and David Haussler3

1Computational Biology Laboratory, Department of Computer Science, University of California Santa Cruz
2Department of Biology, University of California Santa Cruz
3Howard Hughes Medical Institute, University of California Santa Cruz

booch@cse.ucsc.edu

A program written by UCSC student Jim Kent, called GigAssembler, is used to periodically assemble a widely used public draft version of the human genome sequence using updated data from GenBank at the National Center for Biotechnology Information (NCBI). This assembly is steadily improving as the public sequencing consortium churns out new data. We will look at the coverage statistics of the latest assembly and then at the web tools available to explore it and what they find. The three most widely used public annotation browsers are the UCSC Genome Browser (genome.ucsc.edu), the Ensembl genome browser (www.ensembl.org), and the NCBI map viewer (www.ncbi.nlm.nih.gov/genome/guide/human), the latter based on NCBI’s own sequence assembly. We will focus on the UCSC browser, which shows a rich variety of data mapped to the genome sequence, including predicted genes, expressed sequence tags, full-length mRNAs, genetic and radiation hybrid map markers, cytogenetically mapped clones, single nucleotide polymorphisms, homologies with mouse and pufferfish, and more. These data are presented on different tracks of annotation that are contributed by the annotation team at UCSC and more than a dozen researchers worldwide. We briefly discuss how web-based data browsers such as this are accelerating biomolecular and biomedical research, and how scientists and engineers in other disciplines can contribute to the study of the human genome.


104. Shotgun Sequence Assembly Algorithms for Distributed Memory Machines

Frank Olken

Lawrence Berkeley National Laboratory

olken@lbl.gov

This work is concerned with methods for computing shotgun sequence assemblies on distributed memory parallel computers, in which individual computing nodes cannot contain the entire dataset.

For the overlap detection phase we use an algorithm modeled on distributed hash join algorithms, yielding an algorithm with linear speedup in the number of processors.

For the layout phase we use a connected component labeling algorithm to partition the data for distinct connected components (i.e., the data for each connected component is contained in a single node). Each connected component can then be processed separately, in parallel using conventional layout algorithms. The overall work is linear in the problem size. Note that this method requires prior effective removal of repetitive DNA sequences.
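A minimal Python sketch of the two ideas, assuming repeats have already been screened out: shared k-mers are hashed to decide which node examines each candidate overlap (the hash-join step), and a union-find structure labels connected components so each component can be laid out independently. The k-mer length and function names are illustrative.

    from collections import defaultdict

    K = 16  # illustrative seed length

    def partition_key(kmer, num_nodes):
        """Hash-join step: the same k-mer from any read maps to the same node,
        so candidate overlaps are detected without any node seeing all reads."""
        return hash(kmer) % num_nodes

    def candidate_pairs(reads, num_nodes, node_id):
        """Pairs of read IDs sharing a k-mer assigned to this node."""
        buckets = defaultdict(list)
        for rid, seq in reads.items():
            for i in range(len(seq) - K + 1):
                kmer = seq[i:i + K]
                if partition_key(kmer, num_nodes) == node_id:
                    buckets[kmer].append(rid)
        return {(a, b) for ids in buckets.values()
                for a in ids for b in ids if a < b}

    def connected_components(read_ids, overlap_edges):
        """Union-find labeling: each component (one layout subproblem) ends up
        with a single representative label."""
        parent = {r: r for r in read_ids}
        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x
        for a, b in overlap_edges:
            parent[find(a)] = find(b)
        return {r: find(r) for r in read_ids}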


105. Benefits of J2EE Architecture for Informatics Support of Genomic Sequencing

Roxanne Tapia, Judith Cohn, and Mark Mundt

DOE Joint Genome Institute and Center for Human Genome Studies, Los Alamos National Laboratory

rox@lanl.gov

At LANL, we have been building an informatics foundation for our next generation of Genomic Sequencing. Our primary goal is to provide integration in development, user interface, and data access for diverse components including, but not limited to, Laboratory Information Management Systems (LIMS), Quality Control, Sequence Analysis, Assembly and Annotation. We need an infrastructure that will allow quick adaptation to a dynamic, complex environment. Besides changeability, other characteristics of that environment include high-throughput, intensive processing, relatively few users on diverse operating systems, an existing Java codebase, lots of data, and a small bioinformatics team.

Given these characteristics, our challenge was to build an infrastructure that would allow us to support our current efforts and continue to evolve. Given a current Java codebase and skill set, combined with the openness of the Java Platform and the versatility of the J2EE (Java 2 Enterprise Edition) specification, Java was an easy choice for development. Because of the decision to leverage J2EE, we needed an application server that is J2EE-compliant. We also needed a mature database that could support large volumes of data, preferably one with Java and object support. Thus, Java, the SilverStream Application Server, and the Oracle RDBMS were chosen for our initial implementation.

Recently, we upgraded to the latest version of Oracle, and decided to replace SilverStream with Oracle 9i Application Server (9iAS). This transition has gone smoothly because we built our J2EE objects according to pure J2EE specifications, rather than using vendor-specific enhancements. This allowed us to switch application servers without an excessive migration cost. The poster will describe the benefits we have realized since adopting a J2EE architecture including flexibility of infrastructure, code reusability, and low cost of change.


106. Production Workflow Tracking and QC Analysis at the Joint Genome Institute

Heather Kimball, Stephan Trong, Art Kobayashi, Sam Pitluck, Yunian Lou, and Matt Nolan

U.S. DOE Joint Genome Institute, Walnut Creek, CA 94598

hlkimball@lbl.gov

The Joint Genome Institute Production Genomics Facility has produced over 2.75 billion bases of draft paired-end sequencing since January 1, 2001. Our sequencing methodologies incorporate two types of DNA template generation: inoculation/SPRI purification and Rolling Circle Amplification. In order to manage the flow of samples through these processes, a robust database tracking system was developed using ORACLE. The key elements that are tracked within the workflow system include:

  • Instruments
  • Operators
  • Protocols
  • Reagents
  • Dates and times
  • Quality scores and contamination information

Data input and reporting for the workflow have been produced using a combination of commercial database development software and in-house programs. These include ORACLE’s WebDB and Perl CGI programming. By leveraging the rapid report and form development cycle using WebDB and augmenting this with the flexibility of in-house programming, we have efficiently deployed a critical laboratory information management system for our data tracking.


107. Goals, Design, and Implementation of a Versatile MicroArray Data Base

Marc Rejali1, Marco Antoniotti1, Vera Cherpinsky1, Caroline Leventhal1, Salvatore Paxia1, Archisman Rudra1, Joe West2, and Bud Mishra1

1New York University Courant Bioinformatics Group
2Cold Spring Harbor Laboratory

mrejali@cat.nyu.edu

Many problems in functional genomics are being tackled using Microarray Technology. While this approach holds much promise for answering open questions in Biology, it poses significant problems from the "Data Management" point of view.

Our Bioinformatics group at NYU has been involved in several projects that use Microarray technology, for instance:

  • Genome Mapping and Probe Placement (in cooperation with CSHL)
  • Nitrogen Pathway analysis in Arabidopsis (in cooperation with NYU Biology Dept.)
  • Hallucinogen effects on brain functions (in collaboration with Mount Sinai School of Medicine)
  • Cancer related cell signaling using different cell lines (in cooperation with CSHL)

To address the needs of these collaborative research groups and others, we have developed the NYU Microarray Database (NYUMAD). Its functionality ranges from the storage of the data in relational database management systems to front-end capabilities for the presentation and maintenance of the data.

The database is a unified platform for understanding microarray-based gene expression data. The data can be output to a wide class of clustering algorithms, based on various "similarity measures" and various approaches to grouping. In particular, we have developed a new statistically robust similarity measure based on James-Stein shrinkage estimators and provided a Bayesian explanation for its superior performance. Additional research is focused on incorporating statistical tests for validation and measuring significance (e.g., jackknife and bootstrap tests). Finally, we plan to add an experiment design module that suggests how future array experiments should be organized, given an understanding of how past experiments have performed.

Most of the underlying DB schema design closely follows the specifications put forth by the Microarray Gene Expression Database group (http://www.ebi.ac.uk/microarray/MGED), especially when it comes to the XML-based MAML exchange format.

Functionality: The functionality of the NYUMAD system is summarized hereafter:

  • Microarray data is stored in relational data base management systems (RDBMS) using a database schema based on the MAML (Microarray Mark-up Language) specification.
  • Data is served to "clients" via the World Wide Web (WWW). Clients can be the NYUMAD Java applet that is part of the system described below, custom-built user programs, or simple HTTP text-based requests that retrieve MAML XML files. In the case of the NYUMAD Java applet, data retrieval is generally transparent to the user and is carried out as a natural part of using the GUI front-end (see below). For text-based requests, the returned data is in the MAML XML format.
  • Data submissions for updating existing data or inserting new data can be made using the NYUMAD Java applet client, by custom-built user programs, or by HTML forms that directly access the middle-tier server. As with data retrieval, the GUI front-end capabilities of the NYUMAD Java applet make data submission transparent to the user.
  • The NYUMAD applet presents data in a logical manner and allows easy navigation through the data. As the user navigates through the data, the required information is retrieved. It also allows straight-forward updating of existing data and the insertion of new data. The NYUMAD applet can also retrieve data in the MAML XML format which can then be cut and pasted to other applications.
  • The NYUMAD system integrates several clustering algorithms and libraries in order to provide a complete service to the user. The integration automatically accesses the database and avoids tedious data reformatting and translation tasks.

Architecture: The NYUMAD system has a three-tier architecture, described below.

  • Front tier The Front tier comprises the NYUMAD applet and/or user's custom-built programs and HTML forms. The applet is written in Java and interacts with the middle tier using HTTP, requesting or submitting data using either a text based format or (Java) object serialization. Custom applications interact using HTTP and a text based format. The Microarray data in text based interactions is in MAML XML format.
  • Middle tier The middle tier is provided by NYUMAD servlets written in Java that handle requests and submissions from the front tier. The middle tier is invisible to the end user. Requested data is retrieved from the RDBMS in the back tier using JDBC and then sent to the front tier either in MAML XML format or in the form of serialized objects for the NYUMAD applet or applications capable of interpreting the Java Object Serialization protocol. The middle tier servlets provide all the application logic necessary to ensure the integrity of the data and adherence to necessary rules, constraints and security restrictions. In addition the middle tier caches data, allowing for faster data retrieval and better scalability. The middle tier can access multiple back tier databases, allowing for data distribution and scalability. The middle tier uses the server's file management system to store and retrieve large files such as image files. In addition, the functional abstraction provided by the middle tier shields the front tier from changes in the back end structure, thus ensuring development extensibility and flexibility for the system.
  • Back tier The back tier comprises the relational database management systems (RDBMS, currently PostgreSQL running on a 6-node Linux cluster) that store the Microarray and related data. It also includes the file management system used to store large files such as image files. The database schema is based on the MAML specification adapted to relational systems. Since the database access code in the middle tier uses JDBC, databases from different vendors can be used with relatively little additional code.

We have built the NYUMAD database as part of our integrated system for bioinformatics centered on the VALIS tool. We took extreme care in making the system "distributable" from the start, by clearly defining a three-tiered architecture that allows us to concentrate on different aspects of the design. We also closely followed the MAML standard put forth in the spring of 2001 by the MGED group. It is our intention to follow up on this design and to augment the NYUMAD DB with a module capable of communicating with other systems on the basis of the new MAGE-ML OMG object model standard.


108. CLUSFAVOR – Computer Program for Cluster and Factor Analyses of Microarray-Based Gene Expression Profiles

L. E. Peterson

Departments of Medicine, Molecular and Human Genetics, and Scott Department of Urology, Baylor College of Medicine, Houston, Texas 77030

CLUSFAVOR is a Windows-based computer program for performing cluster and factor (principal components) analyses of microarray-based gene expression profiles. CLUSFAVOR was designed for the Windows 95/98/NT/2000/XP operating systems so that users can easily export images to Windows objects and/or JPG files for import into MS-Word and MS-PowerPoint. A significant amount of programming for CLUSFAVOR has focused on optimization and efficient use of resources to minimize run-times and memory usage. Recent enhancements to CLUSFAVOR in Version 5.0 include the use of jackknife distance functions to reduce false positives due to outlier effects and faster eigenanalyses. This computer demo will include a tutorial on the goals and assumptions of cluster and factor analyses, input data format, program usage, interpretation of results, and import of results into other Windows-based software. A CLUSFAVOR installation package and user guide can be downloaded from http://mbcr.bcm.tmc.edu/genepi.


109. Partitioning Large-Sample Microarray Transcription Profiles for Adaptive Response in Human Lymphoblasts Using Principal Components Analysis

L. E. Peterson1, M. A. Coleman2, E. Yin2, B. J. Marsh2, K. Sorensen2, J. Tucker2, and A. J. Wyrobek2

1Departments of Medicine, Molecular and Human Genetics, and Urology, Baylor College of Medicine, Houston, TX 77030
2Biology and Biotechnology Research Program, Lawrence Livermore National Laboratory, Livermore, CA 94551

peterson@bcm.tmc.edu

We are currently using the Affymetrix U95A oligonucleotide array (12,626 genes) to detect differentially expressed genes for adaptive response in human lymphoblastoid cells. Major findings to date for cells given 2 Gy vs. cells given 2 Gy six hours after a priming dose of 5 cGy indicate ~2700 genes with marked changes across adapting and nonadapting conditions, 101 genes strongly upregulated, and 110 genes strongly downregulated under adapting conditions. This paper describes analytic research involving cluster and principal components analysis (PCA) to partition natural groupings of genes with similar transcription profiles for adaptive response. We used the CLUSFAVOR algorithm for cluster analysis and PCA. Because CLUSFAVOR uses the “foolproof” Jacobi method for eigenanalysis, significant program modifications were needed in order to reduce the prohibitively long calculation times normally required for Jacobi extraction of eigenvalues and eigenvectors from the large 12,626 x 12,626 (gene-by-gene) correlation matrix. Program enhancements resulted in substantial reduction of run-times from days to hours. More than 90% of the total variance in the input data was accounted for by extracting factors whose eigenvalues exceeded unity. Run-times for varimax orthogonal rotation were insignificant when compared with run-times for eigenanalysis. Bipolar factors containing strong positive and negative loadings were used to identify two unique groups of genes, since expression profiles of genes that load positive are the opposite of genes that load negative on the same factor. While cluster analysis generated a single dendrogram (tree branch) containing every gene in the input data, PCA assembled groups of genes with similar expression profiles. Image displays for cluster analysis and PCA of adaptive response transcription profiles will be presented. Statistical modeling of replicates and outliers will also be discussed.
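The partitioning step can be illustrated with a small eigenanalysis in Python: retain factors whose eigenvalues of the gene-by-gene correlation matrix exceed one, then split each bipolar factor's genes by loading sign. This sketch uses numpy's symmetric eigendecomposition rather than the Jacobi method and omits the varimax rotation; the loading cutoff is an arbitrary choice.

    import numpy as np

    def bipolar_factor_groups(expr, loading_cutoff=0.5):
        """expr: genes x samples expression matrix.  For each retained factor
        (eigenvalue > 1 of the gene-gene correlation matrix), return the indices
        of genes loading strongly positive and strongly negative."""
        corr = np.corrcoef(expr)                  # gene-by-gene correlations
        eigvals, eigvecs = np.linalg.eigh(corr)   # symmetric eigendecomposition
        order = np.argsort(eigvals)[::-1]
        eigvals, eigvecs = eigvals[order], eigvecs[:, order]
        groups = []
        for k in np.flatnonzero(eigvals > 1.0):   # keep factors with eigenvalue > 1
            loadings = eigvecs[:, k] * np.sqrt(eigvals[k])
            groups.append((np.flatnonzero(loadings >= loading_cutoff),
                           np.flatnonzero(loadings <= -loading_cutoff)))
        return groups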


110. EXCAVATOR: Gene Expression Data Analysis Using Minimum Spanning Trees

Ying Xu, Dong Xu, Victor Olman, and Li Wang

Protein Informatics Group, Life Sciences Division, Oak Ridge National Laboratory

xyn@ornl.gov

Data clustering represents an essential step in large-scale data mining. While data clustering has been an active research subject for many years, a number of challenging problems remain, which have hindered in-depth applications of the technique. Among these challenges are (a) existing clustering methods generally do not guarantee globally optimal clustering results for multi-dimensional data; and (b) existing clustering methods often have problems with data sets containing clusters with complex boundaries. We have recently developed a new framework for multi-dimensional data clustering. The framework is based on a representation of multi-dimensional data as a minimum spanning tree (MST), a concept from graph theory. A key property of this representation is that each cluster in the data set corresponds to one subtree of the MST representing the data, which rigorously converts a multi-dimensional clustering problem to a tree partitioning problem. We have rigorously demonstrated that though the inter-data relationship is greatly simplified in the MST representation, no essential information is lost for the purpose of clustering. Our research has led to the discovery of a number of highly attractive properties of MSTs for data clustering, which can help overcome some of the long-standing problems facing clustering techniques. Among them are that (1) the simple structure of a tree facilitates efficient implementations of rigorous clustering algorithms, which can guarantee the global optimality for clustering; and (2) the tree structure captures the essential topological information while leaving out the detailed geometric information among data points, which makes the cluster shape information irrelevant and hence can handle clusters with extremely complex boundaries. Based on this framework, we have implemented a software package, EXCAVATOR, for clustering and analyzing microarray gene expression data. By taking advantage of the key properties of MSTs, EXCAVATOR provides a number of unique features for gene expression data analysis, including (i) a capability of identifying significant clusters from a very noisy background; (ii) a capability of doing information-constrained clustering (e.g., genes X, Y, and Z should or should not belong to the same cluster); (iii) a capability of identifying genes with similar expression patterns to a set of seed genes; and (iv) a capability of automatic determination of the most plausible number of clusters in a data set. We have applied EXCAVATOR to a number of expression data sets from various genomes. The test results are highly encouraging.
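A minimal Python sketch of the MST framework using scipy: build the minimum spanning tree of the pairwise-distance graph, cut the k-1 longest tree edges, and read off the resulting connected components as clusters. EXCAVATOR's actual tree-partitioning objectives (noise handling, constrained clustering, automatic choice of cluster number) go well beyond this.

    import numpy as np
    from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
    from scipy.spatial.distance import pdist, squareform

    def mst_clusters(expr, k):
        """Cluster the rows of an expression matrix into k groups by removing
        the k-1 longest edges of the minimum spanning tree of the distance
        graph.  A simple illustration, not the EXCAVATOR algorithm itself."""
        dist = squareform(pdist(expr))                # dense pairwise distances
        mst = minimum_spanning_tree(dist).toarray()   # tree edge weights
        edges = np.argwhere(mst > 0)
        weights = mst[mst > 0]
        # Drop the k-1 heaviest tree edges, splitting the tree into k subtrees.
        for i in np.argsort(weights)[::-1][:k - 1]:
            mst[tuple(edges[i])] = 0
        n_comp, labels = connected_components(mst, directed=False)
        return labels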

References:

  1. Y. Xu, V. Olman, and D. Xu, “Clustering Gene Expression Data Using a Graph-Theoretic Approach: An Application of Minimum Spanning Tree”, Bioinformatics, 2001 (in press).
  2. Y. Xu, V. Olman and D. Xu, “Minimum Spanning Trees for Gene Expression Data Clustering”, Proceedings of the Twelfth International Conference on Genome Informatics, Dec. 2001 (in press).
  3. D. Xu, V. Olman, L. Wang, Y. Xu, “EXCAVATOR: a software for gene expression data analysis”, 2001 (submitted).

111. Flexible Customization of Micro-Array Data Analysis Pipeline

Dong-Guk Shin1, Ravi Nori2, Jae-Guon Nam2, and Jeffrey Maddox2

1Computer Science and Engineering, University of Connecticut, Storrs, CT 06269-3155
2CyberConnect EZ, LLC, Storrs, CT 06268

shin@engr.uconn.edu

Micro-array data analysis requires access to data available from a wide variety of heterogeneous sources using many data analysis and visualization programs. Ideally, use of data-flow software would relieve scientists of tediously repetitive data handling, database querying, and analysis tasks. Another requirement is customization. Once constructed, the analysis flow should be easily modifiable to a new version in order to meet changing requirements. For example, scientists with different foci of micro-array data analyses may require different ways of choosing cDNA samples when designing gene chips.

Our approach has been developing a data-flow editing environment in which scientists can visually depict the flow of complex data analyses into a sequence of interconnected diagrams. Once the visual depiction of the data analysis sequence (what we call a pipeline) is completed, the scientist can have the depicted tasks executed automatically following the steps described in the analysis pipeline. The scientist can monitor the progress of the task execution as well as control the flow of the task execution.

We demonstrate how such a data-flow editing environment can address many computational issues that scientists are currently facing in conducting micro-array data analysis. We show how easily scientists can modify an existing analysis pipeline by picking and choosing different analysis programs to include in the analysis pipeline “on the fly”. We illustrate how powerful a system it could be when database querying and data analysis are seamlessly integrated into one cohesive graphical editing environment. Finally, we point out that the proposed software environment supports flexible distribution of computational resources. Scientists should be able to use analysis programs and data sources regardless of who maintains them or where they are located as long as the scientists have privileges to use them.


112. Correspondence Mapping Algorithms

 Lidia Cassier1, Robert Lucito1, Vivek Mittal1, Joseph West1, Michael Wigler1, William Casey2, and Bud Mishra2

1Cold Spring Harbor Laboratory
2New York University Courant Bioinformatics Group

west@cshl.org

In principle, microarrays are an ideal tool for mapping genomes because a large amount of information can be gathered in parallel from a single hybridization. In this abstract, we explain the algorithmic and related mathematical tools needed to utilize microarray hybridization data from BAC pools to establish correspondence maps, the assignment of probes to individual BACs. The applications of correspondence mapping include

  • finding transcribed regions of the genome,
  • filling gaps in assembled genomic sequence,
  • integrating the annotations of two probe sets,
  • finding orthologs of human genes in other mammals, and
  • combining linear and genetic maps into integrated genomic maps.

Our central strategy involves a series of hybridizations to carefully constructed pools of BACs. In brief, we issue a unique binary code to each BAC in a collection. We then construct BAC pools, making binary partitions according to the binary codes. After a series of array hybridizations with the BAC pools, each arrayed probe has an associated hybridization pattern. We convert the pattern of each probe to a numerical code, and then match the probe code to the closest BAC code, using a Hamming distance. From the statistical distribution of the Hamming distances, we can determine with confidence which probes have been correctly mapped to a homologous BAC.
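The coding scheme can be sketched in a few lines of Python: assign each BAC a binary address, record the probe's 0/1 outcome across the pooled hybridizations, and decode by nearest Hamming distance. The address widths and the deterministic padding bits are illustrative simplifications.

    def bac_codes(num_bacs, extra_bits=0):
        """Assign each BAC a binary address (its index in binary); extra_bits
        models the redundant distinguishing bits, shown here as zeros for
        determinism although in practice they would be random."""
        width = max(1, (num_bacs - 1).bit_length())
        return {b: format(b, "0%db" % width) + "0" * extra_bits
                for b in range(num_bacs)}

    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))

    def assign_probe(probe_pattern, codes, max_dist=0):
        """probe_pattern: the probe's observed 0/1 string across hybridizations.
        Return the BAC whose address is nearest in Hamming distance, or None
        if even the best match exceeds max_dist (ambiguous assignment)."""
        best_bac, best_dist = min(((bac, hamming(probe_pattern, code))
                                   for bac, code in codes.items()),
                                  key=lambda item: item[1])
        return best_bac if best_dist <= max_dist else None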

Error correcting algorithms.
We hardly expect unambiguous data for all probe hybridizations, and even one ambiguous digit in a non-redundant code is enough to render probe assignment infeasible. The solution is to make extra poolings, by assigning extra (i.e., “redundant distinguishing”) random bits of 0 or 1 to each BAC, and carry out additional partitions and hybridizations of the resulting pools. In order to distinguish one BAC from another, we will need hybridization experiments to yield distinguishable results (e.g., color) on those bits where the BAC addresses differ. As the signal-to-noise for hybridization deteriorates, increasing the number of distinguishing bits improves our ability to tell two BACs apart.

For each outcome for each probe, the ratio between channels is mapped to the real interval from 1.0 (red) to 0.0 (green), using reasonable linear projections with thresholds, creating a probe vector in N-space, where N is the total number of hybridizations. We can then associate the closest BAC vector to each probe vector, using a Hamming metric. This Hamming metric is an unbiased statistical estimator of the true Hamming distance, with a variance that depends on the noise inherent to hybridization. Thus, as long as the variance in the Hamming metric is reasonably small, the closest BAC associated with a probe under the Hamming metric is almost surely also the closest BAC under the Hamming distance. In particular, we infer that if a probe belongs to exactly one BAC, then this is the only BAC with a Hamming distance of zero and hence the association between these probes and the BACs is correct with very high probability.

We have simulated probe-BAC pool hybridizations using parameters derived from the experiments. Using the error-correcting algorithm described above, the results are clear and gratifying. With the noise levels and signal-to-background ratios consistent with our experimental data, and using 24 partitions/hybridizations to encode 4096 BACs, we obtain about a 92% yield of all possible correct assignments and make on the order of 2% incorrect assignments.


113. Haplotyping with Phased RFLPs: Algorithms and Mathematical Models

Will Casey1, Thomas Anantharaman2, and Bud Mishra1

1New York University Courant Bioinformatics Group
2Biostatistics and Medical Informatics Department, University of Wisconsin

wcasey@cims.nyu.edu

The polymorphisms due to restriction fragment length variations in the genomes of a human population have been studied intensively in the past. In parallel, the development of novel single molecule approaches has made it possible to construct high-resolution multi-enzyme ordered restriction genome-wide maps. In particular, a powerful method, called “optical mapping,” provides the possibility of making high-coverage accurate genome-wide maps of a population relatively quickly and inexpensively. Furthermore, as each polymorphic site (e.g., modeled by statistical variations in the location of a restriction site) is “covered” by a large number of molecules from the two copies of the chromosomes, and as neighboring sites are likely to be covered by a large fraction of these molecules, it is possible to detect each restriction fragment length polymorphism (RFLP) marker accurately using a Maximum Likelihood Estimator (MLE) algorithm or an Expectation Maximization (EM) algorithm and then “phase” these RFLP markers using sophisticated statistical algorithms. While it is only speculative as to how densely these RFLP markers are distributed and how long a typical “haplotype block” would be, the algorithms we develop have applications to other areas of genomics (e.g., placing probes along the genomes to study copy number fluctuations in cancer genomes, RH mapping, phasing single nucleotide polymorphisms (SNPs), etc.).

Our algorithm works in two phases:

  • EM algorithm: to estimate the statistics of a bimodal distribution generated by two different restriction fragment lengths (see the sketch after this list). The underlying model accurately incorporates errors due to sizing measurements and partial digestion, etc. Our initial experiments show that the EM algorithm performs efficiently and robustly.
  • Phase Detection by Optimization of an “Approximate Likelihood Function”: to “contig” consecutive RFLP markers with a “phase” assigned to each RFLP in the contig. We develop a greedy strategy (based on local weighted averaging) that creates these contigs with relative phases assigned. Furthermore, we assign a “p-value” to each phased RFLP marker.
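A bare-bones EM sketch in Python for a two-component Gaussian mixture of fragment sizes; the real model also incorporates sizing error and partial digestion, which are omitted here.

    import numpy as np

    def em_two_gaussians(sizes, n_iter=200):
        """Fit a two-component Gaussian mixture to observed fragment sizes by EM.
        Returns mixture weights, means, and standard deviations."""
        x = np.asarray(sizes, dtype=float)
        mu = np.percentile(x, [25, 75])           # crude initialization
        sigma = np.full(2, x.std() + 1e-9)
        pi = np.array([0.5, 0.5])
        for _ in range(n_iter):
            # E-step: responsibility of each component for each fragment.
            dens = np.vstack([pi[k] / (sigma[k] * np.sqrt(2 * np.pi)) *
                              np.exp(-0.5 * ((x - mu[k]) / sigma[k]) ** 2)
                              for k in range(2)])
            resp = dens / dens.sum(axis=0)
            # M-step: update weights, means, and standard deviations.
            nk = resp.sum(axis=1)
            pi = nk / len(x)
            mu = (resp * x).sum(axis=1) / nk
            sigma = np.sqrt((resp * (x - mu[:, None]) ** 2).sum(axis=1) / nk) + 1e-9
        return pi, mu, sigma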

We have also developed a PostScript-based visualizer that allows one to view the contigs globally and understand the structure of the haplotype blocks. We have experimented with a large amount of simulated data, and the results so far have been gratifying and consistent with our theory. We plan to explore the performance of our algorithms (false-positive and false-negative RFLP markers and the size distributions of the haplotype blocks) as the optical mapping data deteriorates (e.g., decreased digestion rate, less dense marker distributions, lower coverage by genomic DNA, etc.).


114. Information Management Infrastructure for the Systematic Annotation of Vertebrate Genomes

V. Babenko1, B. Brunk1, J. Crabtree1, S. Diskin1, Y. Kondrahkin1, J. Mazzarelli1, S. McWeeney1, D. Pinney1, A. Pizzaro1, J. Schug1, V. Bogdanova2, A. Katohkin2, V. Nadezhda2, E. Semjonova2, V. Trifonoff2, N. Kolchanov2, M. Bucan3, and C. Stoeckert1

1Center for Bioinformatics, University of Pennsylvania, Philadelphia, PA
2Institute for Cytology and Genetics, Novosibirsk, Russia
3Department of Psychiatry, University of Pennsylvania, Philadelphia, PA

stoeckrt@pcbi.upenn.edu

AllGenes (http://www.allgenes.org) is a human and mouse view of the GUS (Genomics Unified Schema) relational database and includes a gene index generated by assembly of publicly available EST and mRNA sequences. The assemblies integrate annotation from cDNA libraries, RH mapping data, and gene trap insertions. Automated annotation has been applied to characterize these sequences and relate them, along with their predicted protein sequences, to conceptual genes. As of September 2001, the gene index contains over 3 million human and nearly 2 million mouse ESTs and mRNAs that have clustered into 150,006 human and 74,024 mouse “genes” (a new build of the index is underway). Approximately half the human and mouse genes have similarity to a known protein sequence, and of these, we have been able to predict a Gene Ontology (GO) molecular function for 31% of the human and 45% of the mouse genes. Manual annotation is used to better structure the data (e.g., assign libraries to an anatomy ontology), confirm automated annotation (e.g., check GO assignments), and add new information (e.g., assign gene symbols and synonyms). Nearly 2000 human and mouse assemblies have been manually reviewed as of October 2001, and this number is expected to greatly increase. Incorporation of genomic sequences will provide better assessment of the assemblies as genes and guide discovery of new genes and alternative transcript forms. The UCSC Golden Path contigs are being used in this context, with a focus on chromosome 22 for algorithmic and manual analysis. A related site on mouse chromosome 5 (http://www.cbil.upenn.edu/mouse/chromosome5/) integrates the assemblies with existing genomic resources (e.g., BAC fingerprint, RH, and genetic maps) to facilitate functional analyses. The source and ownership of all data, the algorithms run on it, and the evidence for assertions such as GO function predictions are stored in GUS, allowing users to assess the validity of the data. The GUS schema is organized around the central dogma of biology (genes are transcribed to RNA, which are translated to proteins), enabling a powerful query web interface. Users can build queries using Boolean functions to identify data sets for browsing and further analysis. The sequences, their contained accessions, predicted protein translations, and predicted GO functions can be downloaded at the AllGenes site.


115. Manual Annotation of the Human and Mouse Gene Index: www.allgenes.org

Brian Brunk1, Jonathan Crabtree1, Sharon Diskin1, Joan Mazzarelli1, Chris Stoeckert1, Nico Zigouras1, Vera Bogdanova2, Alexey Katohkin2, Nikolay Kolchanov2, Vorbjeva Nadezhda2, Elena Semjonova2, and Vladimir Trifonoff

1Computational Biology and Informatics Laboratory, Center for Bioinformatics, University of Pennsylvania, Philadelphia, PA 19104
2Institute of Cytology and Genetics SB RAS, Novosibirsk, Russia

mazz@pcbi.upenn.edu

Allgenes.org is a web interface providing access to the assembled EST and mRNA sequences, or DoTS RNA transcripts, contained within GUS (Genomics Unified Schema), a relational database. The DoTS transcripts integrate annotation from cDNA libraries (tissue source) and RH mapping data also stored in GUS. Automated annotation has been applied to the DoTS transcripts to determine their predicted gene ownership, protein sequences, and GO Functions. Manual annotation efforts have focused on validating the automated annotation and adding additional gene information. Manual annotation of the gene index utilizes an annotation tool, the GUS annotator interface, which directly updates the GUS database. Functional features of the interface that allow defined annotation tasks to be performed by the annotator include: determination of transcript gene membership using BLAST similarities and transcript alignments to genomic sequence, assignment of approved (HUGO or MGI) gene symbols and gene synonyms, and confirmation/addition of protein GO Function assignments. Evidence for the automated annotation is stored in GUS and provided to the annotator to assist in the validation of the assignments. Evidence is also manually added by the annotator for each assignment and stored in GUS. The human DoTS transcripts have been aligned to the UCSC Golden Path contigs, allowing for the identification of new genes, alternative transcript forms, and annotation of the genome. Manual annotation efforts have focused initially on the genes contained within the region of chromosome 22 whose deletion causes DiGeorge syndrome, a developmental disorder.


116. The Comprehensive Microbial Resource

Owen White, Lowell Umayam, Tanja Dickinson, and Jeremy Peterson

The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, MD 20850

owhite@tigr.org

One of the challenges presented by large-scale genome sequencing efforts is the effective display of information in a format that is accessible to the laboratory scientist. The Comprehensive Microbial Resource (CMR) contains all of the fully sequenced microbial genomes, the curation from the original sequencing centers, and further curation from TIGR (for those genomes sequenced outside TIGR). The interface to this database effectively “slices” the vast amounts of data in the sequencing databases in a wide variety of ways, allowing the user to formulate queries that search for specific genes as well as to investigate broader topics, such as genes that might serve as vaccine and drug targets. The web presentation of the CMR includes the comprehensive collection of bacterial genome sequences, curated information, and related informatics methodologies. The scientist can view genes within a genome and can also link across to related genes in other genomes. The effect is that users can construct queries that combine sequence searches, isoelectric point, GC content, GC skew, functional role assignments, growth conditions, environment, and other criteria, and so isolate the genes of interest. The database contains extensive curated data as well as pre-run homology searches to facilitate data mining. The interface allows the display of the results in numerous formats that will help the user ask more accurate questions. The methodology for populating the database, the user interface, and new methods for automated analysis will be presented.
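GC content and GC skew, two of the query criteria mentioned above, are simple per-gene statistics. The Python sketch below shows how such a filter might be expressed over gene records; the record format and cutoff are hypothetical and do not reflect the CMR query interface itself.

    def gc_content(seq):
        """Fraction of G+C bases in a sequence."""
        seq = seq.upper()
        return (seq.count("G") + seq.count("C")) / max(1, len(seq))

    def gc_skew(seq):
        """(G - C) / (G + C); the sign often differs between leading and lagging strands."""
        seq = seq.upper()
        g, c = seq.count("G"), seq.count("C")
        return (g - c) / max(1, g + c)

    def select_genes(genes, min_gc=0.55):
        """genes: hypothetical {locus_tag: sequence} mapping.  Return loci whose
        GC content exceeds the cutoff, mimicking one clause of a CMR-style query."""
        return [tag for tag, seq in genes.items() if gc_content(seq) >= min_gc]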


208. In Silico Discovery: Challenges in Integration and Knowledge Extraction

Su Chung

UC San Diego Supercomputer Center and geneticXchange, Inc.

suchung@sdsc.edu

In the era of genome-enabled medicine, high-throughput technologies have generated an unprecedented volume and diversity of data. The fundamental challenge is how to transform these data into useful information, and ultimately into tangible knowledge for research and discovery. Databases and computational analysis tools in the life sciences are diverse, complex, heterogeneous, and lacking in standards. The key issue is how to efficiently integrate information derived from multiple data sources and computational tools for specific inquiries. We present a workflow strategy for in silico discovery.
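
One way to picture such a workflow is as a pipeline that passes identifiers from one resource to the next and merges the results; in the minimal sketch below, the query functions are stubs standing in for wrappers around real, heterogeneous sources and are entirely hypothetical.

    # Minimal sketch of an integration workflow: each step queries a different
    # (hypothetical) resource and results are merged on a shared gene identifier.

    def query_sequence_db(keyword):
        """Stub: return gene identifiers matching a keyword in a sequence database."""
        return ["geneA", "geneB"]

    def query_expression_db(gene_ids):
        """Stub: return expression values for the given genes from a second source."""
        return {g: 2.5 for g in gene_ids}

    def query_pathway_db(gene_ids):
        """Stub: return pathway memberships for the given genes from a third source."""
        return {g: ["MAPK signaling"] for g in gene_ids}

    def discovery_workflow(keyword):
        """Chain the sources and merge per-gene records into one table."""
        genes = query_sequence_db(keyword)
        expression = query_expression_db(genes)
        pathways = query_pathway_db(genes)
        return [{"gene": g,
                 "expression": expression.get(g),
                 "pathways": pathways.get(g, [])}
                for g in genes]

    for row in discovery_workflow("kinase"):
        print(row)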


210. Bioinformatics for Genome Analysis

Claudia I. Reich and Gary J. Olsen

Department of Microbiology University of Illinois at Urbana-Champaign

abest@uiuc.edu

There will soon be over 100 publicly available complete genome sequences, and there is every expectation that the number and diversity of genome sequences will continue to increase for decades. Handling this volume of data requires rapid and accurate automation of many tasks, including the identification of protein-coding sequences within DNA, the extraction and translation of those sequences, the assignment of functions to whatever extent possible (without overstating functional precision), and the provision of both machine-readable and human interfaces to the large volumes of data. Presently, nearly all functional assignments are based on identifying similar sequences that are already in the databases and have already been assigned a function. It is then assumed that similar sequences (which share common ancestry) will share similar functions. We propose to provide improved tools by better integrating the tools and concepts of molecular phylogenetics with those of molecular biology. We will do this by more tightly coupling presumed evolutionary relationships of proteins (phylogenetic trees) with the generation of multiple sequence alignments. In addition, the methods will emphasize the ability to add data from new genomes without recomputing the entire alignment and tree, and without sacrificing alignment features resulting from manual editing by human experts or from other sources such as 3-D structures. Other areas that will receive effort include protein-coding region identification and graphical display of uncertainty in molecular phylogenies (and hence in protein function assignments).
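
As a toy illustration of adding a sequence without recomputing an existing alignment (not the methods proposed in this abstract), the sketch below aligns the new sequence to the alignment's consensus and introduces only new gap columns, so the original, possibly hand-edited, columns are left untouched. The scoring values and consensus rule are arbitrary choices made for the example.

    # Toy sketch: extend an existing multiple alignment with a new sequence by
    # aligning it to the consensus and inserting only new gap columns, so the
    # original (possibly hand-edited) columns are preserved.

    from collections import Counter

    def consensus(alignment):
        """Majority-rule consensus of an aligned block, ignoring gaps where possible."""
        length = len(next(iter(alignment.values())))
        cols = []
        for i in range(length):
            counts = Counter(row[i] for row in alignment.values() if row[i] != "-")
            cols.append(counts.most_common(1)[0][0] if counts else "-")
        return "".join(cols)

    def global_align(a, b, match=1, mismatch=-1, gap=-2):
        """Plain Needleman-Wunsch global alignment; returns the two aligned strings."""
        n, m = len(a), len(b)
        score = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            score[i][0] = i * gap
        for j in range(1, m + 1):
            score[0][j] = j * gap
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
                score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
        out_a, out_b, i, j = [], [], n, m
        while i > 0 or j > 0:
            if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + \
                    (match if a[i - 1] == b[j - 1] else mismatch):
                out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
            elif i > 0 and score[i][j] == score[i - 1][j] + gap:
                out_a.append(a[i - 1]); out_b.append("-"); i -= 1
            else:
                out_a.append("-"); out_b.append(b[j - 1]); j -= 1
        return "".join(reversed(out_a)), "".join(reversed(out_b))

    def add_sequence(alignment, name, new_seq):
        """Add new_seq to the alignment without re-aligning the existing rows."""
        cons_aln, new_aln = global_align(consensus(alignment), new_seq)
        extended = {k: [] for k in alignment}
        new_row, col = [], 0
        for c_char, n_char in zip(cons_aln, new_aln):
            if c_char == "-":                 # insertion: add an all-gap column
                for k in alignment:
                    extended[k].append("-")
            else:                             # existing column, copied unchanged
                for k in alignment:
                    extended[k].append(alignment[k][col])
                col += 1
            new_row.append(n_char)
        result = {k: "".join(v) for k, v in extended.items()}
        result[name] = "".join(new_row)
        return result

    aln = {"seq1": "ACG-T", "seq2": "ACGAT"}
    print(add_sequence(aln, "seq3", "ACGGAT"))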


220. Understanding Protein Interactions

Xiaoqun Joyce Duan, Ioannis Xenarios, Lukasz Salwinski, Charlotte Dean, and David Eisenberg

UCLA-DOE Laboratory of Structural Biology & Molecular Medicine, University of California Los Angeles, P. O. Box 951570, Los Angeles, CA 90095-1570

joyce@mbi.ucla.edu

Networks of protein interactions control the lives of cells. Our lab uses bioinformatic approaches to study protein interactions and their relationship with protein function.

We have begun cataloguing information on protein interactions, gleaned from the scientific literature, within the Database of Interacting Proteins (DIP, http://dip.doe-mbi.ucla.edu). DIP is designed to capture the layered information on protein interactions. Protein interactions are first of all physical interactions; biological protein interactions differ from this more general set in their prerequisite for specific protein states and the resultant transitions in the states of one or both interacting proteins. DIP contains information related to physical interactions, while LiveDIP collects data on biological interactions, described in terms of protein states and state transitions. This data scheme provides a more complete picture of protein interactions inside cells.
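
A minimal sketch of this layered data scheme is given below; the field names are hypothetical and do not reflect the actual DIP/LiveDIP schema.

    # Sketch of the layered data model: a physical interaction records the pair
    # and the experimental evidence, while a biological interaction adds the
    # prerequisite protein states and the state transitions that result.
    # Field names are hypothetical.

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class PhysicalInteraction:
        protein_a: str
        protein_b: str
        interacting_regions: Optional[str] = None     # e.g. residue ranges, if known
        binding_affinity: Optional[float] = None      # e.g. a Kd value, if measured
        methods: List[str] = field(default_factory=list)   # experimental methods

    @dataclass
    class StateTransition:
        protein: str
        state_before: str     # e.g. "unphosphorylated"
        state_after: str      # e.g. "phosphorylated"

    @dataclass
    class BiologicalInteraction(PhysicalInteraction):
        required_states: List[str] = field(default_factory=list)   # prerequisite states
        transitions: List[StateTransition] = field(default_factory=list)

    # Example: a kinase-substrate interaction that requires the kinase to be
    # active and results in phosphorylation of the substrate (values illustrative).
    example = BiologicalInteraction(
        protein_a="STE7", protein_b="FUS3",
        methods=["yeast two-hybrid"],
        required_states=["STE7:active"],
        transitions=[StateTransition("FUS3", "unphosphorylated", "phosphorylated")],
    )
    print(example.transitions[0].state_after)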

A computational method has been developed to estimate the biologically relevant fraction of protein interactions detected in large-scale screens. It does so by comparing the mRNA expression profiles of the putatively interacting proteins with those of known interacting and non-interacting pairs. A second method evaluates the reliability of a given interaction based on the presence of paralogous interactions. We have also developed advanced search tools to assemble pathways from the knowledge of protein interactions collated in LiveDIP. Another tool, JDIP2D, provides customized graph rendering and data manipulation.
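
The flavor of the expression-based assessment can be conveyed by a small sketch, which illustrates the idea rather than the published method: putative pairs are scored by the correlation of their expression profiles and compared against the correlations seen for a reference set of non-interacting pairs. The gene names, expression values, and cutoff rule below are made up for the example.

    # Illustration only: score putative interacting pairs by the Pearson
    # correlation of their mRNA expression profiles and count how many
    # correlate more strongly than any known non-interacting pair.

    from statistics import mean

    def pearson(x, y):
        mx, my = mean(x), mean(y)
        num = sum((a - mx) * (b - my) for a, b in zip(x, y))
        den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
        return num / den if den else 0.0

    def supported_fraction(putative_pairs, noninteracting_pairs, expression):
        """Fraction of putative pairs whose expression profiles correlate more
        strongly than any pair in the non-interacting reference set."""
        cutoff = max(pearson(expression[a], expression[b])
                     for a, b in noninteracting_pairs)
        supported = [(a, b) for a, b in putative_pairs
                     if pearson(expression[a], expression[b]) > cutoff]
        return len(supported) / len(putative_pairs)

    # Toy expression profiles across four conditions (values are invented).
    expression = {
        "STE2": [1.0, 2.0, 3.0, 4.0], "STE4": [1.1, 2.1, 2.9, 4.2],
        "FUS3": [4.0, 3.0, 2.0, 1.0], "CDC28": [2.0, 2.0, 2.1, 1.9],
    }
    putative = [("STE2", "STE4"), ("STE2", "FUS3")]
    background = [("STE2", "CDC28"), ("FUS3", "CDC28")]
    print(supported_fraction(putative, background, expression))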

The data and tools associated with DIP offer insight into protein interactions. Analysis of the interactions in DIP indicates that many proteins form a single connected network of interactions accompanied by several smaller networks. Application of the pathway analysis tools to the pheromone response pathway in yeast suggests that the pathway functions in the context of a complex protein interaction network in which both positive and negative regulation modulate signal intensity. Regulatory mechanisms for signaling processes were proposed by integrating gene expression data with the protein interaction network.

Future directions include expanding the database, providing additional tools for integrating other genomic and proteomic data, developing computational methods for predicting protein-protein interactions, and studying cell signaling on the genome scale.

References

  1. Xenarios I, Salwinski L, Duan XJ, Higney P, Kim SM, and Eisenberg D. DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Res. 2002, 30(1): 303-5.
  2. Duan XJ, Xenarios I, and Eisenberg D. Describing Biological Protein Interactions in Terms of Protein States and State Transitions: the LiveDIP Database. Manuscript in submission.
  3. Dean CM, Salwinski L, Xenarios I, and Eisenberg D. Protein Interactions: Two Methods for Assessment of the Reliability of High-throughput Observations. Manuscript in submission.
