Bioinformatics Abstracts

DOE Human Genome Program
Contractor-Grantee Workshop VIII
February 27-March 2, 2000  Santa Fe, NM



56. Software to Support BAC Mapping

Cliff S. Han and Norman A. Doggett

Bioscience Division and DOE Joint Genome Institute, Los Alamos National Laboratory, Los Alamos, NM 87545

chan@telomere.lanl.gov

The LANL production mapping team has a major responsibility for supplying mapped BAC clones to the JGI's Production Sequencing Facility. Our approach involves the use of overgo probes to screen BAC filters and construct probe-content BAC maps. To facilitate this work we have developed and implemented several software programs, described below: 1) Overgo selection program, 2) Automated contig assembly, and 3) Contig graphic draw.

Overgo selection program: This program selects overgos from a sequence, combining repeat screening and secondary-structure screening with a search of the nr GenBank database. Candidate overgo sequences are checked for hairpins, self-dimers, and heterodimers, and screened against RepBase and nr GenBank.
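For illustration only, a minimal Python sketch of the secondary-structure screening step (the threshold and helper names are invented here; this is not the LANL program) might flag candidates whose sequence shares a long run with its own reverse complement:

    # Illustrative sketch of simple self-dimer/hairpin screening of the kind
    # an overgo selection program might apply. The max_stem threshold is
    # hypothetical.

    COMPLEMENT = str.maketrans("ACGT", "TGCA")

    def revcomp(seq):
        return seq.translate(COMPLEMENT)[::-1]

    def longest_common_run(a, b):
        """Length of the longest substring shared by a and b (O(n*m) DP)."""
        best = 0
        prev = [0] * (len(b) + 1)
        for ca in a:
            cur = [0]
            for j, cb in enumerate(b, 1):
                cur.append(prev[j - 1] + 1 if ca == cb else 0)
            best = max(best, max(cur))
            prev = cur
        return best

    def passes_structure_screen(seq, max_stem=8):
        # A long match between seq and its own reverse complement indicates
        # self-dimer (and potential hairpin) structure.
        return longest_common_run(seq, revcomp(seq)) < max_stem

    # A fully self-complementary 40-mer fails the screen:
    print(passes_structure_screen("ACGT" * 10))   # False

A real screen would additionally check candidate pairs for heterodimers and pass survivors to the RepBase and nr GenBank searches.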

Automated contig assembly: This program assembles contigs automatically from the input data. It can accept previously ordered backbone STS information as a constraint.

Contig graphic draw: This program draws a contig graphic map from probe(STS)-content map data presented in map order. The output is a PostScript 3 file, which can be printed by dragging it onto the printer window or translated into Portable Document Format with Acrobat Distiller and viewed with Acrobat Reader. The input for the program is generated with the automated contig assembly program or by a macro that translates Excel data into the program's input format.
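As a rough sketch of how such a map can be rendered (the input format and layout below are hypothetical, not the LANL program's), a few lines of Python suffice to emit PostScript that draws each clone as a bar spanning the probes it contains:

    # Illustrative sketch: emit a PostScript drawing in which each clone in a
    # probe-content map is a horizontal bar spanning the probes it contains.
    # Input tuples (clone_name, first_probe_index, last_probe_index) are an
    # invented format for demonstration.

    clones = [("BAC-001", 0, 4), ("BAC-002", 2, 7), ("BAC-003", 6, 9)]

    with open("contig.ps", "w") as ps:
        ps.write("%!PS-Adobe-3.0\n/Helvetica findfont 8 scalefont setfont\n")
        for row, (name, first, last) in enumerate(clones):
            x0, x1 = 72 + 20 * first, 72 + 20 * last
            y = 720 - 14 * row
            ps.write(f"newpath {x0} {y} moveto {x1} {y} lineto 2 setlinewidth stroke\n")
            ps.write(f"{x1 + 4} {y - 3} moveto ({name}) show\n")
        ps.write("showpage\n")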

Supported by the US DOE, OBER under contract W-7405-ENG-36.


57. Automated Optimization of Expert System for Base-Calling in DNA Sequencing

Arthur W. Miller and Barry L. Karger

Barnett Institute, Northeastern University, 360 Huntington Ave., Boston, MA 02115

miller@ccs.neu.edu

A recurring issue in automated DNA sequencing is that base-calling lags behind improvements to instrumentation and sequencing chemistry. This is because base-callers require retraining, or because the preprocessing of the data prior to base-calling must be changed. We have previously presented an expert system for long-read base-calling, capable of read lengths up to 1300 bases in sequencing by capillary electrophoresis on optimized separation matrices (A. W. Miller and B. L. Karger, DOE Human Genome Program Contractor-Grantee Workshop VII, 1999). The expert system supplies probabilistic confidences on base-calls, with statistics computed for several different types of miscall. Here we present tools for the automated retraining and optimization of this base-caller, including preprocessing and confidences, by nonprogrammers. Training takes into account template effects, low signal, and other factors observed in production sequencing. Results are shown for large amounts of data from both ABI 3700 and MegaBACE 1000 sequencers. In addition to software, other recent developments in long-read sequencing by capillary electrophoresis will also be presented.

This work is being supported by DOE grant DE-FG02-90ER60985.


58. Is Q20 a Sufficient Measure of Quality to Use for DNA Sequencing Process Analysis?

D.C. Bruce, M.D. Jones, J.E. Bryant1, R. Lobb1, J.R. Griffith1, M.O. Mundt, N.A. Doggett, and L.L. Deaven

Bioscience Division and DOE Joint Genome Institute, Los Alamos National Laboratory, Los Alamos, NM 87545 and 1University of New Mexico, Department of Biochemistry and Molecular Biology, Albuquerque, NM 87131

dbruce@lanl.gov

We examined whether Q20 is a sufficient measure of sequence quality to predict the impact of a process change on the quality of data generated in the shotgun phase. Q20 is a threshold metric derived from DNA sequence trace data by the base-calling program Phred, and is commonly used to report the read length of DNA sequence data. This practice follows from the rubric that 1) the number of Q20 base pairs is a necessary and sufficient quality metric and 2) long Q20 reads improve sequence assembly and hence simplify finishing. In addition, Q20 is used to perform cost/benefit analysis ($/Q20 base) bearing on sequencing process changes. As a test bed to model a process change, we analyzed data from sequencing runs performed with three different sequencing gel formulations. In addition to Q20 statistics, we determined the number of correctly aligned bases and the probability of error at each base for the assembled data. We present data showing that Q20 does not fully predict the benefit or harm associated with process alterations.
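For readers unfamiliar with the metric: Phred assigns each base a quality Q = -10 log10(P_error), so Q20 corresponds to an estimated error probability of 1%. A minimal sketch of how a Q20 count (and the expected-error total it ignores) is derived from per-base qualities:

    # Illustrative sketch of the Q20 metric; the example quality values are
    # invented.

    import math

    def phred_quality(p_error):
        """Phred quality: Q = -10 * log10(P_error); Q20 means P_error <= 0.01."""
        return -10.0 * math.log10(p_error)

    def q20_count(qualities):
        """Number of bases called at quality >= 20: the common 'Q20 read length'."""
        return sum(1 for q in qualities if q >= 20)

    def expected_errors(qualities):
        """Expected number of erroneous bases implied by the quality values."""
        return sum(10 ** (-q / 10.0) for q in qualities)

    quals = [35, 30, 22, 19, 40, 8, 25]
    print(q20_count(quals))                     # 5
    print(round(expected_errors(quals), 3))     # dominated by the low-quality calls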


59. Annotation of Draft Genomic Sequence Generated at the JGI

Richard Mural1, Miriam Land1, Frank Larimer1, Morey Parang1, Manesh Shah1, Doug Hyatt1, Ed Uberbacher1, P. Folta2, T. Bobo2, Zhengping Huang2, and T. Slezak2

1Computational Biosciences and Toxicology and Risk Analysis Sections, Life Sciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831 and 2Human Genome Center, Lawrence Livermore National Laboratory, Livermore, CA

The JGI is a major player in the effort to complete a 90% draft of the sequence of the human genome within the next few months. Draft sequence poses special problems for the annotation process; however, it is clear that 3 to 5X coverage of genomic DNA can yield large amounts of biologically meaningful data if the appropriate analysis methods are applied. A number of features useful for further analysis can be located and annotated in draft sequence, including STSs, BAC ends (STCs), and ESTs. These features can be annotated by standard similarity methods given sufficient computational resources. Gene identification programs, particularly those such as Grail-Exp that incorporate similarity data from both ESTs and complete cDNAs, provide another level of analysis of draft data. These analyses allow not only gene identification but also some ordering of the contigs that make up the clone being analyzed. Notably, essentially all of the genes that can be found in finished sequence can be identified in draft sequence at about 3X coverage.

To help add biologically valuable information to the draft sequence being generated at the JGI/PSF, a configurable analysis pipeline has been developed. Draft data produced at the JGI/PSF are analyzed, and the analysis results are parsed into the JGI database. The initial annotation of draft sequence is a catalog of the clone contents (STSs, STCs, gene models predicted by Grail-Exp and Genscan, and BLAST searches of their translations against the NR protein database), provided in tabular form accessible from the JGI web page. Further analysis of this information will help to define relationships among draft clones and will allow ordering within and between clones.

To date we have analyzed over 1500 draft clones from human chromosomes 5, 16 and 19. The results of these analyses can be viewed at: www.jgi.doe.gov/Data/JGI_finished.html.

(Research sponsored by the Office of Biological and Environmental Research, USDOE under contract number DE-AC05-96OR22464 with Lockheed Martin Energy Research Corp.)


60. Information Systems to Support Experimental and Computational Research into Complex Biological Systems and Functional Genomics: Several Pilot Projects

Jay Snoddy1, Denise Schmoyer4, Kathe Fischer5, Gwo-Lin Chen5, Miriam Land3, Sergey Petrov1, Sheryl Martin4, Ed Michaud3, Bob Barry7, Gene Rinchik3, Peter Hoyt3, Mitch Doktycz 2, and E. Uberbacher1

1Computational Biosciences Section, Life Sciences Division; 2Biochemistry and Biophysics Section, Life Sciences Division; 3Mammalian Genetics and Development Section, Life Sciences Division; 4Toxicology and Risk Analysis, Life Sciences Division; 5Computational Physics Division; 6Computer Science and Mathematics Division; and 7Robotics and Process Systems Division; Oak Ridge National Laboratory, P.O. Box 2009, Oak Ridge, TN 37831; 5Department of Biochemistry, Cellular, and Molecular Biology, University of Tennessee, Knoxville, TN 37996

Understanding complex biological systems requires the ability to acquire, manage, and interpret the complex information of biology. First, computerized systems must automate the routine operations needed in large-scale, data-driven research projects. Second, information systems must permit the biologist to analyze data from both data-driven and hypothesis-driven research; this analytical support needs to connect genome-scale data and other data-driven approaches with more focused, smaller-scale hypothesis-driven research into complex biosystems. These analytical connections need to be made, in part, by generating inferences and supporting the decision-making of biologists.

Chip-based technologies for mRNA expression analysis are a large-scale, data-driven approach that can supply experimental information useful in exploring tissue-specific systems and pathways. In addition, advances in genomics, mutagenesis/phenotype screening, and other areas facilitate higher-throughput mouse biology research for insights into complex traits and systems. For example, a pilot project was recently initiated to begin developing information systems that can help ORNL, the Tennessee Mouse Genome Consortium (TMGC), and other collaborators acquire insights into complex biological systems. This will result in a Complex Biosystems Information Warehouse (CBIW), developed in Oracle 8i and closely associated with the Genome Information Warehouse (see the related abstract of Petrov et al.).

Users will enter and retrieve data through application-specific modules. Four related bioinformatics modules that will be supported by this data warehouse are currently planned:

  • Mouse Tracking and Phenotype System (MuTrack)
  • Genosensor Information Management System (GIMS)
  • Gene and Protein Catalog
  • Comparative Genomics Inferencing System (CompariSys)

These systems are all interrelated, will need to share some data, and will need to work together.

The first proposed information module, MuTrack, must acquire information about specific mice, tissues, and especially molecular samples, and track them as they are processed through mouse phenotype screens and other experiments. Once out of the pilot stage, this system must track the distribution of mice, track mouse tissue samples, and catalog observations about the phenotypes of these mice. Part of the problem to be solved for the TMGC is moving mutagenized mice and samples from ORNL to UT Memphis, Vanderbilt University, and other sites, and returning phenotype screen data to a central, shared information system. This system also needs to connect to the GIMS chip expression system, especially in sharing information about mouse RNA samples sent by the mouse biologists to the chip lab for expression analysis. The chip lab must also return some data to be integrated with other observations about specific mice or mouse strains.

An electronic notebook is being developed to provide some of this needed functionality and will be demonstrated at the meeting. The general electronic notebook approach should allow a reasonable compromise between the power of an information system (e.g., the ability to query the data) and the flexibility required for the different kinds of lab data that can be stored.

GIMS, the second information module, will acquire, automate, and interpret data produced by the Genosensor chips and other similar microarrays (see the related abstract of Doktycz et al.). This information system will address one of the major current bottlenecks of this technology--the data handling and, especially, the computational interpretation needed to find patterns in the expression data. We are using commercially available software modules for some of this component--at least initially--but additional operational logic and analytical reasoning will need to be developed to glue these different components together and provide both operational and analytical support.

The Gene and Protein Catalog is a user interface to new data about the structure and system functions of genes and proteins. This interface is being designed and developed so that it can take data from both the Genome Information Warehouse and the Complex Biosystems Information Warehouse. It should provide access to relevant new data discovered by our experimental collaborators, expert-curated information, predicted gene and protein models from genome annotation, and cross-links to the underlying archival data in community databases. A pilot project, for example, is testing the addition of single nucleotide polymorphisms (SNPs) that are in or next to GenScan- and Grail-EXP-predicted genes.

The last information module, CompariSys, is proposed to follow once the other systems are further developed. This system should help classify and cross-link homologous genes and proteins, assisting the user in extrapolating from genes and systems in the mouse, for example, to genes and systems in the human. It will use existing methods of sequence similarity, conservation of synteny, and protein classification, and possibly other developing methods such as large-scale phylogenetic gene tree generation, to help navigate and create links among the gene and system data found in MuTrack, the Gene and Protein Catalog, and GIMS. This should allow a user or another computer system to move automatically from data about one gene to data about homologous genes, proteins, and systems. Such a comparative approach is critical to understanding and navigating the biological data about genes, proteins, and the pathways or systems that involve them.

(Research sponsored by the Office of Biological and Environmental Research, USDOE under contract number DE-AC05-96OR22464 with Lockheed Martin Energy Research Corp.)


61. Navigation, Visualization, and Query of Genomes: The Genome Channel and Beyond

Morey Parang, Miriam Land, Denise Schmoyer, Jay Snoddy, Doug Hyatt, Richard Mural, and Ed Uberbacher

Computational Biosciences and Toxicology and Risk Analysis Sections, Life Sciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831

http://compbio.ornl.gov/

The Genome Channel Browser is a Java-based viewer capable of representing a wide variety of genomic-sequence annotations, with links to a large number of related information and data resources. It relies on a number of underlying data resources, analysis tools, and data-retrieval agents to provide an up-to-date view of genomic sequences as well as computational and experimental annotation. The current version of the Genome Channel Browser (v2.0) provides a diverse set of functional features. New in this version are additional feature types such as tRNAs and BAC ends, additional organisms including microbes, genetic and radiation hybrid maps, extended and detailed feature listings and generation of summary reports, text-based searches and queries of the underlying data, BLAST searches against individual or combined assembled sequences and their products, and pattern searches against genomes that return the genome location and context of related sequences.

In addition to Java-based browsing, the Genome Catalog, an HTML-based interface to the Genome Channel, is under development. Genome, chromosome, contig, and clone summary reports, gene and protein lists, homologies, and other features are available for browsing and querying through this interface. We are researching the feasibility of providing interfaces to additional types of analysis results, such as protein threading and structural classification, that might provide clues to the functions of predicted genes. Other features being studied for future implementation and visualization include gene expression data, polymorphisms, and mutations.

(Research sponsored by the Office of Biological and Environmental Research, USDOE under contract number DE-AC05-96OR22464 with Lockheed Martin Energy Research Corp.)


62. Continuation of the Genome Database

Christopher J. Porter, C. Conover Talbot Jr., Jay Snoddy, Ed Uberbacher, and A. Jamie Cuticchia

The Johns Hopkins School of Medicine, Baltimore, MD

cporter@gdb.org

Shortly after the last DOE Contractor and Grantee Workshop, in November 1997, DOE announced the termination of funding for the Genome Database effective June 30, 1998. Consequently, this has been a year of transition for GDB.

In the months following the announcement, work continued on version 6.4 of GDB, which was released in March. GDB 6.4 introduced a simplified query form for regional queries, enhancements to the display of integrated map information, and multiple modifications to improve the manner in which results are displayed and increase the speed with which they are returned. A new version of Mapview features a more intuitive user interface and allows markers to be selectively hidden. Plans to overhaul the handling of polymorphisms were withdrawn, but display of allele size and frequency information was integrated.

Despite the project's announced termination, we received four requests and set up three new international nodes in Taiwan, Belgium and Canada. At their annual meeting, representatives from the GDB nodes offered their support for continuation of the project. Previously, at their meeting at HGM'98 in Turin, members of HUGO's HGMC expressed a strong desire that the GDB project continue.

Subsequent to meetings with representatives from NCBI, OMIM, and the HUGO Nomenclature Committee to explore the disposition of essential GDB activities, it became evident that the database could not be brought to a proper close in the six months allotted. Consequently, we received a six-month extension of the database shutdown and made plans to migrate the database to Oak Ridge National Laboratory, where it would be maintained as a static resource.

In late October, however, staff at the new GDB node in Toronto located a potential source of private funds for GDB continuation. Diligent work over the following two months has resulted in a rescue program for the database. The primary, editable copy of GDB will move to the Bioinformatics Centre at the Hospital for Sick Children in Toronto, whence it will be replicated to all international nodes. Editing and curation of the database will continue, and we will strengthen our relationship with HUGO. We are working with HUGO to supplement the 'classic' HUGO editors, who oversee genes and mapping, with editors from the sequencing community who will help to integrate physical maps and, ultimately, sequence information.

Plans are being made with the Sequence Annotation Consortium at ORNL to integrate GDB data into the Sequence Annotation project. Possibilities for other collaborations are also being investigated as resources allow. ORNL has received a copy of GDB, which will now be updated, to serve as the primary U.S. node of the database.


63. Reconstruction and Annotation of Transcribed Sequences: The TIGR Gene Indices

John Quackenbush, Ingeborg Holt, Feng Liang, Geo Pertea, Jonathan Upton, and Thomas S. Hansen

The Institute for Genomic Research, Rockville, MD 20850

johnq@tigr.org

A goal of the Human Genome Project is identification of the complete set of human genes and the role played by these genes in development and disease. The sequencing of Expressed Sequence Tags (ESTs) has provided a first glimpse of the collection of transcribed sequences in humans and other organisms, but significant additional information can be obtained by a thorough analysis of the EST data. TIGR's analysis of the world's collection of EST sequence data, captured in our Gene Indices, provides assembled consensus sequences that are of high confidence and represent our best estimate of the collection of transcribed sequences underlying the ESTs. In addition to the Human Gene Index (HGI), we maintain Gene Indices for a variety of other species, including mouse, rat, Drosophila, zebrafish, rice, tomato, and Arabidopsis. Collectively, the Gene Indices represent a unique resource for the comparative analysis of mammalian genes and may provide insight into gene function, regulation, and evolution.

We have recently expanded the TIGR Gene Index project to include quarterly releases, expanded annotation, integration with mapping and genomic sequence data, and more robust search capabilities. In addition, we are developing a database of mammalian orthologues based on comparison of the human, mouse, and rat TC sequences and a web-based presentation to allow the data to be effectively explored. This database will provide direct links between the human, mouse, and rat assemblies and represent the most extensive catalog of eukaryotic orthologues available, providing a valuable resource for gene identification, elucidation of functional domains, and analysis of gene and genome evolution.


64. An Informatics Framework for Transcriptome Annotation

Brian Brunk1, Jonathan Crabtree1, Mark Gibson1, Chris Overton1, Debra Pinney1, Jonathan Schug1, Chris Stoeckert1, Jian Wang1, Ihor Lemischka2, Kateri Moore2, and Robert Phillips2

1Center for Bioinformatics, University of Pennsylvania, Philadelphia, PA 19104-6021 and 2Princeton University, Princeton, NJ

coverton@pcbi.upenn.edu

It is now feasible to define the transcriptional state of a eukaryotic cell with reasonable precision by combining multiple gene expression technologies, e.g., EST analysis with microarrays. However, few of the 10,000 - 20,000 different transcripts expressed in a cell are well characterized in terms of function and cell role. In a collaborative effort, we have begun the identification and characterization of the transcripts produced in the mouse hematopoietic stem cell. The Princeton group has enriched for the stem cell from fetal mouse liver by sorting for cells positive for the markers AA4.1, Sca-1, and c-Kit and low in Lin. A normalized, subtracted (against a stromal cell cDNA library) cDNA library was generated from these cells. A similar strategy was adopted in the construction of a stromal cell library. ESTs were generated from both libraries and analyzed through an automated computational annotation pipeline followed by expert manual annotation. Currently, approximately 4000 stem cell and 3000 stromal cell ESTs have been carefully annotated, leading to a well-defined "molecular phenotype" of each cell type and opening the way for follow-up analyses of novel genes of interest. Based on this prototype annotation process, we have developed an integrated informatics framework for the systematic annotation of cell-specific transcriptomes. The system combines data management and visualization facilities with automated and manual data analysis components accessible through a Java servlet-based architecture. Using the K2 technology for accessing distributed databases, it integrates computationally annotated mouse and human genomes (the GAIA system), computationally annotated mouse and human transcriptomes built from dbEST ESTs and known mRNAs (DOTS), and protein sequences in SwissProt. The K2 facility also provides access to a number of other remote databases and analysis services. Computational annotation steps include: clustering and assembly of ESTs/mRNAs to form consensus transcribed sequences (TSs); gene finding by similarity to TSs; similarity of TSs across species and to proteins; and assignment of cell roles/functions to TSs using computational and manual analyses. Manual annotation steps include: assessment of the quality of consensus sequences to identify artifacts; refinement of cell role/function assignments; and characterization of alternative splicing. Results of the characterization of the stem and stromal cell molecular phenotypes will be presented.


65. Protein Domain Dissection and Functional Identification

Temple F. Smith, Sophia Zarakhovich, and Hongxian He

BioMolecular Engineering Research Center, College of Engineering, Boston University, 36 Cummington Street, Boston, MA 02215

tsmith@darwin.bu.edu

Using various multialignment and conserved-pattern tools (e.g., PSI-BLAST, BLOCKS, Pfam, and pimaII), protein domains can generally be identified as "evolutionary modules." Using a set of 20 completely sequenced microbial genomes (including yeast), we have generated over 1300 profiles representing diagnostic sequence domains. The majority either cover the entire length of the proteins matching the profile or identify a sequence region clearly identifiable in multiple distinct domain contexts. The relationship between such sequence domains and structural domains will be discussed with examples. The problems involved in associating these domains with a given biochemical function and/or the cellular role played by that function will also be addressed.


66. Finding Remote Protein Homologs

Kevin Karplus

University of California, Santa Cruz, Baskin School of Engineering, Santa Cruz, CA 95064

karplus@cse.ucsc.edu

Since Spring 1996, the bioinformatics group at UCSC has been working on ways to find and align homologs of proteins, even when the sequences of the proteins are quite diverged. Our main approach has been to use hidden Markov models with Dirichlet mixture regularizers for both the search and the alignment. The method uses only sequence information, not structural information, and so can be applied even to proteins whose structure is still unknown.
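As a rough illustration of the regularizer (a toy sketch over a four-letter alphabet with invented mixture components, not the SAM implementation), the estimated emission probabilities mix each Dirichlet component's posterior mean, weighted by that component's posterior responsibility for the observed counts:

    # Sketch of Dirichlet-mixture regularization: given observed residue
    # counts in an alignment column, mix the posterior-mean estimates of the
    # Dirichlet components, weighted by posterior responsibility. Real
    # mixtures have many components over the 20 amino acids; the two
    # components below are invented for illustration.

    import math

    def log_dirichlet_multinomial(counts, alpha):
        """Log marginal likelihood of counts under Dirichlet(alpha)
        (multinomial coefficient omitted: it cancels across components)."""
        n, a = sum(counts), sum(alpha)
        out = math.lgamma(a) - math.lgamma(n + a)
        for c, al in zip(counts, alpha):
            out += math.lgamma(c + al) - math.lgamma(al)
        return out

    def regularized_probs(counts, weights, alphas):
        # Posterior responsibility of each mixture component for the counts.
        logs = [math.log(w) + log_dirichlet_multinomial(counts, a)
                for w, a in zip(weights, alphas)]
        mx = max(logs)
        resp = [math.exp(lg - mx) for lg in logs]
        z = sum(resp)
        resp = [r / z for r in resp]
        # Mix the posterior-mean estimates of the components.
        n = sum(counts)
        probs = [0.0] * len(counts)
        for r, alpha in zip(resp, alphas):
            a = sum(alpha)
            for i, (c, al) in enumerate(zip(counts, alpha)):
                probs[i] += r * (c + al) / (n + a)
        return probs

    # Two invented components: near-uniform, and favoring the first two letters.
    weights = [0.5, 0.5]
    alphas = [[1.0, 1.0, 1.0, 1.0], [4.0, 4.0, 0.5, 0.5]]
    print(regularized_probs([6, 3, 0, 1], weights, alphas))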

The main tests for the method are fold-recognition and alignment tests--searches and alignments are made for proteins whose structure is known (but not used in the search or alignment), and the results are compared with the results of structural alignment. In a test against other sequence-based search and alignment methods (including PSI-BLAST and ISS), our SAM-T98 method found more true homologs (based on SCOP) than other methods at any level of accepted errors.

A common use of remote homologs is to predict the structure of the protein. We have participated in both the CASP2 and CASP3 experiments for blind prediction of protein structure. In both, we were in the top six groups (invited to the special issue of Proteins) for fold recognition and alignment. In CASP3, our alignments of the comparative-homology targets were consistently among the best (approximately top 3), even though we made no use of structural information.

For CASP3, we also tested a secondary-structure predictor using a neural net and the SAM-T98 multiple alignments used by our fold-recognition method. This predictor was the second best of the 31 groups participating, and we have since improved it.

We have installed an automatic server on the Web to take a sequence (or seed alignment) and produce the multiple alignment of similar sequences in NCBI's non-redundant protein database, search results for proteins with structures in PDB, and secondary-structure predictions.

For more information about our projects, see the Web.


67. Multi-Way Protein Folding Classification Using Support Vector Machines and Neural Networks

C.H.Q. Ding and I. Dubchak

National Energy Research Scientific Computing Center, Lawrence Berkeley National Laboratory, Berkeley, CA 94720

ildubchak@lbl.gov

In bioinformatics research, the classification methods employed so far for multi-class recognition are mostly based on the one-vs-others approach. We investigated two more advanced approaches, the unique one-vs-others approach and the all-vs-all approach, which increase classification accuracy.

We analyzed the traditional sensitivity and selectivity measures for multi-class classification from the new perspective of the contingency table in categorical analysis, and provide some insights. These true-positive- and false-positive-based measures are combined and generalized into a new unified accuracy measure that characterizes the performance of a recognition system more accurately. This measure can be applied consistently and uniformly to all multi-class classification approaches, thus facilitating comparisons between different classification methods.
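The combined measure itself is not reproduced here, but the quantities it generalizes can be sketched directly from a multi-class contingency table (illustrative code with invented counts):

    # Illustrative sketch: per-class sensitivity and selectivity (precision)
    # derived from a multi-class contingency (confusion) table, plus the
    # overall accuracy.

    def per_class_measures(confusion):
        """confusion[i][j] = number of class-i examples predicted as class j."""
        k = len(confusion)
        total = sum(sum(row) for row in confusion)
        correct = sum(confusion[i][i] for i in range(k))
        stats = []
        for i in range(k):
            tp = confusion[i][i]
            fn = sum(confusion[i]) - tp                        # class-i misses
            fp = sum(confusion[r][i] for r in range(k)) - tp   # wrongly called i
            sens = tp / (tp + fn) if tp + fn else 0.0
            sel = tp / (tp + fp) if tp + fp else 0.0
            stats.append((sens, sel))
        return stats, correct / total

    confusion = [[50, 3, 2],
                 [4, 40, 6],
                 [1, 5, 44]]
    stats, overall = per_class_measures(confusion)
    print(stats, overall)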

We used the state-of-the-art Support Vector Machine (SVM) together with earlier neural network (NN) two-class classifiers. The SVM gives higher accuracy and runs much faster than the NN.

Of the six physico-chemical parameter sets extracted from protein sequences, we found that the amino acid composition-based parameter set is the most effective for the discriminative methods. The secondary structure-based parameters are also quite effective. These are followed by parameter sets extracted from hydrophobicity, polarity, van der Waals volume, and polarizability properties.


68. Comparative Analyses of Syntenic Regions using Pattern Filtering

Jonathan E. Moore and James Lake

Molecular Biology Institute and MCD Biology, University of California, Los Angeles, CA 90095

Comparative computational analyses of syntenic DNA sequences among the bilateral animals hold great potential for the identification of genomic features such as protein coding regions, gene boundaries, introns, and genetic regulatory elements. A computational method developed in our lab, called pattern filtering, optimally separates the signals conserved between sequences from the noise caused by the stochastic process of nucleotide substitution. We are currently developing related methods with more statistical power for the identification of protein coding regions and their boundaries. Preliminary analyses using only pattern filtering methods, of mammalian mitochondrial DNAs and of the human chromosome 12p13 locus and its syntenic region in mouse, show the considerable promise of this method. We plan to continue our progress toward rapid and effective analysis, through pattern filtering, of genomic features in the syntenic regions of mouse, human, and other species.


69. Discovery of Distant Regulatory Elements by Comparative Sequence-Based Approaches.

Inna Dubchak1, Chris Mayor1, Lior Pachter2, Gabriella Cretu1, Edward M. Rubin1, and Kelly A. Frazer1

1Genome Sciences Department, Lawrence Berkeley National Laboratory, 1 Cyclotron Road MS 84-171, Berkeley, CA. 94720 and 2Mathematics Department, University of California, Berkeley, CA 94720

ildubchak@lbl.gov

Distant regulatory elements, such as enhancers, silencers, and insulators, are experimentally difficult to identify. Exploiting the fact that these elements tend to be highly conserved among mammals, we are using comparative sequence-based approaches to discover them. To find conserved non-coding sequences with physical attributes of distant regulatory elements we compared ~1 Mb of orthologous human (5q31 interleukin cluster region) and mouse (chromosome 11) sequences. Ninety non-coding sequences (>= 100 bp and >= 70% identity) were identified; analysis of 15 found that ~70% were conserved across mammals but unique in the human genome. Although this study discovered numerous conserved non-coding sequences with features of distant regulatory elements, only one of the two enhancers previously identified in the human 5q31 region was detected.

To improve the ability of comparative sequence analysis to identify distant regulatory elements we have developed a new method which globally aligns the sequences being compared and plots the percent identity of a moving average point (MAP). The advantage of MAP analysis over the previous method is that it can detect conserved non-coding sequences with small insertions/deletions and is capable of three-way species comparisons. Comparison of ~200 kb of orthologous human (5q31), mouse (chromosome 11), and dog (chromosome 4) sequences using MAP analysis found all the known conserved non-coding sequences (>= 100 bp and >= 70% identity) in the region as well as additional non-coding elements, including the enhancer previously undetected by comparative analysis. The overall pattern of non-coding sequence conservation in the orthologous human, dog, and mouse genomic DNA is strikingly similar, suggesting that the majority of elements identified on the basis of conservation are likely to have biologic function. Experimental characterization of the largest non-coding element identified in these studies determined it to be a potent regulatory element of three genes, IL-4, IL-13, and IL-5, spread over 120 kb.
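The windowed-identity idea at the core of such plots can be sketched in a few lines (illustrative only; the window size, gap handling, and toy alignment below are invented):

    # Sketch of MAP-style analysis: given two globally aligned sequences
    # (with '-' gaps), compute percent identity in a moving window centered
    # at each alignment column.

    def windowed_identity(seq1, seq2, window=100):
        assert len(seq1) == len(seq2)
        half = window // 2
        scores = []
        for center in range(len(seq1)):
            lo, hi = max(0, center - half), min(len(seq1), center + half)
            matches = sum(1 for a, b in zip(seq1[lo:hi], seq2[lo:hi])
                          if a == b and a != '-')
            scores.append(100.0 * matches / (hi - lo))
        return scores

    human = "ACGTAC-TGACGTTACG"
    mouse = "ACGTACGTGAC-TTACC"
    print([round(s) for s in windowed_identity(human, mouse, window=8)])

Because identity is computed over the global alignment rather than fixed-offset windows, a small insertion or deletion shifts the window rather than destroying the signal, which is the advantage the abstract describes.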


70. Identification of Novel Functional RNA Genes in Genomic DNA Sequences

S.R. Holbrook, C. Mayor, and I. Dubchak

Physical Biosciences Division and National Energy Research Scientific Computing Center, Lawrence Berkeley National Laboratory, Berkeley, CA 94720

ildubchak@lbl.gov

Finding the location of functional RNA genes in genomic sequences is much more difficult than the assignment of ORFs as potential protein coding genes. To date, the only method of identifying functional RNA genes is by homology.

Our initial approach to locating novel RNA genes was based on the premise that all stable, functional RNAs share common structural elements and that sequences corresponding to these elements occur preferentially in RNA genes. These elements include tetraloops, uridine turns, tetraloop receptors, adenosine platforms, and a high percentage of double-helical base pairing. We have also used the free energy of folding as a structural parameter representing double helicity in RNA sequences. Since the frequency of occurrence of RNA structural elements cannot be expected to identify non-RNA sequences in a positive manner, we identified additional sequence preferences based on global sequence descriptors (previously applied to protein fold prediction) to discriminate RNA genes from non-RNA genes. These descriptors include composition, distribution, and transition parameters.
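These global descriptors can be sketched as follows for a nucleotide sequence (an illustrative reimplementation; the exact parameterization used in the study may differ):

    # Sketch of composition/transition/distribution descriptors of the kind
    # used for global sequence characterization.

    def ctd_descriptors(seq, alphabet="ACGT"):
        n = len(seq)
        composition = {c: seq.count(c) / n for c in alphabet}
        # Transition: fraction of adjacent positions where the letter changes.
        transition = sum(1 for a, b in zip(seq, seq[1:]) if a != b) / (n - 1)
        # Distribution: relative positions (as % of length) of the first,
        # 25th-, 50th-, 75th-percentile, and last occurrence of each letter.
        distribution = {}
        for c in alphabet:
            idx = [i for i, x in enumerate(seq) if x == c]
            if not idx:
                distribution[c] = None
                continue
            marks = ([idx[0]]
                     + [idx[min(len(idx) - 1, int(len(idx) * f))]
                        for f in (0.25, 0.5, 0.75)]
                     + [idx[-1]])
            distribution[c] = [round(100.0 * m / (n - 1), 1) for m in marks]
        return composition, transition, distribution

    print(ctd_descriptors("ACGTACGTTTTGGGCCAATG"))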

A total of 610 examples of E. coli sequence windows (305 from RNA genes, 305 from non-assigned regions) were used to calculate the descriptors and train neural networks. In order to optimize prediction, we used a voting procedure in which predictions were accepted only when made by both types of networks. The accuracy of RNA gene prediction using different combinations of global and structural parameters was estimated by cross-validation. Similarly, we trained neural networks to recognize RNA genes in other species.

Using the trained neural networks we have predicted putative RNA genes in the complete genomes of E. coli, M. genitalium, M. pneumoniae, and P. horikoshii. The weights from the trained networks are now used in a public web server that allows users to make predictions on their own sequences. We will be enlarging the number of organisms represented in the server's database to include other bacteria, lower eukaryotes such as yeast, and ultimately human.


71. Automatic Discovery of Sub-Molecular Sequence Domains in Multi-Aligned Sequences: A Dynamic Programming Algorithm for Multiple Alignment Segmentation

Eric Poe Xing, Ilya Muchnik1, Denise Wolf, Inna Dubchak, Casimir Kulikowski1, Manfred Zorn, and Sylvia Spengler

Center for Bioinformatics and Computational Genomics, Lawrence Berkeley National Laboratory, Berkeley, CA 94720 and 1DIMACS, Rutgers University, Piscataway, NJ 08855

EPXing@lbl.gov

Automatic identification of sub-structures in a multiple sequence alignment is of great importance for effective and objective structural/functional domain annotation, phylogenetic treeing, and many other types of molecular analyses. We present a segmentation algorithm that optimally partitions a given multi-alignment into a set of potentially biologically sensible segments based on the statistical profile of sequence compositions of the multi-alignment, such as gap frequency and character heterogeneity, through dynamic programming and progressive optimization. Using this algorithm, a large multi-alignment of eukaryotic 16S rRNA was analyzed. Three types of sequence patterns (shared conserved domains, shared variable motifs, and rare signature sequences) were identified automatically in a very short time compared to manual annotation, and the results were consistent with the patterns identified through independent phylogenetic approaches. This algorithm can potentially facilitate the automation of sequence-based sub-molecular structural and evolutionary analyses through statistical modeling and high-performance computation.
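A toy version of the underlying dynamic program, using a single per-column statistic (gap frequency) and a squared-deviation segment cost (both stand-ins for the richer composition profiles described above), looks like this:

    # Illustrative sketch: partition alignment columns into k contiguous
    # segments minimizing total within-segment squared deviation, by DP.

    def segment(values, k):
        n = len(values)
        # Prefix sums give O(1) within-segment cost queries.
        s = [0.0] * (n + 1)
        s2 = [0.0] * (n + 1)
        for i, v in enumerate(values):
            s[i + 1] = s[i] + v
            s2[i + 1] = s2[i] + v * v

        def cost(i, j):  # columns i..j-1 as one segment
            m = j - i
            return s2[j] - s2[i] - (s[j] - s[i]) ** 2 / m

        INF = float("inf")
        dp = [[INF] * (n + 1) for _ in range(k + 1)]
        back = [[0] * (n + 1) for _ in range(k + 1)]
        dp[0][0] = 0.0
        for seg in range(1, k + 1):
            for j in range(seg, n + 1):
                for i in range(seg - 1, j):
                    c = dp[seg - 1][i] + cost(i, j)
                    if c < dp[seg][j]:
                        dp[seg][j], back[seg][j] = c, i
        # Recover segment boundaries.
        bounds, j = [], n
        for seg in range(k, 0, -1):
            bounds.append((back[seg][j], j))
            j = back[seg][j]
        return bounds[::-1]

    gap_freq = [0.0, 0.1, 0.0, 0.8, 0.9, 0.7, 0.1, 0.0, 0.0, 0.6, 0.7]
    print(segment(gap_freq, 4))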


72. Classification of Multi-Aligned Sequence Using Monotone Linkage Clustering and Alignment Segmentation

Eric Poe Xing, Ilya Muchnik1, Manfred Zorn, and Sylvia Spengler

Center for Bioinformatics and Computational Genomics, Lawrence Berkeley National Laboratory, Berkeley, CA 94720 and 1DIMACS, Rutgers University, Piscataway, NJ 08855

EPXing@lbl.gov

Optimal clustering of a set of sequences based on an arbitrary set function is often of exponential complexity. Here, a low-order polynomial procedure, based on the quasi-concavity of a special type of objective function, was developed to cluster multi-aligned sequences based on each of the segments resulting from the aforementioned segmentation process. It clusters sequences according to their degree of similarity to a pre-specified reference pattern (i.e., a consensus sequence or a particular organismal sequence of choice). Combining such clusterings from multiple segments results in a fairly fine-grained classification of all the sequences in the alignment, with a general pattern reminiscent of the branching order in a corresponding phylogenetic tree, but with additional information bearing on the assumption of modular evolution. This algorithm can be applied to a broad spectrum of molecular sequence analysis tasks, such as phylogenetic subtree construction or recognition and tree updating and labeling, and can serve as a framework to organize sequence data in an efficient and easily searchable manner.


73. Extensions to the Arraydb Micro-Array LIMS

Donn Davy1, Daniel Pinkel2, Donna Albertson2, Gregory Hamilton2, Joel Palmer2, Donald Uber1, Arthur Jones1, Joe Gray2, and Manfred Zorn1

1Lawrence Berkeley National Laboratory, Berkeley, CA 94720 and 2University of California, San Francisco Cancer Center, San Francisco, CA

dfdavy@lbl.gov

We have extended our Arraydb Laboratory Information Management System (LIMS) collaboration with the UCSF Cancer Center to allow users from separate laboratories to independently track progress as micro-array slides are printed and used. The system prevents users from seeing each other's data, except where specified, while tracking clones; DNA as it is prepared and plated; microtiter plates; print-run specifications; slide-printing runs and the slides printed; experiments on, and images of, the slides; and image analyses. Both the slide printer and the image-analysis software write directly to the database. Users have the option of downloading selected tabular reports to Excel spreadsheets. Enhancements underway will also allow upload into the database from spreadsheets.

The system is implemented on an Oracle 8 database and served on the Web by a NetDynamics application server, providing a highly scalable, flexible, and responsive solution. It is accessible from Java-compatible web browsers and provides fine-grained control over security and accessibility.


74. Identifying Single Nucleotide Polymorphisms (SNPs) in Human Candidate Genes

Deborah A. Nickerson, Scott L. Taylor, and Mark J. Rieder

Department of Molecular Biotechnology, Box 357730, University of Washington, Seattle, WA 98195

debnick@u.washington.edu

Single nucleotide substitutions and unique base insertions and deletions are the most common forms of polymorphism and disease-causing mutation in the human genome. Based on the natural frequency of these variants, they are likely to be the underlying cause of most phenotypic differences in humans. Because of their functional importance, a number of methods have been developed to identify single nucleotide polymorphisms (SNPs). Among these, direct sequence analysis has many advantages because it provides complete information about the location and nature of any variants in a single pass, is automatable, widely available, and simple to apply. To further automate the detection of SNPs by direct sequence analysis we have developed PolyPhred, which, together with Phred, Phrap, and Consed, identifies nucleotide substitutions within a target sequence. Over the past year, we have developed several approaches to increase the accuracy and selectivity of PolyPhred, as well as a tool known as PolyPhred2db that simplifies the construction of SNP databases from information obtained by PolyPhred. The application of these tools in analyzing the diversity of human candidate genes will be described. Our results suggest that the levels and patterns of sequence variation found in human genes could pose challenges in identifying the sites, or combinations of sites, that influence variation in the risk of disease within and among human populations.
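PolyPhred itself works on fluorescent trace data (e.g., diminished peak heights at heterozygous positions); purely for illustration, the cross-read comparison step can be caricatured as flagging alignment columns where two alleles each have multiple high-quality supporting base calls (all thresholds below are invented, and this is not PolyPhred's algorithm):

    # Highly simplified illustration of SNP candidate detection from aligned
    # reads with per-base Phred quality scores.

    def candidate_snps(reads, quals, min_qual=20, min_support=2):
        """reads: equal-length aligned strings; quals: matching Phred scores."""
        sites = []
        for col in range(len(reads[0])):
            counts = {}
            for read, q in zip(reads, quals):
                base = read[col]
                if base != '-' and q[col] >= min_qual:
                    counts[base] = counts.get(base, 0) + 1
            alleles = [b for b, c in counts.items() if c >= min_support]
            if len(alleles) >= 2:
                sites.append((col, sorted(alleles)))
        return sites

    reads = ["ACGTAC", "ACGTAC", "ACATAC", "ACATAC"]
    quals = [[30] * 6] * 4
    print(candidate_snps(reads, quals))   # [(2, ['A', 'G'])]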


75. Integrating Sequence and Biology: Developing an Informatics Infrastructure for Mouse/Human Comparative Genomics

C.J. Bult, J.T. Eppig, J.A. Blake, J.E. Richardson, and J.A. Kadin

The Jackson Laboratory, Bar Harbor, ME 04609

cjb@informatics.jax.org

Sequence similarity provides a powerful mechanism for predicting orthologous relationships between mouse and human genes. However, it is the extension of sequence-level correspondence to detailed knowledge about genes and their relationships to phenotype that makes the comparative genomics approach such a powerful one for understanding biological processes. As the capacity to collect large data sets of complex biological information grows, integration of data about the same genomic feature from diverse sources will be key to developing new insights into human biology using the mouse as a model organism.

Although a number of highly automated sequence annotation pipelines have been developed to support large-scale genomic sequence projects, relatively little attention has been paid to developing the infrastructure needed to integrate genomic sequence data with related biological information. The Mouse Genome Sequence (MGS) database is being developed to provide access to annotated mouse genome sequence that has been integrated with existing biological knowledge about the laboratory mouse (e.g., phenotype, expression data, and gene homology) represented in other databases (see the Mouse Genome Database and Gene Expression Database at http://www.informatics.jax.org). MGS represents a critical component of the informatics infrastructure needed to support comparative, computational, and functional genomics.


76. WIT2 -- An Integrated System for Genetic Sequence Analysis and Metabolic Reconstruction

Ross Overbeek1,2, Gordon Pusch1,2, Mark D'Souza1, Evgeni Selkov Jr.1,2, Evgeni Selkov1,2, and Natalia Maltsev1

1Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL and 2Integrated Genomics Inc., Chicago, IL

maltsev@mcs.anl.gov

The WIT2 system was designed and implemented to support genetic sequence analysis and comparative analysis of sequenced genomes, as well as metabolic reconstructions from the sequence data. It now contains data from 38 distinct genomes. WIT2 provides access to thoroughly annotated genomes within a framework of metabolic reconstructions connected to the sequence data; protein alignments and phylogenetic trees; and data on gene clusters, potential operons, and functional domains. We believe that analyzing a large number of phylogenetically diverse genomes in parallel can add a great deal to our understanding of higher-level functional subsystems and the physiology of the organisms. The unique features of WIT2 include:

  • WIT2 is based on the unique EMP/MPW collection of enzymes and metabolic pathways developed by E. Selkov et al., which contains extensive information on the enzymology and metabolism of different organisms.
  • WIT2 allows researchers to perform interactive genetic sequence analysis within a framework of metabolic reconstructions and to maintain user models of an organism's functionality.
  • WIT2 provides access to a set of Web-based and original batch tools that offer extensible query access against the data.
  • WIT2 supports both shared and nonshared annotation of features and the maintenance of multiple models of metabolism for each organism.
  • WIT2 supports metabolic reconstructions from Expressed Sequence Tag (EST) data.


77. PUMA2 -- An Environment for Comparative Analysis of Metabolic Subsystems and Automated Reconstruction of Metabolism of Microbial Consortia and Individual Organisms from Sequence Data

Natalia Maltsev and Mark D'Souza

Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL

maltsev@mcs.anl.gov

We have developed a working prototype of the interactive environment PUMA2, which is intended to accomplish the following goals:

1. Allow comparative analysis of the metabolic subsystems in different organisms

2. Provide a framework for the automated reconstruction of the metabolism of microbial consortia and individual species

3. Provide a framework for representation of the expression data.

Analysis of data in PUMA2 is based on an original approach that represents metabolism as a network of interconnected modules linked to the sequence data. The results of such analyses will be presented in graphical form based on a hierarchical representation of the functional subsystems, annotated with sequence data and literature information.


78. Progress Report on EMP Project

Evgeni Selkov, Nadezhda Avseenko, Valentina Dronova, Galina Dyachenko, Aleksandr Elefterov, Milyausha Galimova, Nadezhda Fedotcheva, Maria Fomkina, Tatiana Kharybina, Irina Krestova, Aleksandr Kuzmin, Elena Mudrik, Nikolay Mudrik, Valentina Nenasheva, Valeri Nenashev, Evgeni Nikolaev, Aleksandr Osipov, Lyudmila Pronevich, Anna Rykunova, Aleksey Selkov, Evgeni Selkov, Jr., Vladimir Semerikov, Tatiana Sirota, Anatoly Sorokin, Oleg Stupar', Vadim Ternovsky, and Olga Vasilenko

EMP Project Inc, Russian Subsidiary, Institutskaya 4, Suite 121, 142292 Pushchino, Moscow Region, Russia

In the first quarter of this FY, the main focus of the EMP Project was on the following:

Maintaining a high database annotation rate of about 800 records per month: 2,423 and 6,850 records were encoded during the last three and twelve months, respectively (see Table 1), bringing the total to 29,991 records by the end of 1999.

Table 1. EMP Updating Rate For the Last Twelve Months

Month Records Encoded  
Jan-99 147
Feb-99 395
Mar-99 372
Apr-99 461
May-99 484
June-99 615
July-99 795
Aug-99 599
Sept-99 559
Oct-99 827
Nov-99 899
Dec-99 697
TOTAL 6850

Updating the EMP content with new records for novel enzymes classified in Supplements 5 and 6 to the Enzyme Nomenclature.

Developing a new EMP format to simplify encoding and retrieval of the information, with a special effort to extend the format toward signal transduction pathways and phenotype.

Organizing and training groups of annotators in specific information domains, e.g., Enzyme Kinetics, Signal Transduction, Metabolic Pathways, and Phenotype.

Organizing a Software Development Team of 8 professional developers, and designing a new web site interface for EMP to be installed within the next two months.

Opening a new 120 m2 office in Pushchino with 19 workstations; improving hardware, network connections, and technical support for annotators working at home.

In the coming months, we plan to develop new EMP format documentation for annotators and users. This will slow down the annotation rate; still, it will remain at the planned level of not less than 500 records per month.

With the very active support of GlaxoWellcome, we have begun a dialogue with the Swiss-Prot staff to unify nomenclature and software development and to coordinate information processing and encoding between Swiss-Prot and EMP. A trilateral workshop to discuss this cooperation between Swiss-Prot, the EMP Project, and GlaxoWellcome is to be held in Geneva on March 27-29.


79. BCM Search Launcher - Providing Distributed, Enhanced Sequence Analysis

M. P. McLeod, Z. Yang, and K. C. Worley

Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030

mmcleod@bcm.tmc.edu

We provide web access to a variety of enhanced sequence analysis search tools via the BCM Search Launcher. The BCM Search Launcher is an enhanced, integrated, and easy-to-use interface that organizes sequence analysis servers on the WWW by function, and provides a single point of entry for related searches. This organization makes it easier for individual researchers to access a wide variety of sequence analysis tools. The Search Launcher extends the functionality of other web-based services by adding hypertext links to additional information that can be extremely helpful when analyzing database search results.

The BCM Search Launcher Batch Client provides access to all of the searches available from the Search Launcher web pages in a convenient interface. The Batch Client application automatically 1) reads sequences from one or more input files, 2) runs a specified search in the background for each sequence, and 3) stores each of the search output files as individual documents directly on the user's system. The HTML-formatted result files can be browsed at any later date, or retrieved sequences can be used directly in further sequence analysis. For users who wish to perform a particular search on a number of sequences at a time, the Batch Client provides complete access to the Search Launcher with the convenience of batch submission and background operation, greatly simplifying and expediting the search process.

BEAUTY, our BLAST Enhanced Alignment Utility, makes it much easier to identify weak but functionally significant matches in BLAST protein database searches. BEAUTY is available for DNA queries (BEAUTY-X) and for gapped alignment searches. Up-to-date versions of the Annotated Domains database present annotation information. The latest version of this database includes domain information from DOMO and ProDom in addition to BLOCKS, PRINTS, Pfam, Entrez sequence records, and PROSITE.

Recent enhancements to the BCM Search Launcher include the addition of searches for human genomic sequences, additional domain information with BEAUTY, and an updated help system designed to assist researchers with little or no experience in computational biology.

Our collaboration with the Genome Annotation Consortium provides BEAUTY search results for all of the predicted protein sequences found in the human genomic sequences produced by the large scale sequencing centers.

Support provided by the DOE (DE-FG03-95ER62097/A000).


80. Data Submission Tool

Manfred D. Zorn and David Demirjian

Lawrence Berkeley National Laboratory, Berkeley, CA 94720

DGDemirjian@lbl.gov

SubmitData provides researchers with a simple and intuitive solution for annotated data submissions to public databases.

Current feature list:

  • Incorporates XML as the data exchange format
  • User defined database definitions using XML documents
  • Smart GUI interfaces
  • Complete data validation using data type syntax and rule processing
  • Point & Click error correction on invalid data elements
  • Simple and Complex Batch Submission support
  • Create persistent reusable Batch Submission Templates
  • Batch Submission process can merge data from external files into selected elements using Batch Submission Template Variables
  • User definable data export format
  • XML GUI Document Editor / Parser / Validation
  • Complete Help Pages
  • Uses current Java Technologies: Java2, JavaMail, JavaHelp, Java Project X

81. Working Examples of XML in the Management of Genomic Data

J. D. Cohn and M. O. Mundt

Bioscience Division and DOE Joint Genome Institute, Los Alamos National Laboratory, Los Alamos, NM 87545

jcohn@lanl.gov

XML is fast becoming the universal format for structured data exchange and documents on the Web. Standards for XML were developed by the World Wide Web Consortium (W3C) and have been adopted for a wide range of applications from e-commerce to mathematics and chemistry. Unlike HTML, XML was designed to be extended and offers a much richer base to build upon (including capabilities for using binary as well as ASCII data).

Until now, exchange of genomic data has been limited primarily to FastA files and a few proprietary or application-specific formats. XML seems to offer an ideal means of enhancing our capability of exchanging data within the genomics community. Using a growing array of XML parsers and other development tools, XML-formatted data can be utilized by software applications written in a variety of different languages across multiple hardware platforms. Major database systems (e.g., Oracle) are beginning to offer XML output for SQL database queries. Further, the W3C is working on a standard for an XML query language for searching XML documents directly.

Recently we have taken the first steps in making use of XML in our distributed sequencing informatics system. Among the applications for XML that we will describe are: 1) automated loading and analysis of sample files from multiple sources (production sequencing, finishing, cDNA, outside laboratories, etc.) using naming-convention documents; 2) distributed data management; 3) user preference files; and 4) sequence annotation. All of these have been accomplished using the XML Parser for Java from DataChannel. Examples of XML code as well as descriptions of the applications will be presented.
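As a generic illustration of XML-encoded sequence annotation (the element names below are invented for this sketch, and the Python standard-library parser stands in for the DataChannel parser used in the actual system):

    # Illustrative sketch: a hypothetical annotation document parsed with the
    # standard library's ElementTree.

    import xml.etree.ElementTree as ET

    doc = """<?xml version="1.0"?>
    <annotation clone="RPCI11-123A4" assembly="1999-12-15">
      <feature type="STS" name="D16S3136" start="10234" end="10410"/>
      <feature type="exon" source="Grail-EXP" start="22010" end="22187"/>
    </annotation>"""

    root = ET.fromstring(doc)
    for feat in root.findall("feature"):
        print(feat.get("type"), feat.get("name"),
              int(feat.get("start")), int(feat.get("end")))

Because the structure is self-describing, the same document can be validated against a shared definition and consumed by loaders, viewers, and databases written in different languages, which is the interoperability argument made above.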


82. The Genome Database -- Integrating Maps with Sequence

Christopher J. Porter, C. Conover Talbot Jr., and A. Jamie Cuticchia

The Johns Hopkins University, Baltimore, MD and The Hospital for Sick Children, Toronto, ON, Canada

jamie@bioinfo.sickkids.on.ca

As reported at the 1999 Contractor-Grantee Workshop, the Genome Database (GDB) is now hosted by the Bioinformatics Supercomputing Centre (BiSC) of the Hospital for Sick Children, Toronto. The database was transferred in May 1999, and has since been moved to an SP supercomputer, donated to the GDB project by IBM.

GDB introduced a number of new tools during 1999. We have used NCBI's electronic PCR (e-PCR) software in a tool that retrieves GDB Amplimers predicted to amplify from a sequence. GDB-BLAST uses the BLAST server on BiSC's Origin supercomputer, and displays GDB objects linked to the sequences retrieved. Additionally, output from BiSC's public high speed BLAST server was modified to display links to GDB. BiSC's supercomputers made feasible the use of e-PCR to create a database of potential amplification sites for GDB Amplimers in all public human sequence.
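The core of an e-PCR-style check can be sketched as follows (a simplification: NCBI's e-PCR also tolerates mismatches and STS size margins, and the primer sequences and sizes below are invented):

    # Illustrative sketch: report positions where a primer pair (forward
    # primer, then the reverse complement of the reverse primer on the same
    # strand) occurs within the expected product-size range.

    COMP = str.maketrans("ACGT", "TGCA")

    def find_all(seq, sub):
        i = seq.find(sub)
        while i != -1:
            yield i
            i = seq.find(sub, i + 1)

    def epcr(seq, fwd, rev, min_size, max_size):
        rev_rc = rev.translate(COMP)[::-1]
        hits = []
        for i in find_all(seq, fwd):
            for j in find_all(seq, rev_rc):
                size = j + len(rev_rc) - i
                if min_size <= size <= max_size:
                    hits.append((i, size))
        return hits

    seq = "GGGACGTACGTTTTTTTTTTTTCCATGCATGAAA"
    print(epcr(seq, fwd="ACGTACGT", rev="CATGCATG", min_size=20, max_size=40))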

These resources are serving to improve GDB's mapping of Amplimers and Genes. Work progresses to integrate the extensive body of variation and SNP data from GDB into sequence-level maps.

The recent release of the full sequence of chromosome 22 has served as a proving ground for the integration of GDB data with complete human sequence. BiSC's sequence analysis tools were used to map GDB objects onto the sequence. These results are displayed as an interactive graphical map and are being integrated into GDB's comprehensive map. These approaches show how GDB will integrate classical mapping information with the rapidly emerging genomic sequence.

Collaborations with sequencing centers and the Genome Annotation Consortium are continuing to load clone tiling paths into GDB, and to create bidirectional links between GDB records and the annotated sequence.

International interest in GDB continues - 1999 saw the establishment of a new GDB node in Beijing, China, and work is underway to create a node in Bangalore, India.


83. A Visual Data-Flow Editor Capable of Integrating Data Analysis and Database Querying

Dong-Guk Shin1, Ravi Nori2, Rich Landers2, and Wally Grajewski2

1Computer Science and Engineering, University of Connecticut, Storrs, CT 06269-3155 and 2CyberConnect EZ, LLC, Storrs, CT 06268

shin@engr.uconn.edu

Determining and mapping sequence variations or polymorphisms between homologous genomic regions requires access to genomic data available from different sources and the use of many data analysis and visualization programs. It is imperative that software be developed to enable genome scientists to automate tedious and repetitive data handling, database querying, and analysis tasks. Our approach is to develop a data-flow editing environment in which genome scientists with minimal computer training can easily describe data analysis tasks. The scientists' use of the software tool involves organizing and coordinating individual tasks of data retrieval from different data sources, combined with data analysis tasks, to derive answers to biologically significant questions.

Phase I aimed at developing prototype software that demonstrates the feasibility of full-scale development of a data-flow editing environment in which interactions between data access and data analysis can be freely described by genome scientists with minimal computer training. The feasibility study is based on a working scenario of determining homology relationships between known DNA sequences from one species and unknown sequences from a taxonomically related species.
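
A minimal sketch of the data-flow idea (node names and stand-in functions are invented, not the actual editor): nodes wrap retrieval or analysis steps, and wiring an edge feeds one node's output into the next.

    # Each node wraps a retrieval or analysis step; running a node pulls
    # from its upstream nodes first. The steps below are stand-ins.
    class Node:
        def __init__(self, name, func):
            self.name, self.func, self.inputs = name, func, []

        def wire(self, upstream):
            self.inputs.append(upstream)
            return self

        def run(self):
            args = [n.run() for n in self.inputs]
            return self.func(*args)

    fetch = Node("fetch", lambda: ">known\nACGTACGT")        # query a source DB
    blast = Node("blast", lambda seq: [("unknown1", 0.87)])  # stand-in for BLAST
    report = Node("report", lambda hits: [h for h, s in hits if s > 0.8])

    print(report.wire(blast.wire(fetch)).run())  # -> ['unknown1']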

Software of this kind is expected to be immediately usable by the molecular biology community and the pharmaceutical industry, both of which are becoming more computationally intensive. Since data-flow management problems are not unique to computational biology, the software developed is expected to be useful in many other data- and computation-intensive areas, e.g., physics, chemistry, engineering, and finance.

The proposed software will enable scientists to automate repetitive analysis tasks involving the enormous amounts of DNA sequence data that must be analyzed to understand their implications for biological and environmental processes. Without such a tool, the difficulties involved in conducting these large-scale data analysis projects could be insurmountable, given the magnitude of the data available and the variety of analysis techniques involved.

This work was supported in part by the DOE SBIR Phase I Grant No. DE-FG02-99ER82773.


84. Annotating DNA with Protein Coding Domains

Winston A. Hide1, Robert Miller1, Gary L. Sandine2, and David C. Torney2

1South African National Bioinformatics Institute, University of the Western Cape, South Africa and 2Los Alamos National Laboratory, Los Alamos, NM

dct@ipmati1.lanl.gov

DNA genomic sequence is now becoming readily available for the human and fly genomes. Reliably finding genes and annotating gene information remains, however, at a premium. Coding domains within gene sequences are detected both by gene prediction programs, which locate exons based on predictive models, and by similarity to known expressed sequences.

Predictive gene detection methods are not yet sensitive enough to accurately predict all exons of a given gene. In addition, once exons are predicted, only sequence comparison provides reliable corroboration. Once located, exonic DNA sequences need to be correctly translated into their corresponding proteins. The proteins may then be compared with known protein sequences corresponding to known structures, as determined by direct or modelled homology.

Our annotation approach employs the novel paradigm of direct annotation of DNA based upon the secondary-structure properties of its translate (e.g., helix, sheet, and turn). To accomplish this, we have developed Bayesian classification methods for biological sequences. These methods use examples for 'training'. We have used several secondary-structure classes of polypeptides from the CATH database (Orengo et al., 1997). (The latter is a valuable resource because it has strictly hierarchically classed secondary structures and presents homologous superfamilies found in genome sequences.)

We have successfully completed an analysis of such classes. The integrals of the Bayes'-rule formulas are approximated by finding the global maximum of the integrand, the product of the probabilities of the sequences in the sample. This has been a challenge for numerical analysis, but a constrained quadratic programming approach yielded near-optimal points. For example, for peptides of length four taken from alpha-helix sequences, each point consists of 300 parameters characteristic of the sequences in the sample. These parameters yield posterior likelihoods for all peptides of length four to belong to the class.
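
For illustration only, the Bayes'-rule arithmetic can be sketched with a toy position-independent residue model (the authors' actual model has 300 parameters per class, fit by constrained quadratic programming; the frequencies below are invented):

    # Illustrative sketch: posterior probability that a 4-mer peptide
    # belongs to a secondary-structure class, under a toy per-residue
    # multinomial model for each class.
    from math import prod  # Python 3.8+

    def likelihood(peptide, freqs):
        """P(peptide | class) under a position-independent residue model."""
        return prod(freqs.get(aa, 1e-4) for aa in peptide)

    def posterior(peptide, class_freqs, priors):
        joint = {c: priors[c] * likelihood(peptide, f)
                 for c, f in class_freqs.items()}
        total = sum(joint.values())
        return {c: p / total for c, p in joint.items()}

    helix = {"A": 0.12, "L": 0.11, "E": 0.09}  # toy frequencies
    sheet = {"V": 0.12, "I": 0.10, "T": 0.08}
    print(posterior("ALEA", {"helix": helix, "sheet": sheet},
                    {"helix": 0.5, "sheet": 0.5}))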

The DNA sequences for these polypeptides may also be used for 'training'. Our direct approach, however, has been to submit genomic sequence to exon prediction engines such as 'Genome Annotator Pipeline', and, in addition, to generate a large number of processed expressed sequence tag fragments for reduced redundancy and consensus generation using clustering. Proteins predicted from both the exons and the clustered consensus sequences are submitted for analysis using our statistical methods.

The results are presented via a web system which reveals likely structural domains within exons and coding expressed sequence. We have implemented a web tool that accepts raw DNA sequence and generates predicted coding regions from expressed sequences, or accepts predicted exonic information, processes it for the statistical states of structural class, and displays the predicted states for each of the structurally encoded parameters.

Our next steps will be to

  • Determine the sensitivity and selectivity of the statistic with respect to current secondary-structure prediction tools (which rely on models or empirical derivations)
  • Analyze 100,000 EST consensus sequences, producing peptides and predicting their domains.
  • Determine the efficacy of employing an implementation of our methods for synergistic support of gene finding tools
  • Map gene prediction outputs onto structurally predicted states to determine jointly predicted exons.
  • Annotate jointly predicted exons onto known gene and protein structures
  • Determine the efficacy of combining our methods with other methods for finding genes
  • Implement these methods in other important contexts, such as functional promoter class characterization and annotation.

85. Clustering and Visualizing Yeast Microarray Expression Data Using VxInsight™

George Davidson1, Edwina Fuge2, and Margaret Werner-Washburne2

1Sandia National Laboratories, Albuquerque, NM and 2Biology Department, University of New Mexico, Albuquerque, NM

maggieww@unm.edu

The database visualization tool VxInsight™ was used to cluster and visualize two sets of data from Saccharomyces cerevisiae: Spellman's cell-cycle data and data obtained in our laboratory examining gene expression during exit from stationary phase. Microarray hybridization data were ordinated using correlation, after Eisen, and displayed in VxInsight™. Chromatin genes expressed in S-phase are shown to group closely together in this 2D visualization environment, as expected. This visualization paradigm can be used to identify uncharacterized genes that group closely with well-studied genes, for example genes associated with stationary phase. VxInsight™ automatically assembles HTML pages with expression plots and with the associated links into the Proteome database. Further development of this capability will enable faster and more user-friendly interpretation of huge volumes of microarray data.
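
As a sketch of the correlation step only (VxInsight™'s ordination and display are more elaborate; the profiles and the 0.9 threshold below are toy values):

    # Group expression profiles by pairwise correlation, after Eisen.
    import numpy as np

    profiles = {
        "HTA1": np.array([1.0, 2.1, 3.9, 2.0, 1.1]),  # toy S-phase-like
        "HTB1": np.array([0.9, 2.0, 4.2, 1.8, 1.0]),
        "SNZ1": np.array([4.0, 2.0, 0.5, 2.1, 4.1]),  # toy stationary-like
    }

    names = list(profiles)
    corr = np.corrcoef([profiles[n] for n in names])

    # Report gene pairs whose correlation exceeds a threshold.
    for i, a in enumerate(names):
        for j in range(i + 1, len(names)):
            if corr[i, j] > 0.9:
                print(a, "groups with", names[j], round(corr[i, j], 3))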


86. Comprehensive Microbial Genome Display and Analysis

Frank Larimer, Doug Hyatt, Miriam Land, Richard Mural, Morey Parang, Manesh Shah, Jay Snoddy, and Ed Uberbacher

Computational Biosciences and Toxicology and Risk Analysis Sections, Life Sciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831

http://compbio.ornl.gov

We are now representing all completed microbial genomes in the Genome Channel and the Genome Catalog, providing comprehensive sequence-based views of genomes, from a full genome display down to the nucleotide sequence level. We have developed a tool for comparative multiple genome analysis that provides automated, regularly updated, comprehensive annotation of microbial genomes using consistent methodology for gene calling and feature recognition. The visual genome browser currently represents ca. 51,000 Microbial GRAIL gene models as well as over 45,000 GenBank gene models. Precomputed BEAUTY searches are provided for all gene models, with links to original source material as well as links to additional search engines. Comprehensive representation of microbial genomes will require deeper annotation of structural features, including operon and regulon organization, promoter and ribosome binding site recognition, repressor and activator binding site calling, transcription terminators, and other functional elements. Sensor development is in progress to provide access to these features. Linkage and integration of the gene/protein/function catalog to phylogenetic, structural, and metabolic relationships are being developed.

A draft analysis pipeline has been constructed to provide annotation for the microbial sequencing projects being carried out at the Joint Genome Institute. The pipeline is being applied to annotating the Nitrosomonas europaea and Prochlorococcus marinus genomes currently being sequenced. Multiple gene callers (currently Generation, Glimmer and Critica) are used to construct a candidate gene model set. The conceptual translations of these gene models are used to generate similarity search results and protein family relationships; from these results a metabolic framework is constructed and functional roles are assigned. Simple repeats, complex repeats, tRNA genes and other structural RNA genes are also identified. Annotation summaries are made available through the JGI Microbial Sequencing web site; in addition, draft results are being integrated into the interactive display schemes of the Genome Channel/Catalog.
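
The merging of gene callers into a candidate set might be sketched as follows (coordinates and the exact-agreement rule are illustrative; the real pipeline reconciles overlapping, non-identical calls):

    # Collapse identical calls from multiple gene finders and record which
    # callers support each candidate. All coordinates are invented.
    from collections import defaultdict

    calls = {
        "Generation": [(120, 680, "+"), (900, 1500, "-")],
        "Glimmer":    [(120, 680, "+"), (2000, 2600, "+")],
        "Critica":    [(120, 680, "+")],
    }

    support = defaultdict(list)
    for caller, models in calls.items():
        for model in models:
            support[model].append(caller)

    for (start, end, strand), callers in sorted(support.items()):
        print(f"{start}-{end} ({strand}) supported by {', '.join(callers)}")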

(Research sponsored by the Office of Biological and Environmental Research, USDOE under contract number DE-AC05-96OR22464 with Lockheed Martin Energy Research Corp.)


87. Infrastructure and Tools for High Throughput Computational Genome Analysis

Doug Hyatt, Phil Locascio, Victor Olman, Manesh Shah, and Inna Vokler

Computational Biosciences Section, Life Sciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831

http://compbio.ornl.gov/

The Computational Biosciences Section at Oak Ridge National Laboratory provides computational genome analysis resources to the DOE Joint Genome Institute, other major genome centers, and the international biology community. These resources are also used internally to support the analysis of sequences in the ORNL Genome Channel and Genome Catalog systems. With the draft human genome sequence expected by the spring of 2000, the challenges in computational analysis of biological data are now critical. We have constructed a computational infrastructure to meet these new demands for processing sequence and other biological data, for genome centers and for the biological community at large. Utilizing OBER's timely investment in a high-performance computing resource at ORNL, we have developed the Genomic Integrated Supercomputing Toolkit (GIST) to address this critical throughput challenge and to provide advanced capabilities for the Genome Analysis Toolkit (GAT). Both systems are described below.

Genome Analysis Toolkit (GAT)

The Genome Analysis Toolkit incorporates a wide variety of analysis tools: exon and gene prediction tools, other feature recognition systems, and database homology search systems. The exon and gene recognition systems include Grail, GrailExp, and Genscan, as well as the microbial gene prediction systems Generation and Glimmer. Additionally, the Grail suite of tools, consisting of CpG island, polyA site, simple and complex repeat, and BAC end analysis tools, has been incorporated. Also included are the NCBI STS e-PCR, RepeatMasker, and tRNAscan-SE systems. Database homology systems include NCBI BLAST and BEAUTY post-processing. Supported organisms include human, mouse, Arabidopsis, Drosophila, and most sequenced microbial organisms.

Access to these resources is provided by the GAT client-server system. The Genome Analysis Toolkit is structured as a layered system. The innermost layer is the tool layer, which comprises the binary executables for the individual tools and the associated configuration and data files they require. The binaries are compiled for all supported hardware platforms and operating systems. The service layer, implemented in Perl, provides a platform-independent mode of tool execution. When a service script is invoked by the server, it determines the platform on which it is running and calls the appropriate tool binary. Rigorous error checking has been added at this layer to guarantee that errors in tool execution are caught and reported to the server.
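
A minimal sketch of the service-layer idea (the real layer is written in Perl; the bin/<platform> directory layout here is an assumption):

    # Pick the tool binary compiled for the current platform and surface
    # any execution failure to the caller.
    import os
    import platform
    import subprocess

    def run_tool(tool, *args):
        plat = f"{platform.system()}-{platform.machine()}".lower()
        binary = os.path.join("bin", plat, tool)  # assumed layout
        if not os.path.exists(binary):
            raise RuntimeError(f"{tool}: no binary for platform {plat}")
        result = subprocess.run([binary, *args], capture_output=True, text=True)
        if result.returncode != 0:  # rigorous error checking
            raise RuntimeError(f"{tool} failed: {result.stderr.strip()}")
        return result.stdout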

Access to individual services is provided through a master-slave server layer. The master server receives all analysis requests from clients and distributes them among the heterogeneous pool of slave machines to best utilize the available compute resources and to achieve optimal throughput. Compute-intensive analysis tasks like BLAST searches are directed to the GIST server, running on ORNL's IBM RS/6000 SP infrastructure, described below.

A generic, platform-independent command-line (client) interface, written in Perl, can be used to submit individual analysis requests to the server. A specialized batch-processing tool, ornl_pipeline, has been developed to facilitate the specification of customized analysis pipelines. On invocation, ornl_pipeline reads a user-specified configuration file consisting of a set of analysis directives. A single directive can consist of a logical chain of analyses to be performed on the given sequence. The pipeline then interacts with the server, submitting the specified requests along with the associated input data and collecting the server responses. The output of one analysis is typically fed as input to the next analysis in the chain, in a pipelined fashion. All results are then suitably organized and reported.
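
The chaining behavior can be sketched as follows (the "a -> b -> c" directive syntax and the server.submit() call are assumptions for illustration, not the real ornl_pipeline interface):

    # Each directive is a chain of steps; one step's output feeds the next.
    def run_directives(server, config_text, sequence):
        results = {}
        for line in config_text.splitlines():
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            steps = [s.strip() for s in line.split("->")]
            data = sequence
            for step in steps:  # pipelined: output of one feeds the next
                data = server.submit(step, data)
            results[line] = data
        return results

    class EchoServer:  # stand-in for the real GAT server
        def submit(self, step, data):
            return f"{step}({data})"

    print(run_directives(EchoServer(), "RepeatMasker -> GrailExp -> BLAST",
                         "contig42"))
    # -> {'RepeatMasker -> GrailExp -> BLAST': 'BLAST(GrailExp(RepeatMasker(contig42)))'}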

GIST (Genomic Integrated Supercomputing Toolkit)

The initial tools included in GIST are a framework of high-performance biological application servers that include massively parallel BLAST codes (versions of BLASTN, BLASTP, and BLASTX), which are at the heart of analysis processes such as gene modeling with GRAIL-EXP. We are currently in the process of adding gene modeling tools (e.g., GRAIL-EXP) and plan to add multiple sequence alignment, protein classification, protein threading, and phylogeny reconstruction (for both gene trees and species trees).

The GIST resources are utilized by the GAT server in a transparent fashion, permitting the gradual introduction of new algorithms and tools without jeopardizing existing operations. Because the query infrastructure is logically decoupled, we have been able to produce a system with both excellent scaling ability and many fault-tolerant characteristics. In testing the ability to run multiple instances of tools requiring BLAST, we have demonstrated that the removal of any dependent service does not cause loss of data. Instead, when processing power is removed, we observe a graceful degradation of service as long as some instantiation of the service remains available; options permit "never fail" operation to cope with network failures and long-running operations. GIST's logical structure can be thought of as having three overall components: client, administrator, and server. All components share a common infrastructure consisting of a naming service and query agent, with the administrator having policy control over agent behavior and namespace profile.

The tools and servers are transparent to the user but able to manage the large amounts of processing and data produced in the various stages of enriching experimental biological information with computational analysis. The goal of GIST is not only to provide one-stop shopping for a genome sequence-data framework and interoperable tools, but also to run the codes in the toolkit on platforms where the kinds of questions users can ask are not greatly affected by hardware limitations.

Located at Oak Ridge National Laboratory within both the Center for Computational Sciences and the Computational Biosciences section, the computational infrastructure consists of the centerpiece IBM SP3, some SGI SMP machines, a DEC Alpha Workstation cluster, and a trial Linux PC cluster. We are rapidly approaching beta-stage deployment testing; after testing performance and stability, we hope to deploy the framework at NERSC, other high-performance computing sites, and other collaborators.

(Research sponsored by the Office of Biological and Environmental Research, USDOE under contract number DE-AC05-96OR22464 with Lockheed Martin Energy Research Corp.)


88. Genome Information Warehouse: Information and Databases to Support Comprehensive Genome Analysis and Annotation

Miriam Land, Denise Schmoyer, Morey Parang, Jay Snoddy, Sergey Petrov, Richard Mural, and Ed Uberbacher

Computational Biosciences and Toxicology and Risk Analysis Sections, Life Sciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37830

http://compbio.ornl.gov/

The Genome Information Warehouse (GIW) supports the ORNL-based genome annotation and analysis effort by integrating experimental data and computational predictions within a single framework. It is a heterogeneous collection of databases and data stores. The primary purpose of this data warehouse is to provide data management for user interfaces and other analytical functions for genome information and genome sequence annotation. Current user interfaces supported by this data warehouse include the Genome Channel, the Genome Catalog, the U.S. node of the Genome Database, and an SRS mirror of community databases. The information found in GIW includes comprehensive annotation for human and mouse genomic sequences and completed microbial genomes. While the genomic sequences themselves are available from NCBI, EBI, or DDBJ, the genome features, especially predicted genes and proteins, that can be inferred from each sequence are not being annotated at a rate that matches the rate of sequencing. As the world's knowledge base about genes, proteins, and their interrelationships continues to grow, new insights can be gained by analyzing and reanalyzing all existing data with a consistent, managed process. One function of the GIW is to provide automated operational support for a consistent annotation process that uses the Analysis Pipeline and its analysis tools to acquire this information.

GIW makes the assumption that computed and annotated links are not permanent. Since the underlying databases and knowledge change, results are likely to change. For example, archival data sets like the nonredundant database (NR) at NCBI continue to grow and change, so that a new BLAST analysis of a specific gene model can identify additional proteins with good similarities. As the knowledge base about genes grows, gene modeling methods continue to be refined and improved, providing the impetus for recalculating gene predictions. Libraries of BAC ends and repetitive sequences continue to grow and can provide new insights when established sequences are reexamined. The GIW supports the rerunning of annotation in order to provide researchers with good information and insight that was not available at the time a sequence was first published.

A significant challenge of GIW is to reanalyze existing sequences in a timely fashion while maintaining currency of the underlying archival data from legacy databases. Many of these critical, underlying archival databases do not have a very robust update mechanism; for example, new and modified sequences from NCBI must be recognized and processed and should not be confused with any previous versions of sequences or contigs. Changes to underlying databases may occur during an analysis cycle. To maintain consistency over all sequences, we need to create analysis versions or epochs that use a consistent archival dataset. Another challenge is to present rapidly evolving information to the user in a way that provides some consistency in navigation and retrieval of data. One major challenge is to continue to develop flexible data structures in biology that can adapt to the evolving understanding of how biological entities relate to each other and new desired user functions.

The GIW primarily uses Oracle 8i to store and manage new experimental and computational data that is created at ORNL. Archival data from other legacy databases in GIW is stored and managed with SRS, flat files, GDB (Sybase-backed), XML files, and others; this archival data must be stored and updated to facilitate the value-added computational cross-linking and annotation.

A few of the completed user interfaces to these GIW databases can be accessed through http://genome.ornl.gov/.

(Research sponsored by the Office of Biological and Environmental Research, USDOE under contract number DE-AC05-96OR22464 with Lockheed Martin Energy Research Corp.)


89. BiSyCLES: Biological System for Cross-Linked Entries Search

Michael Brudno, Igor Dralyuk, Sylvia Spengler, Manfred Zorn, and Inna Dubchak

Center for Bioinformatics and Computational Genomics, National Energy Research Scientific Computing Center, Lawrence Berkeley National Laboratory, Berkeley, CA 94720

ildubchak@lbl.gov

We have developed a prototype of an object-oriented search system that allows researchers in biological sciences and medicine to combine the information found in diverse databases. The information a researcher needs is in most cases not to be found in any single source, but is divided among several. Further, these sources are often cross-linked, either by the curators of the databases or by the contributing authors. The vast amount of information available to the researcher makes manual searches extremely time-consuming, so biologists require the assistance of bioinformatics specialists to process and retrieve the relevant information. The number of projects addressing similar concerns, of which TAMBIS and K2 are just two examples, underscores the importance of this problem.

BiSyCLES possesses features that should be both of immediate use to biomedical researchers and of interest to the bioinformatics community: it is easy to use, flexible, and extendable. We built an intuitive user interface, accessible through the World Wide Web. BiSyCLES allows user-defined queries to be executed across all of the recognized databases. A simple query syntax, similar to AltaVista™'s, makes our program easy to learn and use. Furthermore, the set of databases is easily extendable because of our use of inheritance and other object-oriented techniques. The prototype works with the two databases most often used in biological research, GenBank and Medline, and will be extended in the near future to include others.
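
A sketch of the object-oriented extension point (class and method names invented for illustration): each database is a subclass implementing search(), so adding a new source amounts to adding a class.

    # Minimal sketch of extensibility through inheritance.
    class Database:
        name = "base"
        def search(self, query):
            raise NotImplementedError

    class GenBank(Database):
        name = "GenBank"
        def search(self, query):
            return [f"GenBank hit for {query}"]  # stand-in for a real lookup

    class Medline(Database):
        name = "Medline"
        def search(self, query):
            return [f"Medline citation for {query}"]

    def search_all(query, sources=(GenBank(), Medline())):
        return {db.name: db.search(query) for db in sources}

    print(search_all("BRCA1 AND repair"))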


90. Updated ASDB: Database of Alternatively Spliced Genes

I. Dralyuk, M. Brudno, M.S. Gelfand1, S. Spengler, M. Zorn, and I. Dubchak

National Energy Research Scientific Computing Center, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA and 1State Scientific Center for Biotechnology NIIGenetika, Moscow, 113545, Russia

ildubchak@lbl.gov

Version 2.1 of ASDB consists of two divisions: ASDB(proteins), which contains 1922 amino acid sequences, and ASDB(nucleotides), with 2486 genomic sequences. ASDB(nucleotides) was developed in 1999, while ASDB(proteins) was updated with the latest data from SwissProt and improved clustering procedures. The database can be accessed on the Web.

SwissProt uses two formats to describe alternative splicing. Accordingly, protein sequences were selected from SwissProt using a full-text search for the words "alternative splicing" and "varsplic".

In order to group proteins that could arise by alternative splicing of the same gene, we developed a clustering procedure. Two proteins were linked if they had a common fragment of at least 20 amino acids, and clusters were initially defined as maximal connected groups of linked proteins. Each cluster was represented by a multiple alignment of its members.
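
The linking rule can be sketched with a k-mer index and union-find (toy sequences; a shared 20-mer is equivalent to a common identical fragment of at least 20 residues):

    # Link proteins sharing a 20-residue fragment; clusters are the
    # connected components of the resulting graph.
    from collections import defaultdict

    def clusters(proteins, k=20):
        index = defaultdict(set)            # k-mer -> proteins containing it
        for name, seq in proteins.items():
            for i in range(len(seq) - k + 1):
                index[seq[i:i + k]].add(name)

        parent = {n: n for n in proteins}   # union-find over protein names
        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x
        def union(a, b):
            parent[find(a)] = find(b)

        for members in index.values():
            members = list(members)
            for other in members[1:]:
                union(members[0], other)

        groups = defaultdict(list)
        for n in proteins:
            groups[find(n)].append(n)
        return list(groups.values())

    toy = {"P1": "A" * 25, "P2": "X" * 5 + "A" * 20, "P3": "W" * 30}
    print(clusters(toy))  # -> [['P1', 'P2'], ['P3']] (order may vary)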

It turned out that some clusters were chimeric, in the sense that they contained members of multigene families, but not alternatively spliced variants of one gene. Therefore the multiple alignments were subject to additional analysis aimed at detection of chimeric clusters.

This processing covers the cases in which alternatively spliced variants are described in separate SwissProt entries. Other ASDB records, originating from SwissProt entries with the "varsplic" field in the feature table, usually provide information on the variable fragments of the several proteins that result from the alternative splicing of a single gene. ASDB(proteins) entries are therefore marked with different symbols to allow easy differentiation among the three types: proteins that are part of ASDB clusters and the corresponding multialignments, proteins for which information on different variants is in the associated SwissProt entries, and proteins for which information on the variants is not available at the present time. ASDB contains internal links between entries and/or clusters, as well as external links to Medline, GenBank, and SwissProt entries.

The ASDB(nucleotides) division was generated by collecting all GenBank entries containing the words "alternative splicing" and by further selecting those entries that contain complete gene sequences (all CDS fields are complete, i.e., they do not have continuation signs).


91. Splice Site Recognition

Terry Speed and Simon Cawley

University of California at Berkeley, Berkeley, CA 94720-3860

With the increasing abundance of completely sequenced genomes, the automation of genome annotation has become an important research goal. We focus on the classification of splice sites in eukaryotic genes, an integral subtask of most successful genefinding programs. In particular, we focus on probabilistic models for splice sites, since they can be readily incorporated into probabilistic genefinders without having to worry about how to weight the evidence of splice site classifiers. We make use of variable length Markov chains (VLMCs), also known as context models. VLMCs can capture long-range dependencies in splice sites without the exponential increase in the number of parameters encountered with regular Markov models. We compare these VLMCs with existing splice site recognition methods, both on the stand-alone problem and within PfParser, a hidden Markov model genefinding program for Plasmodium falciparum (a malaria parasite).
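
A sketch of VLMC scoring (probabilities below are toy values; real models are trained from aligned splice sites): the next-base distribution is chosen by the longest stored context matching the preceding bases, so parameters grow only where the data support deeper contexts.

    # Score a sequence under a context model, backing off from the longest
    # matching context to the empty context.
    from math import log

    contexts = {                    # context -> P(next base); toy values
        "":   {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
        "G":  {"A": 0.10, "C": 0.10, "G": 0.30, "T": 0.50},
        "AG": {"A": 0.05, "C": 0.05, "G": 0.70, "T": 0.20},
    }

    def log_prob(seq, contexts, max_order=2):
        total = 0.0
        for i, base in enumerate(seq):
            for k in range(min(max_order, i), -1, -1):  # back off
                ctx = seq[i - k:i]
                if ctx in contexts:
                    total += log(contexts[ctx].get(base, 1e-4))
                    break
        return total

    print(log_prob("AGGTAAGT", contexts))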


92. Refreshing Curated Data Warehouses Using XML

Susan B. Davidson, Hartmut Liefke, and G. Christian Overton

Center for Bioinformatics, University of Pennsylvania, Philadelphia, PA 19104-6389

susan@cis.upenn.edu

The process of building a new database relevant to some field of study in biology involves transforming, integrating, and cleansing multiple external data sources, as well as adding new material and annotations. Such databases are commonly called curated warehouses (or materialized views) because they are derived from other databases with value added. Building them entails two primary problems:

1) specifying and implementing the transformation and integration from the underlying source databases to the view database.

2) automating the refresh process.

Previously, we have reported on the development of the Kleisli system for implementing data transformation and integration (the first problem). In this abstract, we focus on how XML can be used to solve the second.

XML is a "self-describing" or semi-structured data format that is increasingly being used for data exchange. More recently, XML query languages and storage techniques have been proposed that enable its use in data warehousing; we study the problem of using XML to detect and propagate updates. Note that determining how the underlying data sources have changed is a complicated problem, because biomedical databases propagate their updates in one of three ways:

a) Producing periodic new versions;

b) Timestamping data entries; and

c) Keeping a list of additions and corrections; each element of the list is a complete entry.

We have developed efficient "diff" techniques for comparing old versions of entries with updated versions, producing the minimal updates in XML. Using these minimal updates, we show that, for a large class of warehouse definitions, the curated warehouse can be incrementally updated rather than recomputed from scratch.
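
Over flat field/value pairs the idea can be sketched as below (real XML diffing must also handle nesting and ordering, and the operation vocabulary here is invented):

    # Compute a minimal set of updates turning dict `old` into dict `new`.
    def diff(old, new):
        ops = []
        for key in old.keys() - new.keys():
            ops.append(("delete", key, None))
        for key in new.keys() - old.keys():
            ops.append(("insert", key, new[key]))
        for key in old.keys() & new.keys():
            if old[key] != new[key]:
                ops.append(("update", key, new[key]))
        return ops

    old = {"id": "U00001", "desc": "putative kinase", "len": "1020"}
    new = {"id": "U00001", "desc": "MAP kinase", "len": "1020", "ref": "10801"}
    print(diff(old, new))
    # -> [('insert', 'ref', '10801'), ('update', 'desc', 'MAP kinase')]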


93. Genome-Scale Protein Structure Prediction in Prochlorococcus europae Genome

Ying Xu, Dong Xu, Oakley H. Crawford, J. Ralph Einstein, and Ed Uberbacher

Computational Biosciences Section, Life Sciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831-6480

xyn@ornl.gov

The goal of this pilot project is to assign the maximum amount of structural information to proteins, computationally identified from genes, of the Prochlorococcus europae genome, using a combination of existing methods. Proteins are first classified into four categories: (1) proteins having a high level (> 40%) of sequence similarity to their homologs in PDB, as identified by BLAST searches; (2) proteins having a medium level (25-40%) of sequence similarity to their homologs in PDB, as detected by PSI-BLAST and (super-)family-specific profiles such as HMM models; (3) proteins having a low level (< 25%) of sequence similarity to their homologs in PDB, as detected by threading methods; and (4) proteins having no homologs in PDB, as determined by threading and statistical analysis. For each protein of the first class, our prediction system applies MODELLER and SWISS-MODEL to generate a few all-atom structure models. Structure models are generated similarly for proteins of the second class, after some refinement of the BLAST-generated alignment based on information extracted from HMM models, active site/motif search results, residue-residue contact patterns, etc. The initial alignments for proteins of the third class are generated by threading methods, including our own program PROSPECT, and refined in a similar fashion. Loop regions are first modeled using mini-threading methods; all-atom models are then generated using MODELLER, SWISS-MODEL, and CNS, based on the threading alignments and modeled loop regions. A combined method of threading and statistical analysis is used to determine whether a protein has a new structural fold. Instead of attempting to generate full 3D structures for proteins of class 4, our prediction system searches for possible active sites and predicts structural motifs using the local threading option of PROSPECT. For each prediction, the system assigns a confidence value based on our performance analysis on a benchmark data set. Preliminary prediction results will be presented.
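
The four-way triage can be summarized in a few lines (thresholds from the abstract; the function itself is an illustration, not the actual system):

    # Route a protein to a modeling strategy by percent identity to its
    # best PDB homolog.
    def classify(identity, has_pdb_homolog):
        if not has_pdb_homolog:
            return 4   # motif/active-site search via local threading
        if identity > 40:
            return 1   # comparative modeling (MODELLER, SWISS-MODEL)
        if identity >= 25:
            return 2   # profile/HMM-refined alignment, then modeling
        return 3       # threading (e.g., PROSPECT) supplies the alignment

    assert classify(62, True) == 1
    assert classify(30, True) == 2
    assert classify(18, True) == 3
    assert classify(0, False) == 4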

(Research sponsored by the Office of Biological and Environmental Research, USDOE under contract number DE-AC05-96OR22464 with Lockheed Martin Energy Research Corp.)


94. The Ribosomal Database Project II: Providing an Evolutionary Framework

James R. Cole, Bonnie L. Maidak, Timothy G. Lilburn, Charles T. Parker, Paul R. Saxman, Bing Li, George M. Garrity, Sakti Pramanik, Thomas M. Schmidt, and James M. Tiedje

Center for Microbial Ecology, Michigan State University, East Lansing, MI 48824

colej@msu.edu

The Ribosomal Database Project-II (RDP-II) provides rRNA-related data and tools important for researchers in a number of fields. These RDP-II products are widely used in molecular phylogeny and evolutionary biology, microbial ecology, bacterial identification, the characterization of microbial populations, and the study of the diversity of life. As a value-added database, RDP-II offers aligned and annotated rRNA sequence data, analysis services, and phylogenetic inferences derived from these data to the research community. These services are available through the RDP-II web site.

Release 7.1 (September 1999) contained more than 10,000 aligned and annotated small subunit (SSU) rRNA sequences. A special focus of this release was the identification and annotation of sequences from type material. Over 3,000 type sequences representing 636 distinct prokaryotic genera were included in release 7.1. These type sequences provide a mechanism for users to place new sequences in a taxonomic as well as phylogenetic framework. This release also included the introduction of an interactive assistant to help with the planning and analysis of T-RFLP experiments (TAP T-RFLP).

We are now preparing release 8, scheduled for March 2000. For this release we are enhancing the alignment to match a new set of alignment guidelines, to provide more consistent treatment of secondary-structure regions. The release will contain over 20,000 aligned prokaryotic SSU rRNA sequences, including the vast majority of prokaryotic SSU sequences available through GenBank release 114 (October 15, 1999). Initially, release 8 will be made available without manual curation of annotation information. We hope that the RDP advisory panel now being established will help us set new annotation standards that better serve our users within the available curation resources. Release 8 will also mark a turning point for RDP: it will be the first release since 1994 in which the delay between sequences becoming available through GenBank and being released in aligned format by RDP has actually decreased. We expect both the time to release and the frequency of releases to continue to improve through 2000.


145. BioInformatics Prototyping Language for Mapping, Sequence Assembly and Data Analysis

Bud Mishra

Professor of Computer Science and Mathematics, Courant Institute, New York University, 251 Mercer Street, New York, NY 10012, and Adjunct Professor of Bioinformatics, CSHL

Our research program is aimed at enhancing current bioinformatics tools within a coherent and unified framework in order to solve and explore several mathematical and computational problems arising in computational genomics. Our initial primary research activity centers on the algorithmic tools necessary to create high-resolution, high-accuracy, sequence-ready, fully verified reference physical maps based on our earlier Bayesian approach to mapping. Subsequent activities focus upon sequence verification and anchoring tools, map and sequence analysis tools, and tools for detecting polymorphisms and chromosomal aberrations. Instead of developing each tool ab initio, the chosen strategy is to embed the building blocks of the interdependent tool set in a high-level programming language that supports a fast, robust, free-format database and fast access to powerful program libraries supporting string algorithms, statistical subroutines, map manipulation routines, image processing routines, visualization widgets, and database tools.

Department of Energy 2000-2003. 25-74100-F1799.


151. Commercialization of the GRAIL EXP™ Gene Discovery System

Doug Hyatt, Morey Parang, and Ed Uberbacher

Genome Informatics Corporation, 1020 Commerce Park Drive, Oak Ridge, Tennessee 37831

 

The sequence of long contiguous regions of human genomic DNA will soon be generated at a rate of several million bases per day. Many genes are embedded in these sequences, and no currently available gene-finding program is capable of properly parsing these genes (finding where one gene ends and another begins) and accurately predicting their structure. Only by including experimental information from EST (expressed sequence tag) databases and databases of full-length cDNAs can the proper extraction of gene models from long genomic DNA sequences be carried out in an efficient, accurate, and automated way. Technology developed at Oak Ridge National Laboratory (GRAIL-EXP) combines pattern recognition and EST information to identify, model, and properly parse genes from long stretches of genomic DNA sequence in a manner superior to other gene modeling systems. Genome Informatics Corporation (Genomix) has licensed this unique technology and, through a number of additional technical developments funded in this Phase I SBIR, has built a robust commercial product based on GRAIL EXP.

At the outset of this SBIR, GRAIL EXP was a research code, not available as a robust, well-structured, documented, and user-friendly package that could be marketed to pharmaceutical and biotechnology companies. Significant restructuring and performance improvements were needed, as well as a graphical user interface. To accomplish our objectives, we have completed the following specific improvements to GRAIL EXP: (1) restructured the GRAIL EXP system with much higher modularity, (2) developed a client-server architecture for the system, (3) developed an application programming interface for software access to GRAIL EXP, (4) developed several strategies for making the EST database search and alignment portion of the program more efficient and manageable in different computational environments, (5) provided a comprehensive complete-cDNA database to assist the gene modeling algorithms, (6) provided mechanisms in the code for customers to access and include local proprietary cDNA or EST databases in the analysis, and (7) constructed a Java graphical user interface for the system.

*GRAIL and GRAIL EXP are trademarks of UT-Battelle, LLC and Genome Informatics Corporation, respectively. (Research sponsored by the Office of Biological and Environmental Research, USDOE under SBIR grant number DE-FG02-99ER82794 with Genome Informatics Corporation.)


154. A Simulation Extension of a Workflow-based LIMS

Gary Lindstrom, T. Richard Bogart, Peter Cartwright, and William Delaney

Cimarron Software, Inc., 175 S. West Temple St., Suite 530, Salt Lake City, UT 84101

gary@cimsoft.com

High-throughput molecular biology laboratories critically depend upon cost-effective management of complex workflows, including systematic planning and optimization of resource utilization. This SBIR Phase II project is developing simulation software that models the performance of laboratory workflows under varying scenarios. The software is unique in that it derives its workflow model and configuration parameters from the real laboratory workflow, as managed by its operational LIMS, and can deliver concrete guidance for optimized performance of the real workflow. Broadly speaking, there are two principal insights to be gained by "what if" experiments using the simulator. The first is capacity analysis, whereby resource deployment plans and policies can be investigated for their effect on overall laboratory productivity. The second is that the relative priorities of competing projects in the laboratory workflow can be assessed for their aggregate impact on individual project milestones. For example, suppose a laboratory manager is asked to take on the production demands of an additional project. By simulation studies, the manager can quantitatively assess the new project's impact on deadline fulfillment for existing projects. Similarly, the consequences of adjusting priorities for a current project mix can be quantified. The decisions resulting from these investigations can be deployed to laboratory operations in the form of task selection rules guiding batch selection at each station in the workflow. The simulator models detailed properties and management policies concerning all laboratory durable resources, such as laboratory instruments, staff, and computer processes, with accurate modeling of capacities, functional capabilities, and availability cycles -- both scheduled (shifts, maintenance) and unscheduled (breakdown and repair). Modeling of non-durable resources, e.g., consumables such as reagents, is another possibility. Customers of Cimarron Software are helping shape and evaluate the simulation system, which is expected to be commercially available early in 2001.
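
As a toy illustration of capacity analysis only (not Cimarron's simulator; the station names and numbers are invented), the throughput of a serial workflow is bounded by its bottleneck station:

    # Estimate workflow throughput from per-station service times and
    # instrument counts, and identify the bottleneck.
    def throughput_per_hour(stations):
        """stations: list of (name, minutes_per_batch, instrument_count)."""
        rates = {name: count * 60.0 / minutes
                 for name, minutes, count in stations}
        bottleneck = min(rates, key=rates.get)
        return rates[bottleneck], bottleneck

    baseline = [("prep", 30, 2), ("thermocycle", 120, 4), ("detect", 45, 3)]
    rate, station = throughput_per_hour(baseline)
    print(f"{rate:.1f} batches/hour, limited by {station}")

    # "What if" experiment: add one thermocycler and re-evaluate.
    rate2, _ = throughput_per_hour([("prep", 30, 2), ("thermocycle", 120, 5),
                                    ("detect", 45, 3)])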


157. Our Vision for the New Protein Data Bank

Helen Berman1, John Westbrook1, Kyle Burkhardt1, Zukang Feng1, Shri Jain1, Rachel Kramer1, Bohdan Schneider1, Christine Zardecki1, Peter Arzberger2, Phil Bourne2, John Badger2, Helge Weissig2, Gary L. Gilliland3, Phoebe Fagan3, Diane Hancock3, Narmada Thanki3, and Gregory B. Vasquez3

1Rutgers University, Piscataway, NJ, USA, 2San Diego Supercomputer Center, University of California, San Diego, CA, USA, and 3National Institute of Standards And Technology, Gaithersburg, MD, USA

gregory.vasquez@nist.gov

On October 1, 1998, the Research Collaboratory for Structural Bioinformatics (RCSB) became responsible for the management of the Protein Data Bank (PDB). The RCSB members are Rutgers, the State University of New Jersey; the San Diego Supercomputer Center of the University of California, San Diego; and the National Institute of Standards and Technology. The vision of the RCSB (http://www.rcsb.org/pdb/) is to enable new science by providing accurate, consistent, and well-annotated structure data via the application and development of modern information technology. Data are deposited and processed by the RCSB using an integrated data-processing system called ADIT (the AutoDep Input Tool). ADIT provides rapid and reliable data processing, and is also being used to revisit all existing structures in the PDB to create a more uniform archive. The RCSB has also developed a query and reporting interface to search across the PDB archive. Searches and reports can be generated for single or multiple structures. As the quality of the data improves, the reliability of query results will improve. These systems, and plans for extending the capabilities of the new PDB, will be described. This project is funded by the National Science Foundation, the Department of Energy, and two units of the National Institutes of Health: the National Institute of General Medical Sciences and the National Library of Medicine.


