Milestone 1: Develop Techniques to Determine the Genome Structure and Functional Potential of Microbes, Plants, and Microbial Communities
Research Highlights for Milestone 1: Sequences, Proteins, Molecular Complexes
GTL Milestone 1 Has Two Distinct Components:
- Component A. Microbial Sequences and Protein Characteristics
- Component B. Molecular Complexes
Component A. Microbial Sequences and Protein Characteristics
Background and Science Needs
Proteins are the chemically and physically active products of virtually all genes. Highly dynamic and shifting in amount, modification state, higher-order association, and subcellular localization, proteins carry out the primary functions of a cell in response to intracellular and extracellular signals.
For a systems understanding of microbes, we first must understand the panoply of proteins the genome is capable of producing. GTL’s first challenge in studying mission-relevant microbes and microbial communities is to determine the system’s genetic makeup and the extent and patterns of genetic diversity. This challenge is especially pressing when the functions of many identified coding genes are unknown, the microbes cannot be cultured, or only gene sequence is in hand (e.g., in metagenomic experiments, which determine the genetic sequence of a whole community of microbes).
Unknown genes are the first target. With a mature database of thousands of microbes available within a decade, comparative genomics, phylogenetic analysis, and sophisticated computational annotation will provide an increasingly complete set of gene functional assignments. In the interim and to reach that end state, we must be able to perform functional annotations based on information from proteins produced from sequence and analyzed biophysically and biochemically in vitro. GTL’s ultimate goal, however, goes beyond simple assignment to achieving a mechanistic structural and functional understanding of proteins and molecular machines that can form the basis for comprehensive and predictive systems models.
The availability of gene sequence and proteins allows the generation of various affinity reagents. Development of affinity methods and reagents from produced proteins will open the door to identifying and tracking microbes and specific proteins in complex and dynamic microbial systems. Affinity reagents also can be used to manipulate (activate or inactivate) proteins, capture and track them, and determine their relative locations through a variety of sensitive analytical methods for understanding and visualizing protein structure, function, and behavior. Specific milestone objectives are set forth below.
- Genome Sequences. Develop methods for sequencing uncultivated microbes and microbial communities and identifying the extent and patterns of genetic diversity and evolution, including:
- Sequence-assembly methods.
- Single-cell in situ sequencing for verification and environmental experimentation.
- Protein Characteristics. Develop methods and concepts to understand the range and characteristics of proteins encoded in genomes, including:
- Refined computational annotation for primary gene assignment and putative protein function.
- Advanced comparative analysis and methods based on evolutionary relationships to understand the functions of newly discovered genes and proteins using the comprehensive GTL Knowledgebase.
- Biophysical and biochemical analyses of proteins produced directly from genome sequences for more rigorous assignment of gene function and as a starting point for mechanistic understanding of microbial capabilities, function, and control. This capability provides a cost-effective and rapid alternative to culturing for determining the functions of hypothetical and unknown genes.
- Application of these analytical capabilities to genetically modified proteins to assist in deriving design principles and optimizing microbial and protein function.
- Affinity methods and reagents for locating and analyzing proteins and complexes outside living cells and dynamically inside living cells and for identifying and tracking microbes and specific proteins in complex microbial systems.
Computation Needs
Computational challenges in characterizing the composition and functional capability of microorganisms range from “simple” data management to complex data analysis, integration, and use. New algorithms for DNA sequence assembly, as well as better use of current state-of-the-art methods and annotation, will be required to analyze multiorganism sequence data; new modeling methods will be needed to predict the behavior of microbial communities. Computational research must develop methods to
- Deconvolute mixtures of genomes sampled in the environment and identify individual microbial genomes.
- Facilitate multiple-organism, shotgun-sequence assembly.
- Improve comparative approaches to microbial-sequence annotation and use them in conjunction with data generated by high-throughput experimentation to more accurately assign functions to genes and proteins.
- Accomplish pathway reconstruction from sequenced or partially sequenced genomes to evaluate the combined metabolic capabilities of heterogeneous microbial populations.
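As a rough illustration of the last item above, the following minimal sketch scores how completely a pathway is covered by each genome's annotated functions and by the pooled community. It is an assumption-laden toy, not a GTL method: the pathway definition, the step identifiers, the genome names, and the function names `pathway_completeness` and `community_capabilities` are all hypothetical, and real pathway reconstruction would also have to handle alternative enzymes, partial sequences, and annotation uncertainty.

```python
from typing import Dict, Set


def pathway_completeness(required: Set[str], annotated: Set[str]) -> float:
    """Return the fraction of a pathway's required steps found in an annotation set."""
    if not required:
        return 0.0
    return len(required & annotated) / len(required)


def community_capabilities(
    pathways: Dict[str, Set[str]],
    genomes: Dict[str, Set[str]],
) -> Dict[str, dict]:
    """Score each pathway per genome and for the pooled community.

    `pathways` maps a pathway name to its required steps (hypothetical IDs).
    `genomes` maps a genome name to the functions annotated in it; a
    partially sequenced genome simply contributes fewer functions.
    """
    # Pool every annotated function across all community members.
    pooled: Set[str] = set().union(*genomes.values())
    report = {}
    for name, required in pathways.items():
        report[name] = {
            "per_genome": {
                genome: pathway_completeness(required, functions)
                for genome, functions in genomes.items()
            },
            "community": pathway_completeness(required, pooled),
        }
    return report


# Hypothetical toy data: neither genome alone encodes the whole pathway.
toy_pathways = {"toy_pathway": {"stepA", "stepB", "stepC", "stepD"}}
toy_genomes = {
    "genome_1": {"stepA", "stepB"},
    "genome_2": {"stepC", "stepD", "stepX"},
}

toy_report = community_capabilities(toy_pathways, toy_genomes)
print(toy_report["toy_pathway"])
```

In this toy case each genome covers half of the pathway while the pooled community covers all of it, which is the kind of combined capability that analysis of a heterogeneous population is meant to expose.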
Component B. Molecular Complexes
Background and Science Needs
Most proteins do not act alone but instead are organized into molecular complexes (machines) that carry out activities needed for metabolism, communication, growth, and structure. GTL’s first milestone includes the creation of capabilities for comprehensively identifying, characterizing, and beginning to understand multiprotein complexes. These studies will help build the essential knowledgebase, and the stage will be set for linking proteome dynamics and architecture to cellular and community functions.
Identifying and characterizing multiprotein complexes on a genome-wide scale will require new tools and research strategies designed to increase throughput, reliability, accuracy, and sensitivity. While RNA measurements, such as those from microarrays, can give us a notion of which machines might form, the importance of understanding post-transcriptional and post-translational regulation requires direct knowledge of proteins and their interactions. Also, new tools for characterizing these complexes must bridge current size and resolution gaps between the high-resolution technologies for studying single proteins and those suitable for very large protein assemblies and cellular ultrastructures that are more amenable to just-emerging nanoscale structural techniques.
An initial target for GTL is to develop a suite of methods to isolate, identify, and characterize all essential protein complexes in a microbial system. Currently, only a few of the most stable and common protein complexes are well characterized, but data suggest that hundreds, if not thousands, of other complexes operate together to carry out cellular functions. Many important associations may be less stable, less abundant, and more dynamic. The near-term challenge is to develop methods to analyze these more difficult complexes. These most demanding protocols can be supported in a comprehensive way only with a technically and scientifically robust infrastructure. Providing the necessary infrastructure and scaling up these capabilities in research centers will enable scientists to rapidly generate a draft protein-machinery map of a typical microbe of interest to DOE.
An important aspect of understanding the assembly, stability, and function of protein complexes is the high-throughput characterization of protein-protein proximity and interfaces within complexes and between interacting complexes. When coupled with other information about structure and interrelationships among proteins, this characterization will provide a comprehensive database for understanding spatial and temporal hierarchies in the assembly of protein complexes. Ultimately, this analysis will reveal the internal, transmembrane, and extracellular structure of cells and bring understanding of how assembly and disassembly of these complexes are organized and controlled. Data on coincident expression and cellular or subcellular localization can powerfully constrain possible functions for a given multiprotein complex. By coupling localization and colocalization information with genetic and biochemical data from diverse sources, scientists can postulate and then test the contributions of specific complexes to a cell’s survival and behavior. High-throughput implementation of new and existing technologies will be needed to achieve these goals.
Molecular Complexes.
Develop Capabilities for a Predictive Understanding of Protein Interactions and the Resulting Structure and Properties of Molecular Complexes
- Discover and define the repertoire of molecular interactions and multimolecular complexes. New methods will be required to isolate and analyze transient and rare complexes.
- Develop predictive methods to define experimental conditions favoring the occurrence of condition-dependent and transitory complexes to assist in their capture.
- Determine the structure of complexes, with localization of components and characterization of their reaction interfaces. Establish high-throughput methods to define the protein-protein interfaces within and between complexes.
- Determine the cellular and subcellular localization and colocalization of protein complexes, including their conditional and temporal variations. Define physical relationships among protein complexes and integrate this information with candidate functions.
- Develop principles, theory, and predictive models for the structure, function, assembly, and disassembly of multiprotein complexes. Test predictions of these models in experimental systems and apply them to optimization of functions for applications.
- Correlate information about multiprotein complexes with relevant structural-fold data generated in the NIH Protein Structure Initiative to better understand the geometry, organization, and function of these protein machines.
Computation Needs
Identifying and characterizing life’s multiprotein complexes involves substantial computational demands, ranging from sophisticated data analysis to atomic-scale simulations of protein interactions. Meeting these needs will require the development of new algorithms and databases and the use of high-performance computers.
- Adapt and develop databases and analysis tools for integrating experimental data on protein complexes measured with different methods under varied conditions.
- Develop novel approaches and methods for automatically identifying protein functional modules based on high-throughput genomic, proteomic, and metabolomic data.
- Develop algorithms for integration of diverse biological databases including transcriptome and proteome measurements, as well as functional and structural annotations of protein-sequence data to infer complex formation and function.
- Develop modeling capabilities for simulating multiprotein complexes and for predicting the behavior of protein complexes in cell networks and pathways.
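One simple way to picture the integration and module-identification items above is the sketch below. It pools co-isolation records from different experimental methods, keeps only protein pairs supported by at least a chosen number of independent methods, and reports the connected groups as candidate modules. Everything here is an illustrative assumption: the method labels, the protein identifiers, the `min_methods` threshold, and the function names `pair_support` and `candidate_modules` are hypothetical, and connected components are only one of many possible ways to group proteins.

```python
from collections import defaultdict
from itertools import combinations


def pair_support(observations):
    """Map each protein pair to the set of methods that observed it together.

    `observations` is an iterable of (method, proteins) records, where
    `proteins` is the set of proteins seen together in one experiment.
    """
    support = defaultdict(set)
    for method, proteins in observations:
        # Sorting gives every pair one canonical (a, b) ordering.
        for a, b in combinations(sorted(proteins), 2):
            support[(a, b)].add(method)
    return support


def candidate_modules(observations, min_methods=2):
    """Group proteins linked by pairs seen in at least `min_methods` methods."""
    adjacency = defaultdict(set)
    for (a, b), methods in pair_support(observations).items():
        if len(methods) >= min_methods:
            adjacency[a].add(b)
            adjacency[b].add(a)

    seen = set()
    modules = []
    for start in sorted(adjacency):
        if start in seen:
            continue
        # Depth-first traversal collects one connected component.
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        modules.append(component)
    return modules


# Hypothetical records; method names and protein IDs are placeholders.
toy_observations = [
    ("pulldown", {"P1", "P2", "P3"}),
    ("two_hybrid", {"P1", "P2"}),
    ("crosslink", {"P2", "P3"}),
    ("pulldown", {"P4", "P5"}),
    ("two_hybrid", {"P5", "P6"}),
]

print(candidate_modules(toy_observations))
```

Requiring agreement between methods trades sensitivity for reliability: in the toy data, P4, P5, and P6 are dropped because each of their pairings is seen by only one method, which is also how a transient or rare complex could be missed.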