DOE Microbial Genome Program Report

Research Abstracts

Section 1: Sequencing and Analysis

**Sequencing the Genome of Nitrosomonas europaea, an Obligate Lithoautotrophic, Ammonia-Oxidizing Bacterium**

Daniel J. Arp, Alan B. Hooper,¹ Jane E. Lamerdin,² David Arciero,¹ Andre Arellano,² Karolyn Burkhart-Schultz,² Anne Marie Erler,² Norman Hommes, Martin G. Klotz,³ Jenny M. Norton,⁴ Warren Regala,² Luis Sayavedra-Soto, and Stephanie Stilwagen²

Botany and Plant Pathology; Oregon State University; 2082 Cordley; Corvallis, OR 97331-2902
541/737-1294, Fax: -3573, rpd@bcc.orst.edu

¹University of Minnesota
²Lawrence Livermore National Laboratory
³University of Louisville
⁴Utah State University

As part of the DOE initiative to explore the role of microorganisms in global carbon sequestration, the Joint Genome Institute intends to obtain the complete genomic sequence of the autotrophic nitrifying bacterium Nitrosomonas europaea. This organism is the most studied of the ammonia-oxidizing bacteria that participate in the biogeochemical N cycle. Nitrifying bacteria play a central role in the availability of nitrogen to plants and hence in limiting CO₂ fixation. The reaction catalyzed by these bacteria is the first step in the oxidation of ammonia to nitrate. Through this same first step, these bacteria also are important players in the treatment of industrial and sewage waste. Evidence suggests that ammonia-oxidizing bacteria contribute significantly to the global production of nitrous oxide (produced by the reduction of nitrite). N. europaea also is capable of degrading a variety of halogenated organic compounds, including trichloroethylene, benzene, and vinyl chloride. The ability of nitrifying organisms to degrade some pollutants may make these organisms attractive for controlled bioremediation in nitrifying soils and waters.

N. europaea is a Gram-negative member of the β-subdivision of the Proteobacteria, with a genome size of at least 2.2 Mb. The microbe can be transformed and deletion mutants engineered, allowing the study of genotype-phenotype relationships. To complete the sequence of the N. europaea genome, a whole-genome shotgun strategy is being used, similar to that employed successfully for dozens of bacterial genomes. The 8X genome coverage generated by the shotgun data is being supplemented with a scaffold of paired end sequences from clones in a low-copy-number fosmid vector. Shotgun data from this organism were assembled with PHRAP (Phil Green, University of Washington) and will progress through "auto-finishing" using software written by Matt Nolan (JGI-LLNL) and David Gordon (University of Washington) prior to human intervention in the assembly. Fingerprinting of a minimal spanning path of fosmids will be used to aid verification of the final assembly. A sequence-analysis pipeline, developed by Manesh Shah and Frank Larimer of Oak Ridge National Laboratory, is being used to define open reading frames (ORFs) and query public databases for protein-nucleotide similarities. Periodic lists of putative ORFs will appear on the Web site as the genomic coverage continues to grow. The raw sequence data also are directly queryable through the accompanying BLAST server or can be downloaded from the JGI ftp server.
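To give a feel for what 8X shotgun coverage of a genome of at least 2.2 Mb implies, the following minimal sketch estimates the number of reads required and the number of bases expected to be missed. It assumes an average read length of 500 bases (a value not stated in this abstract) and uses the standard Poisson (Lander-Waterman) approximation, in which the fraction of bases hit by no read is about e^(-coverage). The function names are illustrative only.

```python
import math


def reads_for_coverage(genome_bp: int, coverage: float, read_bp: int) -> int:
    """Smallest number of reads N such that N * read_bp / genome_bp >= coverage."""
    return math.ceil(coverage * genome_bp / read_bp)


def expected_unsequenced_fraction(coverage: float) -> float:
    """Poisson estimate of the fraction of bases covered by no read at all."""
    return math.exp(-coverage)


genome_bp = 2_200_000  # lower bound on genome size given in the text
coverage = 8.0         # shotgun coverage given in the text
read_bp = 500          # ASSUMPTION: average read length, not from the text

n_reads = reads_for_coverage(genome_bp, coverage, read_bp)
missed_bp = expected_unsequenced_fraction(coverage) * genome_bp

print(f"reads needed: {n_reads}")
print(f"bases expected to be missed: {missed_bp:.0f}")
```

Under these assumptions a few hundred bases would remain unsequenced by random reads alone, which is consistent with the need for the fosmid scaffold and finishing steps described above.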

This will be the second member of the β-subdivision to have been sequenced. The most-studied gene products in this organism are those involved in the oxidation of ammonia, principally the hydroxylamine oxidoreductase (HAO), ammonia monooxygenase (AMO), and the accompanying cytochromes that make up the electron-transport chain. We hope the genome sequence will reveal strong candidates for as-yet-unidentified proteins specific to the N-oxidation pathways unique to this organism. The nature and regulation of enzymes in the nitrite-to-nitrous oxide pathway also are of interest. The operon encoding the subunits of AMO is duplicated, and the nucleotide sequences of the two copies differ by only a single nucleotide. The gene that codes for HAO is present in three copies. The extent to which other genes are duplicated in the genome is not known, but determining it is one anticipated outcome of generating the genomic sequence of N. europaea. Because N. europaea is one of the few strictly autotrophic bacteria currently being sequenced, its genome sequence is expected to reveal the identity and number of genes required for and suited to autotrophy and possibly provide an indication of the basis for obligate autotrophy. The sequence will allow direct comparison to genes identified in another lithoautotrophic organism, Thiobacillus ferrooxidans, which derives its energy from the oxidation of iron or sulfur compounds. Comparison of the metabolic capabilities of this organism with those of photoautotrophs and other lithoautotrophs may reveal the range of capabilities that were lost or gained as N. europaea descended from its evolutionary ancestors.

Nostoc Genome Sequencing

Ronald M. Atlas

Department of Biology; University of Louisville; Louisville, KY 40292
502/852-3957, Fax: -0725, r.atlas@louisville.edu

An expert advisory panel met with Jane Lamerdin of the Joint Genome Institute (JGI) to select a strain of the heterocystous cyanobacterium Nostoc for genome sequencing. Based upon its relevance to carbon sequestration and the likelihood of providing significant new scientific information, the panel selected Nostoc punctiforme PCC 73102, ATCC 29133. This strain fixes nitrogen and carbon dioxide, forms symbiotic relationships, exhibits cell differentiation with the formation of motile hormogonia (a diagnostic characteristic of the genus Nostoc), has a complex life cycle, has established genetic transfer systems, and is divergent from other cyanobacteria being sequenced. DNA from N. punctiforme is being prepared by Jack Meeks for submission to JGI for sequencing. The advisory panel will work with JGI during the annotation phase and will participate in publication of the data. The panel consists of Ronald M. Atlas (Department of Biology, University of Louisville), Jack Meeks (Division of Biological Sciences, University of California, Davis), Malcolm Potts (Department of Biochemistry and Nutrition, Virginia Polytechnic Institute), Jeff Elhai (Department of Biology, University of Richmond), and Theresa Thiel (Department of Biology, University of Missouri, St. Louis).

The Complete Genome Sequence of Prochlorococcus

Sallie W. Chisholm

Departments of Civil and Environmental Engineering and Biology; Massachusetts Institute of Technology; 15 Vassar St. 48-425; Cambridge, MA 02139
617/253-1771, Fax: /258-7009, chisholm@mit.edu
http://web.mit.edu/chisholm/www

Prochlorococcus is a unicellular cyanobacterium that is very abundant in the temperate and tropical oceans. It has been shown to contribute 32 to 80% of the total photosynthesis in the world's oligotrophic oceans, the higher values being found in the Pacific. Thus, Prochlorococcus plays a significant role in the global carbon cycle and the regulation of the earth's climate.

Molecular phylogenies have shown that Prochlorococcus is closely related to marine Synechococcus, forming a single lineage within the cyanobacteria. Unlike Synechococcus, Prochlorococcus lacks phycobilisomes and contains divinyl chlorophyll a (8-desethyl, 8-vinyl chlorophyll a, or "chl a2") and divinyl chlorophyll b (chl b2) as its major photosynthetic pigments. These pigments enable it to absorb blue light more efficiently than Synechococcus at the low-light intensities and blue wavelengths characteristic of the deep euphotic zone.

We recently demonstrated that there are at least two ecotypes of Prochlorococcus, each of which is distinguished by its photophysiology and molecular phylogeny. One is capable of growth at irradiances where the other is not. We hypothesize that multiple ecotypes of Prochlorococcus coexist in all oceanic environments, alternating in dominance according to light gradients and seasonal mixing dynamics. We would expect to find, for example, that ecotypes adapted to low light are dominant at the base of the euphotic zone in stratified waters and those adapted to high light dominate at the surface. The ecotypes differ in other physiological properties besides light-harvesting efficiencies, and these too will play a role in regulating their distributions. Ultimately, a comparison of the complete genomes of these two ecotypes will provide valuable insights into the regulation of microdiversity in marine microbial systems.

Prochlorococcus is an ideal candidate for complete genome sequencing for a variety of reasons: (1) it is the smallest known phototroph and has a relatively small genome size (1.8 Mb); (2) it is widespread and abundant and is easily identified and enumerated in its environment using flow cytometry; (3) its unique photosynthetic pigment (divinyl chlorophyll a) makes its contribution to total photosynthetic biomass in natural communities easily assessed; (4) different ecotypes have been identified that are very closely related according to their 16S rRNA sequences but are physiologically distinct; and (5) we have an extensive culture collection of isolates from different oceans and environments.

We plan to work with scientists at the DOE Joint Genome Institute (JGI Prochlorococcus Web site) to obtain the entire genomic sequence of Prochlorococcus marinus (MED4), one of the ecotypes adapted to high light. Our role in the project is to supply Prochlorococcus DNA and to be a general source of information on the ecology and biology of the organism.

Sequencing the Large Linear Chromosome of Borrelia burgdorferi and a Strain of Clostridium

John J. Dunn and F. William Studier

Biology Department; Brookhaven National Laboratory; Bldg. 463, 50 Bell;
P.O. Box 5000; Upton, NY 11973-5000
631/344-3012, Fax: -3407, jdunn@bnl.gov
631/344-3390, Fax: -3407, studier@bnl.gov
www.genome.bnl.gov

In a program to explore possible improvements in the accuracy, speed, and efficiency of genome sequencing, we sequenced the large linear chromosome of Borrelia burgdorferi, the spirochete that causes Lyme disease. This 909,275-bp sequence is available on our Web site, along with a comparison of the same sequence determined independently by The Institute for Genomic Research (TIGR).

The Brookhaven National Laboratory (BNL) sequence was determined by random first-end and directed second-end sequencing of plasmid libraries of random chromosomal fragments, followed by primer walking using 12-mer primers generated by ligation of two hexamers on hexamer templates. The sequence assembly was confirmed and contigs were aligned by end sequencing a framework of ~35-kb fosmid clones, which spanned the entire sequence. The few remaining gaps were filled by polymerase chain reaction amplification from fosmid clones or genomic DNA. The sequence extends to the ends of the clones we obtained (which did not include the covalently closed ends of the chromosome) and lacks 404 bp at the left end and 249 bp at the right end of TIGR's sequence, which extends to the ends. The entire BNL sequence was determined at least once on each complementary strand.

The BNL and TIGR sequences are very similar, but there are some differences. The TIGR sequence contains seven copies of a 162-bp imperfect tandem repeat that occurs only twice in the BNL sequence. There are 86 other discrepancies, only some of which are in a few remaining areas of relatively low quality in the BNL sequence. In addition, the BNL sequence contains 65 ambiguities (reflecting different base pairs at the same position in different clones), and the TIGR sequence contains 43 ambiguities. For each ambiguity in either sequence, one of the ambiguous bases matches the base at that position in the other sequence. It seems likely that each DNA preparation used for cloning and sequencing has polymorphisms at the 0.01% level, with a similar level of polymorphism between the two DNA preparations.
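The "0.01% level" quoted above can be checked with simple arithmetic. The sketch below divides each count from the text by the 909,275-bp length of the BNL sequence; using that single length as the denominator for all three counts is a simplifying assumption, since the TIGR sequence is slightly longer.

```python
SEQUENCE_BP = 909_275  # length of the BNL sequence, from the text

# Counts taken directly from the text.
counts = {
    "BNL ambiguities": 65,
    "TIGR ambiguities": 43,
    "other BNL/TIGR discrepancies": 86,
}


def percent_of_sequence(count: int, length_bp: int) -> float:
    """Express a count of positions as a percentage of the sequence length."""
    return 100.0 * count / length_bp


rates = {name: percent_of_sequence(n, SEQUENCE_BP) for name, n in counts.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.4f}%")
```

All three values fall between roughly 0.005% and 0.01%, which is the order of magnitude cited in the paragraph.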

We are currently sequencing the genome of a Clostridium strain being studied at BNL as a possible bioremediation agent. This anaerobic, nitrogen-fixing spore former can convert water-soluble uranyl ion U(VI) to less soluble U(IV). Its circular genome is about 4 Mb, and no plasmids have been detected. More than 500 kb of edited unique sequence has been obtained so far. Clone libraries are being constructed in vectors we developed that allow an ordered set of nested deletions to be generated from either end of cloned fragments at least 10 kb long. These vectors were designed to allow sequencing and ordered assembly of both DNA strands in highly repeated regions such as those encountered in human DNA. In Clostridium, the vectors allow directed sequencing of particularly interesting areas by using nested deletions to fill in the framework generated by end sequencing. We expect to sequence the relevant U and N₂ reductases and identify most genes involved in intermediary metabolism.

This is a completed project.

DOE-Funded Microbial Genome Sequencing at The Institute for Genomic Research

Claire Fraser

The Institute for Genomic Research; 9712 Medical Center Dr.; Rockville, MD 20850
301/838-3500, Fax: -0209, cfraser@tigr.org
www.tigr.org

The Institute for Genomic Research (TIGR) is a not-for-profit research institute with interests in structural, functional, and comparative analysis of genomes and gene products in viruses, bacteria, archaea, and both plant and animal eukaryotes, including humans. Microbial genome-sequencing efforts at TIGR supported by the Department of Energy since 1995 have produced complete genome sequences for five organisms: Mycoplasma genitalium, Methanococcus jannaschii, Archaeoglobus fulgidus, Thermotoga maritima, and Deinococcus radiodurans. In addition, nine other DOE-funded microbial genome projects are in progress at TIGR, with an estimated completion date of 2001 for all work. In total, the DOE-funded microbial genome sequencing projects at TIGR represent nearly 33 million base pairs (Mb) of DNA and an estimated 30,000 microbial genes. The information generated in these projects is available from the TIGR Microbial Database.

The strategy that we use for whole-genome sequencing is called a "shotgun" method. In shotgun sequencing, the genome is sheared randomly into small pieces that are then cloned, sequenced, and reassembled to form a whole genomic sequence. With the shotgun approach, there is no need to develop a genetic or physical map of the genome before sequencing it; the sequence itself serves as the ultimate map. In large shotgun-sequencing projects, DNA fragments are assembled into a consensus sequence. Key to the success of the shotgun method is the availability of a truly random genomic DNA clone library and a powerful, accurate algorithm for reassembling the fragments into a complete genome. The basic approach for genome assembly is to compare all individual sequences to find overlaps and use this information to build a consensus sequence. Using new software developed at TIGR for large-scale genome sequencing projects, we have assembled the complete genomes of 12 microbial species to date.
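The overlap-and-merge idea described above can be illustrated with a toy example. The sketch below is not the TIGR assembly software; it is a minimal greedy assembler, written under the assumptions of error-free reads, a single strand, and no repeats, that repeatedly merges the pair of fragments with the longest suffix-prefix overlap. The function names and the 25-base "genome" are hypothetical.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that equals a prefix of b, or 0 if shorter than min_len."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge the best-overlapping pair of fragments until no overlap of at least min_len remains."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)


# Hypothetical 25-base "genome" sheared into four overlapping reads, given out of order.
genome = "ATGGCGTACGTTAGCCTAGGATCCA"
reads = [genome[12:22], genome[0:10], genome[17:25], genome[6:16]]

contigs = greedy_assemble(reads)
print(contigs)
```

Real assemblers must additionally cope with sequencing errors, reads from both strands, and repeated regions, which is why the paragraph stresses a truly random library and an accurate algorithm.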

The next step in whole-genome analysis is to identify all the predicted genes and search the translated protein sequences against protein sequences available in public databases. Because of the tremendous conservation in protein sequence among organisms throughout evolution, putative genes can be identified by sequence similarities.
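As a rough illustration of the first part of this step, the sketch below scans both strands of a DNA string for simple open reading frames, defined here as an ATG followed in frame by a stop codon. This is a deliberately naive stand-in, not the gene-prediction method used in these projects; the function names, the minimum-length cutoff, and the example sequence are assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 3):
    """Return (strand, start, end) tuples for ATG...stop ORFs.

    Coordinates are 0-based, end-exclusive, and measured on the strand that was
    scanned. min_codons counts the codons before the stop codon.
    """
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for pos in range(frame, len(s) - 2, 3):
                codon = s[pos:pos + 3]
                if start is None and codon == "ATG":
                    start = pos
                elif start is not None and codon in STOP_CODONS:
                    if (pos - start) // 3 >= min_codons:
                        orfs.append((strand, start, pos + 3))
                    start = None
    return orfs


# Hypothetical sequence containing one short ORF on the forward strand.
example = "CCATGAAACCCGGGTAGTT"
orfs = find_orfs(example)
print(orfs)
```

Each predicted ORF would then be translated and compared against protein databases; that similarity search is outside the scope of this sketch.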

The Minimal Gene Complement of M. genitalium

The Mycoplasma class consists of small wall-less bacteria that parasitize a wide range of hosts, including humans, animals, plants, insects, and cells in culture; they are believed to represent a minimalist life form, having yielded to selective pressure to reduce genome size and eliminate unnecessary genes. M. genitalium was selected as one of the first to be sequenced because it has the smallest genome of any known free-living organism. M. genitalium lives in a parasitic relationship with its primate hosts in ciliated epithelial cells of genitalia and respiratory tracts. Examining the makeup of the M. genitalium genome reveals much about the metabolic and biochemical capacity of this organism.

All genes necessary for life in M. genitalium are packaged in a 580,070-base pair (bp) circular chromosome. Genome analysis suggests that the M. genitalium genome contains about 470 genes (average size, 1040 bp), which make up 88% of the genome (on average, a gene every 1235 bp). This gene density is similar to that found in other microbial genome sequences. These data indicate that Mycoplasma's reduction in genome size has not resulted in increased gene density or decreased gene size.

A complement of genes involved in DNA maintenance, repair, transcription, translation, and cellular transport is present; however, no complete pathways for amino acid, fatty acid, purine, or pyrimidine biosynthesis were identified in M. genitalium. Comparison of the minimal M. genitalium genome to that of more complex organisms suggests that differences in genome content are reflected as profound differences in physiology and metabolic capacity. The reduction in M. genitalium's genome size is associated with a marked reduction in the number and components of biosynthetic pathways, thereby requiring the organism to use metabolic products from its hosts.

Perhaps one of the most surprising findings from whole-genome sequencing and analysis of M. genitalium is that about one-third of the predicted proteins identified in this organism displayed no sequence similarity to known genes from any other organisms. This means that, even for this simplest of free-living organisms, we still do not understand a considerable amount of its biology. Determining whether the unknown genes in M. genitalium are species specific or exhibit a more widespread phylogenetic distribution will be of interest.

Comparing the M. genitalium genome with those of other microorganisms from diverse habitats will provide insights into what constitutes a minimal set of genes necessary for a self-replicating organism as well as the mechanisms associated with changes in genome organization and content in nature. This information, in turn, will be useful for modifying and engineering organisms to perform specific biochemical tasks in the laboratory or the environment.

Genome Sequence of the Archaeon M. jannaschii

The archaea were discovered as a unique phylogenetic domain of life by Carl Woese in the 1970s using sequence data from the small subunit of ribosomal RNA as a biosystematic marker. M. jannaschii was the first representative of the archaeal domain to be completely sequenced. Isolated in 1982 from a deep-sea hydrothermal vent, M. jannaschii reduces carbon dioxide to methane as its primary energy-producing biochemical pathway. Because this organism thrives at deep-sea pressures and temperatures of 85°C and above, its genome should provide insights into how genomes and gene products survive and function under these extreme conditions. Understanding the genetic basis of methanogenesis biochemistry in the thermophilic, barophilic M. jannaschii will bring us closer to harnessing the unique biochemistry of methanogens as a source of renewable energy.

Analysis of the M. jannaschii genome sequence reveals that between 50 and 60% of its genes or gene products have no match to any other currently known gene sequence. In addition, initial attempts to map database-matched genes onto known biochemical pathways suggest that M. jannaschii's biochemistry and physiology are unique among cellular organisms. For example, certain enzymes associated with gluconeogenesis and the synthesis of pentose sugars for nucleotide biosynthesis, such as fructose 1,6-bisphosphate aldolase and fructose 1,6-bisphosphate phosphatase, are not found among the predicted genes in M. jannaschii. Whether other gene products have been recruited to serve the function of these missing genes or the genes cannot be detected by standard sequence similarity methods is not yet known.

Most genes involved in M. jannaschii's cellular-information processing (replication, transcription, and translation) are more similar to functionally equivalent counterparts in eukaryotes than to those in bacteria. On the other hand, M. jannaschii genes that are involved in energy production, cell division, and basic cellular metabolism are more like genes in bacteria. Further analysis of the M. jannaschii genome sequence, together with sequence from other members of the archaeal domain of life, will give additional insights into the evolutionary relationship among the prokaryotes.

Complete Sequence of the Thermophilic Archaeon A. fulgidus

Biological sulfate reduction is part of the global sulfur cycle, ubiquitous in the earth's anaerobic environments and essential to the workings of the biosphere. Growth by sulfate reduction is restricted to relatively few groups of prokaryotes; all but one of these are bacteria, the exception being the archaeal sulfate reducers in the Archaeoglobales. These organisms are unique in that they are unrelated to other sulfate reducers and they grow at extremely high temperatures, between 60 and 95°C. They can grow either organoheterotrophically (using a variety of carbon and energy sources) or lithoautotrophically on hydrogen, thiosulfate, and carbon dioxide. The known Archaeoglobales are strict anaerobes, most of which are hyperthermophilic marine sulfate reducers found in hydrothermal environments and in subsurface oil fields. High-temperature sulfate reduction by Archaeoglobus species contributes to deep subsurface oil well "souring" by producing iron sulfide, which causes corrosion of iron and steel in oil- and gas-processing systems.

The genome of A. fulgidus, the type strain of the Archaeoglobales, was sequenced to better understand the biology of this group of organisms. Genome analysis reveals a total of ~2400 genes; these include genes for sulfate reduction, a great diversity of electron transport systems, a large number of transporters with specificity for both organic and inorganic molecules, and β-oxidation of fatty acids. The information-processing systems and the biosynthetic pathways in A. fulgidus have counterparts in the archaeon M. jannaschii. However, the genomes of these two archaea indicate dramatic differences in the way these organisms sense their environment, perform regulatory and transport functions, and gain energy. Another interesting feature revealed by genome analysis is that A. fulgidus displays extensive gene duplication in comparison with other fully sequenced prokaryotes. This suggests that gene duplication has been an important evolutionary mechanism for increasing physiological diversity in the Archaeoglobales.

About 25% of the A. fulgidus genome encodes conserved genes with unknown biological function, two-thirds of which are shared with M. jannaschii. Another 25% of the A. fulgidus genome represents genes that are unique to this organism, indicating that there is substantial diversity among members of the archaea. As additional archaeal and bacterial genome sequences are completed, we may begin to define a core set of genes that are shared among prokaryotes and those that are unique to bacterial or archaeal species.

Thermotoga maritima

The Thermotogales are a group of nonsporeforming rod-shaped bacteria that represent the most thermophilic of the known organotrophic bacteria. The type strain Thermotoga maritima MSB8, isolated originally from geothermally heated marine sediment at Vulcano, Italy, has an 80°C optimum temperature for growth. T. maritima metabolizes many simple and complex carbohydrates including glucose, sucrose, starch, xylan, and cellulose. Xylan is a complex plant polymer that represents the most abundant noncellulosic polysaccharide in angiosperms, where it accounts for 20 to 30% of the dry weight of wood tissues. Cellulose is the most abundant biopolymer occurring in nature, estimated to account for 75 × 10⁹ tons of dry plant biomass annually. Both cellulose and xylan, through conversion to fuels (e.g., H₂), have major potential as renewable carbon and energy sources.

T. maritima is of evolutionary significance because small subunit ribosomal RNA (SSU rRNA) phylogeny has placed the bacterium as one of the deepest and most slowly evolving bacteria. To further elucidate its unique metabolic properties and evolutionary relationship to other microbial species, we sequenced the genome of T. maritima MSB8 using the whole-genome random sequencing method. The 1,860,725-bp T. maritima genome contains 1872 predicted coding regions, 54% (1005) of which have functional assignments and 46% (867) of which are of unknown function. Almost 7% of the predicted coding sequences in the T. maritima genome are involved in the metabolism of simple and complex sugars, a percentage more than twice that seen in other bacterial and archaeal species sequenced to date. Biosynthetic pathways for nine amino acids were identified in T. maritima, but the bacterium has an extensive system for the uptake of peptides from the environment.
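The summary figures in this paragraph are internally consistent, as the short check below shows. It uses only the numbers quoted in the text; the derived average spacing of coding regions is a back-of-the-envelope value that is not itself stated in the abstract.

```python
GENOME_BP = 1_860_725   # genome length, from the text
CODING_REGIONS = 1872   # predicted coding regions, from the text
ASSIGNED = 1005         # coding regions with functional assignments
UNKNOWN = 867           # coding regions of unknown function

# The two categories should account for every predicted coding region.
total = ASSIGNED + UNKNOWN

pct_assigned = round(100 * ASSIGNED / CODING_REGIONS)
pct_unknown = round(100 * UNKNOWN / CODING_REGIONS)

# Derived value (not in the text): average genome length per coding region.
bp_per_coding_region = GENOME_BP / CODING_REGIONS

print(total, pct_assigned, pct_unknown, round(bp_per_coding_region))
```

The percentages round to the 54% and 46% given above, and the genome carries roughly one predicted coding region per kilobase.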

Phylogenetic analysis of genes in the T. maritima genome has demonstrated that gene evolution may not give a true picture of organismal evolution; gene duplication, gene loss, and horizontal gene transfer probably account for many inconsistencies in single-gene phylogenies. The complete genome of T. maritima has, however, revealed a degree of similarity with the thermophilic archaea in terms of gene content and overall genome organization that was not previously appreciated. Of the sequenced bacteria, T. maritima has the highest percentage (24%) of genes that are most similar to archaeal genes. Some 81 of these genes are clustered in regions of the genome that range in size from 4 to 20 kb. Five of these regions have a composition substantially different from the rest of the genome, suggesting that lateral gene transfer has occurred between the thermophilic archaea and bacteria. In addition, repeat structures in T. maritima have been identified only in thermophiles, and 108 genes on the T. maritima genome have orthologues only in the genomes of other thermophilic bacteria and archaea. One explanation for the relatedness between thermophilic organisms seems to be the occurrence of lateral gene transfer.
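One simple way to look for regions whose composition differs from the rest of a genome is to compare the G+C fraction of fixed-size windows with the genome-wide value. The abstract does not say which compositional measure was used, so the G+C criterion, the window size, the threshold, and the function names below are all assumptions made for illustration.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in a DNA string (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(1 for base in seq if base in "GC") / len(seq)


def atypical_windows(genome: str, window: int, threshold: float):
    """Return (start, end, gc) for non-overlapping windows whose G+C fraction
    differs from the genome-wide fraction by at least threshold."""
    overall = gc_fraction(genome)
    hits = []
    for start in range(0, len(genome) - window + 1, window):
        frac = gc_fraction(genome[start:start + window])
        if abs(frac - overall) >= threshold:
            hits.append((start, start + window, frac))
    return hits


# Hypothetical AT-rich sequence with a 10-base GC-rich island in the middle.
toy_genome = "AT" * 20 + "GC" * 5 + "AT" * 20
hits = atypical_windows(toy_genome, window=10, threshold=0.5)
print(hits)
```

On real genomes the windows would be kilobases long and the threshold would be chosen statistically; other measures, such as codon usage, could be substituted for G+C content.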

Deinococcus radiodurans

Deinococcus radiodurans, originally discovered in food samples exposed to severe gamma irradiation, is the most radioresistant organism ever isolated. An important component of this resistance is the ability to repair damage to its own chromosomal DNA. D. radiodurans cultures exposed to 1.5 Mrad of radiation display a reduction in size of genomic DNA fragments corresponding to about 100 double-stranded breaks per genome. Typically, most prokaryotic and eukaryotic organisms cannot tolerate more than five double-stranded breaks per genome without reduced survival.

Within 8 to 10 hours after radiation exposure, the D. radiodurans genome is fully restored with no evidence of double-stranded breaks. During this repair time, cellular replication of D. radiodurans is arrested; after this 8- to 10-hour interval, the cells display 100% survival with no detectable mutagenesis of their completely restored genome. DOE's interest in D. radiodurans includes understanding its ability to withstand radiation, particularly as it relates to this organism's potential for bioremediation of toxic waste sites that contain radioactive isotopes.

The genome sequence of D. radiodurans is complete, and we have determined that the genome is composed of three chromosomes and a small plasmid. Inspection of the set of genes with similarity to DNA-repair enzymes has so far been inconclusive regarding radiation resistance; D. radiodurans does not appear to contain repair genes that would make it unique among other bacteria. However, a number of unique sequence elements have been identified that are being tested for their role in radiation resistance. These experiments, coupled with the high-throughput analysis of gene expression using microarray technology, should lead to a more complete understanding of this bacterium's gamma radiation resistance in the near future.

Shewanella putrefaciens: A Model Organism for Bioremediation

Shewanella putrefaciens is a bacterium involved in microbiologically influenced corrosion, anaerobic consumption of toxic organic pollutants, removal of toxic metals by sulfide precipitation, and removal of toxic metals and radionuclides by conversion to insoluble reduced forms. Whole-genome sequencing of S. putrefaciens will furnish the bioremediation community with detailed knowledge of metabolic pathways involved in all these processes, providing an excellent model system for manipulating organisms for remediation or control.

In addition, a complete genome sequence for S. putrefaciens will furnish important information on engineering specific regulatory mutants for bioremediation. For example, mutants that continue to metabolize anaerobically, even in the presence of oxygen, could be used to remove uranium (U⁶⁺) in dilute environments where oxygen is still present. S. putrefaciens grows both aerobically and anaerobically. In its anaerobic phase, it acts as a metal reducer. The potential of metal-reducing bacteria in pollutant removal is very high for both the short and long terms, especially for those iron reducers that are not inhibited by oxygen.

Two separate reports suggest that Shewanella spp. can donate electrons to chlorinated hydrocarbons, thus reductively dechlorinating toxic compounds by converting tetrachloromethane to trichloromethane. In addition, organisms such as S. putrefaciens, which can produce Fe²⁺, have potential to catalyze the reduction of toxic nitrates. Metals can be removed from solution via direct reduction by metal-reducing bacteria such as S. putrefaciens.

While iron and manganese are solubilized, other metals are converted to insoluble forms upon reduction. Of note are chromium (Cr⁶⁺) and uranium (U⁶⁺), both of which are soluble in the oxidized form but insoluble as the respective reduced species, Cr³⁺ and U⁴⁺. Reduction of U⁶⁺ has been demonstrated for S. putrefaciens and has been proposed as a mechanism for concentrating and thus removing radionuclide waste. As with uranium, the removal of toxic chromium should be possible using either intact cells or cell-free systems of the metal-reducing bacteria.

A complete genome sequence covering the genes for all these metabolic processes would accelerate bioremediation efforts targeting metals and radionuclides, chlorinated hydrocarbon pollutants, and toxic nitrates. We are midway through the closure process in the complete genome sequencing of S. putrefaciens. Random sequencing was completed in July 1998, and closure began in August 1998. Analysis of the assemblies suggests that the completed genome size will be about 5 Mb.

Preliminary observation of the gene content of this organism has shown similarities between S. putrefaciens and Vibrio cholerae in some role categories (small molecule biosynthesis, central intermediary metabolism) but differences in others (sugar metabolism). It will be interesting to examine these similarities and differences in light of the different ecological niches occupied by these organisms.

Chlorobium tepidum

The green sulfur bacteria (Chlorobiaceae) are formally classified as Gram-negative organisms. Members of this family are photoautotrophs that can generate chemical energy through an electron transport chain in the cytoplasmic membrane that is associated with a light-harvesting complex housed in a specialized organelle called the chlorosome. The components of this light-harvesting apparatus and some of its organizational structure are reminiscent of photosystems found in plant chloroplasts and, therefore, the evolutionary relationship of these prokaryotes to eukaryotic organelles is of interest. Chlorobium species also can fix CO₂, although the biochemical pathway used by these prokaryotes is distinct from the Calvin cycle found in higher plants.

C. tepidum initially was identified from a hot spring in New Zealand. This species is thermophilic with an optimum growth temperature of about 47°C. It has a genome size of 2.1 Mb with a G+C content of 56.5 mol%. C. tepidum was nominated for sequencing by DOE because of its photosynthetic capacity and its interesting phylogenetic position in the bacterial kingdom.

C. tepidum sequencing and closure have been completed. Genome annotation is under way and soon will be completed.

Caulobacter crescentus

Caulobacter crescentus is placed in the alpha-purple bacteria that also include Rickettsia, Rhizobium, Agrobacterium, and Brucella species. It is the most prevalent nonpathogenic bacterium in nutrient-poor freshwater streams. It is also found in marine environments. To facilitate location of nutrient sources, C. crescentus is motile and chemotactically competent during the swarmer phase of its life cycle. In its nonswarmer phase Caulobacter adheres to solid substrates such as rocks. It is a component of the organisms responsible for sewage treatment. Caulobacters are being modified for use as bioremediation agents for removing heavy metals from wastewater streams.

Caulobacter crescentus exhibits a well-studied developmental pattern, independent of environmental stress, with morphologically defined stages of the cell cycle. It has easily observable physical structures that define these specific cell cycle stages. Two major events in the C. crescentus cell cycle are used by researchers to elucidate fundamental processes required for development. These are the tight regulation of chromosomal replication and the temporally and spatially regulated biogenesis of the flagellum. The two processes are linked by a common transcriptional regulator that orchestrates the response of multiple cellular processes to the progression of the cell cycle.

The genome was electronically annotated at the end of the random sequencing phase; the data, along with the assembly files, were sent to Dr. Lucy Shapiro (Stanford University), Dr. Bert Ely (University of South Carolina), and Dr. Janine Maddock (University of Michigan), who are collaborating with us on final assembly and annotation of the genome. The project is now in the closure phase.

Pseudomonas putida

Sequencing of Pseudomonas putida KT2440 began in January 1999 as a joint effort between TIGR and a German consortium consisting of groups from MHH (Medizinische Hochschule Hannover, Hannover, Germany); GBF (Gesellschaft für Biotechnologische Forschung mbH, Braunschweig, Germany); DKFZ (Deutsches Krebsforschungszentrum, Heidelberg, Germany); and QIAGEN (QIAGEN GmbH, Hilden, Germany). The study is supported by grants from BMBF of Germany and the U.S. Department of Energy.

The genome sequence will be used for in-depth functional analyses including comparisons of genome structure and function with the related organism P. aeruginosa. Understanding structure and function of the P. putida genome will allow for its increased use in biotechnological areas, including the production of natural compounds, remediation of polluted habitats, and the use of strains to fight plant diseases.

The P. putida genome sequence is expected to be closed in the next few months. The number of libraries for scaffolding the genome, access to the genome sequence of P. aeruginosa, and the complementary functional studies being conducted by the German consortium should reduce chances of major assembly problems in the genome.

Geobacter sulfurreducens

The complete genome sequence of Geobacter sulfurreducens is being determined to better understand its genetic potential. G. sulfurreducens is an important member of a family (Geobacteraceae) of delta proteobacteria capable of oxidizing organic compounds, including aromatic hydrocarbons, to carbon dioxide with Fe(III) or other metals and metalloids, including U(VI), Tc(VII), Co(III), Cr(VI), Au(III), Hg(II), As(V), and Se(VI), serving as the terminal electron acceptor. This family is the dominant group of iron-reducing microorganisms recovered from a wide variety of aquifer and subsurface environments when both molecular and traditional culturing techniques are used. Geobacter plays a critical role in the biogeochemical cycling of carbon, iron, and other metals. Its genetics and physiology are a subject of intense study in part due to the importance that these processes can play in the remediation of contaminated anaerobic subsurface environments. The determination of the G. sulfurreducens genome is being accomplished using a random shotgun cloning approach to provide at least sixfold coverage of a 1-Mb genome followed by closure of remaining physical or sequence gaps. Searches of sequences and contigs from the early random phase of sequencing using the BLAST algorithm and database have produced high scores with low expect values, indicating significant homologies to proteins contained in the database. These include enzymes considered important to basic housekeeping functions such as tRNA synthetases and amino acid synthesis as well as those essential to other metabolic processes known to occur in G. sulfurreducens, including nitrogen fixation. A number of sequences have produced no significant alignments, indicating the likelihood of genes encoding novel functions. Of further significance has been the extension of N-terminal sequences previously obtained from cytochromes known to be important in dissimilatory iron reduction. Thus, the genome will provide information crucial to the further understanding of this important metabolic process.

The Comprehensive Microbial Resource

One of the challenges presented by large-scale genome sequencing efforts is the effective display of information in a format that is accessible to the laboratory scientist. Conventional databases offer the scientist the means to search for a particular gene, sequence, or organism but do little to display the vast amounts of curated information that are becoming available. TIGR has developed methods to effectively "slice" the vast amounts of data in the sequencing databases in a wide variety of ways, allowing the user to formulate queries that search for specific genes as well as to investigate broader topics such as genes that might serve as vaccine and drug targets.

The Comprehensive Microbial Resource (CMR) is a facility for annotation of TIGR genome sequencing projects and a Web presentation of all fully sequenced microbial genomes that combines curation from the original sequencing centers with further curation from TIGR (for those genomes sequenced outside TIGR). The Web presentation of CMR includes the comprehensive collection of bacterial genome sequences, curated information, and related informatics methodologies. The scientist can view genes within a genome and also can link to related genes in other genomes. This allows construction of queries based on sequence searches, isoelectric point, GC-content, GC-skew, functional role assignments, growth conditions, environment, and other criteria, and thus the isolation of genes of interest. The database contains extensive curated data as well as prerun homology searches to facilitate data mining. The interface allows the display of the results in numerous formats that will help the user ask more accurate questions. This resource should be of value to the scientific community to design experiments and spur further research. Resources of this type are an essential tool to make sense of bacterial genome information as the number of completed genomes continues to grow.
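Two of the query attributes named above, GC-content and GC-skew, are simple functions of base counts. The following Python sketch illustrates the conventional definitions only; the abstract does not describe how CMR computes them, and the function names are hypothetical.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C among the unambiguous bases (A, C, G, T) in seq."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    total = g + c + seq.count("A") + seq.count("T")
    return (g + c) / total if total else 0.0


def gc_skew(seq: str) -> float:
    """Conventional GC skew, (G - C) / (G + C), for one strand or window."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0
```

In practice GC skew usually is evaluated in sliding windows along a chromosome rather than over a whole sequence at once; the sketch leaves the windowing to the caller.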

Rhodopseudomonas palustris Genome Project

Caroline S. Harwood

Department of Microbiology; University of Iowa; 3-432 Bowen Science Bldg.; Iowa City, IA 52242
319/335-7783, Fax: -7679, caroline-harwood@uiowa.edu

Rhodopseudomonas palustris is a common soil and water bacterium that makes its living by converting sunlight to cellular energy and by absorbing atmospheric carbon dioxide and converting it to biomass. This microbe can also degrade and recycle components of the woody tissues of plants (wood is the most abundant polymer on earth). Because of its intimate involvement in carbon management and recycling, R. palustris has been selected by the DOE Carbon Management Program to have its genome sequenced by the Human Genome Program's Joint Genome Institute (JGI).

R. palustris is acknowledged by microbiologists to be one of the most metabolically versatile bacteria ever described. Not only can it convert carbon dioxide gas into cell material, but it also can convert nitrogen gas into ammonia and produce hydrogen gas. It grows both in the absence and presence of oxygen. In the absence of oxygen, it prefers to generate all its energy from light by photosynthesis. It grows and increases its biomass by absorbing carbon dioxide, but it also can increase biomass by degrading organic compounds, including such toxic compounds as 3-chlorobenzoate, to cellular building blocks. When oxygen is present, R. palustris generates energy by degrading a variety of carbon-containing compounds (including sugars, lignin monomers, and methanol) and by carrying out respiration.

R. palustris undergoes two major developmental processes. The first is cell division by budding. This process of asymmetric cell division results in two different kinds of daughter cells: one a motile swarmer cell and the other a stalked nonmotile cell. The second is the differentiation of an elaborate system of intracytoplasmic membrane vesicles when cells run out of oxygen and are placed in light. The membranes are used to house photosynthetic pigments and associated proteins. Budding division and differentiation to photosynthetically competent cells both require a temporally regulated program of gene expression followed by a pattern of precise localization of protein products.

The diverse metabolism and the developmental cycles of R. palustris are a large part of what makes this bacterium such a seductive target for genome sequencing. With the entire genome sequence in hand, determining how R. palustris can coordinate and appropriately express its many metabolic capabilities in response to changing environmental conditions will be possible, as will devising strategies to maximize this bacterium's carbon-recycling capabilities.

R. palustris has a genetic system; genes can be moved in and out of this bacterium easily, and specific genes thus can be targeted for mutagenesis. This is of great value because it will allow researchers to rapidly apply information gained from genome sequencing to the developing area of functional genomics.

This work will supply the JGI with sufficient R. palustris genomic DNA for genome sequencing as well as any information needed about the biology of R. palustris.

Sequencing Microbial Genomes of Environmental Relevance

Jane E. Lamerdin

Joint Genome Institute; Lawrence Livermore National Laboratory; 7000 East Ave.; Livermore, CA 94550
925/423-3629, Fax: /422-2282, lamerdinl@llnl.gov
http://spider.jgi-psf.org/JGI_microbial/html/

The DOE Joint Genome Institute (JGI) has established a new program to obtain the complete genome sequence of microorganisms that may significantly impact global climate. This program supports the new DOE Global Carbon Management and Sequestration initiative, which funds basic research aimed at understanding factors that contribute to global warming and effective ways to manage carbon (particularly carbon dioxide) in soil and ocean ecosystems. The goal of JGI's effort is to explore the role of diverse microorganisms in carbon cycling by elucidating their genetic content to identify metabolic pathways that allow these organisms to adapt to their respective niches. These specialized processes include nutrient-uptake systems, pathways that contribute to nitrogen fixation and carbon cycling in soils, and pathways that regulate photosynthesis. JGI's work is focused initially on five microorganisms: Nitrosomonas europaea, Rhodopseudomonas palustris, Nostoc punctiforme, and two marine cyanobacteria, Prochlorococcus marinus and Synechococcus. The common trait shared by these microbes is that all are autotrophic (i.e., they fix CO₂ as their sole carbon source), are fairly numerous within their respective ecosystems, and contribute materially to carbon cycling or biomass production (with the exception of N. europaea).

N. europaea is a soil-dwelling chemolithoautotroph that oxidizes ammonia to nitrite, a process that often depletes nitrogen available to plants, thereby limiting CO₂ fixation. Significantly, when oxygen concentrations in soils are low, N. europaea reduces nitrite to N₂O, a greenhouse gas and a catalyst of ozone breakdown. We expect that the genome sequence of N. europaea, one of the few obligately autotrophic bacteria currently being sequenced, will allow us to catalog the identity and number of genes required for autotrophy. The genome sequence also should uncover special redox enzymes that allow N. europaea to adapt to the narrow niche it occupies.

R. palustris is a purple nonsulfur phototrophic bacterium commonly found in soils and fresh water. This species is of particular interest to the Carbon Management program because it is able to degrade and recycle components of woody tissues of plants (wood is the most abundant polymer on earth). It also possesses a large repertoire of metabolic capabilities, including the ability to fix CO₂ into cellular material, fix nitrogen gas into ammonia, and produce and use hydrogen gas. In the absence of oxygen, it grows phototrophically; in the presence of oxygen, it can generate energy by degrading sugars, organic acids, and methanol and can carry out respiration.

Nostoc punctiforme is a cyanobacterium that enters into symbiotic associations with fungi and lichens; these relationships are relevant to carbon cycling and sequestration in tundra. Nostoc species also have complex life cycles, fix nitrogen, and are capable of chromatic adaptation. Prochlorococcus and Synechococcus are unicellular picoplankton, which are major biomass producers in the world's temperate and tropical oceans. Synechococcus species are abundant in surface waters, while Prochlorococcus is found in the layer 100 to 200 m deep. Prochlorococcus possesses an unorthodox pigment composition of divinyl derivatives of chlorophyll a and b, alpha carotene, zeaxanthin, and a type of phycoerythrin. The last has not yet been shown to function in light harvesting. By contrast, the highly related Synechococcus contains chlorophyll a and phycobilins that are more typical of cyanobacteria. Prochlorococcus, the only photosynthetic organism known to contain this particular combination of pigments, could be a model for the ancestral photosynthetic bacterium that gave rise to cyanobacteria and chloroplasts. Sequence analysis of the Prochlorococcus genome may shed more light on this hypothesis, and a comparison of the two genomes should provide additional insights into cyanobacterial radiation in general.

In part due to the lack of physical maps and mapping resources for these particular organisms, we have employed a whole-genome shotgun strategy to determine the complete sequence of each microbe. To aid our assembly, we are supplementing our six- to eightfold genome coverage in plasmid paired ends with a large-insert scaffold of paired ends in the low-copy-number fosmid vector. As the genome size increases (e.g., in Nostoc), we will shift to BAC clones for this scaffold. These scaffold clones are being fingerprinted to aid in verification of the final sequence assembly. We also will obtain optical maps of several of the larger organisms, Nostoc in particular, through a collaboration with David Schwartz at the University of Wisconsin.

JGI has completed the initial data-generation phase for N. europaea and P. marinus, which produced >95% of the genomic sequence for each microbe. (Progress towards completion can be monitored through our Web site.) A similar level of coverage is anticipated for R. palustris by mid-March. Finishing is under way on the first two organisms, and we expect closure of both by spring of 2000. With the level of coverage achieved by the initial data-generation phase, we can readily generate a rough inventory of the types of genes present in each organism. Preliminary or draft analyses have been performed on N. europaea and P. marinus by Frank Larimer and his team at Oak Ridge National Laboratory. The resulting catalog gives user scientists access to the contents of unfinished sequence data in a readily usable format, without the need for protracted data manipulations on their part (see example). This allows them to focus on identifying gene products of particular interest to their research programs. The raw sequence data also are directly queryable through an accompanying BLAST server or can be downloaded from JGI's ftp server.

In summary, JGI's new microbial sequencing program is well under way, with at least three organisms on target to be completed before the end of FY00. A scientific advisory board has assigned additional organisms for FY00 that continue the theme of relevance to the Global Carbon Management and Sequestration effort. We anticipate generating about 20 to 25 Mb of microbial genomic sequence in FY00 (initially in ~eightfold genome coverage) and ramping to a rate of 60 Mb in FY01.

See also the related abstracts of Ronald Atlas, Daniel Arp, David Schwartz, Caroline Harwood, Frank Larimer, and Sallie Chisholm.

The Genome of Geobacter sulfurreducens

B. A. Methe, Linda Banerjei,¹ William C. Nierman,¹ O. Snoeyenbos-West, S. Sciufo, and Derek R. Lovley
Department of Microbiology; University of Massachusetts; Amherst, MA 01003
413/545-9651, Fax: -1578, dlovley@microbio.umass.edu

¹The Institute for Genomic Research; Rockville, MD 20850

The complete genome sequence of Geobacter sulfurreducens currently is being determined to better understand its genetic potential. G. sulfurreducens is an important member of a family (Geobacteraceae) of delta proteobacteria. This family is capable of oxidizing organic compounds, including aromatic hydrocarbons, to carbon dioxide with Fe(III) or other metals and metalloids, including U(VI), Tc(VII), Co(III), Cr(VI), Au(III), Hg(II), As(V), and Se(VI), serving as the terminal electron acceptor. It is the dominant group of iron-reducing microorganisms recovered from a wide variety of aquifer and subsurface environments when both molecular and traditional culturing techniques are used. Geobacter plays a critical role in the biogeochemical cycling of carbon, iron, and other metals. Its genetics and physiology are a subject of intense study in part due to the importance that these processes can play in the remediation of contaminated anaerobic subsurface environments. The determination of the G. sulfurreducens genome is being accomplished using a random shotgun cloning approach to provide at least sixfold coverage of a 1-Mb genome followed by closure of remaining physical or sequence gaps. Assembler software and other computer programs developed by The Institute for Genomic Research are used to assemble the genome and aid in gap closing, finishing, and annotation. Searches of sequences and contigs from the early random phase of sequencing using the BLAST algorithm and database have produced high scores with low expect values, indicating significant homologies to proteins contained in the database. These include enzymes considered important to basic housekeeping functions such as tRNA synthetases and amino acid synthesis as well as those essential to other metabolic processes known to occur in G. sulfurreducens, including nitrogen fixation. A number of sequences have produced no significant alignments, indicating the likelihood of genes encoding novel functions. Of further significance has been the extension of N-terminal sequences previously obtained from cytochromes known to be important in dissimilatory iron reduction. Thus, the genome will provide information crucial to the further understanding of this important metabolic process.
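The relationship between a high BLAST score and a low expect value can be illustrated with the standard bit-score form of the expect value, E = m × n × 2^(−S′), where m is the query length, n the database length, and S′ the bit score. The Python sketch below is illustrative only: the function name and the example sizes are hypothetical, no values from this project are used, and effective-length corrections applied by BLAST itself are ignored.

```python
def expect_value(query_length: int, database_length: int, bit_score: float) -> float:
    """Approximate BLAST expect value from a bit score.

    Uses E = m * n * 2**(-S'), ignoring edge-effect (effective length) corrections.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


# Hypothetical sizes: a 300-residue query searched against 50 million residues.
for bits in (30, 50, 100):
    print(bits, expect_value(300, 50_000_000, bits))
```

Each additional bit of score halves the expect value, which is why strong matches carry expect values many orders of magnitude below 1.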

**Optical Approaches for Physical Mapping and Sequence Assembly of the Deinococcus radiodurans Chromosome**

David C. Schwartz

Biotechnology Center; University of Wisconsin-Madison; 425 Henry Mall; Madison, WI 53706
608/265-0546, Fax: /262-6748, dcschwartz@facstaff.wisc.edu
www.chem.wisc.edu/~schwartz

Maps of genomic or cloned DNA frequently are constructed by analyzing the cleavage patterns produced by restriction enzymes. Restriction enzymes are remarkable reagents that consistently cleave only at specific four- to eight-nucleotide sequences, varying according to the specific enzymes.

Restriction enzymes are reliable, numerous, and easily obtainable, and there now are around 250 different sequences represented among thousands of enzymes. Restriction maps characterize gene structure and even entire genomes. Furthermore, such maps provide a useful scaffold for the alignment and verification of sequence data. Restriction maps generated by computer and predicted from the sequence are aligned with the actual restriction map.

Restriction enzyme action traditionally has been assayed by gel electrophoresis. This technique separates cleaved molecules on the basis of their mobilities under the influence of an applied electrical field within a gel-separation matrix (small fragments have a greater mobility than large ones). Although gel electrophoresis distinguishes different-sized DNA fragments (known as "fingerprinting"), the original order of these fragments remains unknown. The subsequent task of determining the order of such fragments is labor intensive, especially when making restriction maps of whole genomes, and, therefore, the procedure is not widely employed despite its obvious usefulness to genome analysis.

Our laboratory developed Optical Mapping, a system for the construction of ordered restriction maps from individual DNA molecules. The mapping substrate consisted of very large, randomly sheared genomic DNA fragments that were bound to derivatized glass surfaces and cleaved with the restriction enzyme Nhe I. The resulting fragments were imaged by fluorescence microscopy. Cut sites were visualized as gaps between cleaved DNA fragments that retained their original order. A whole-genome restriction map of Deinococcus radiodurans, a radiation-resistant bacterium able to survive up to 15,000 grays of ionizing radiation, was constructed without using DNA libraries, the polymerase chain reaction, or electrophoresis. Very large, randomly sheared, genomic DNA fragments were used to construct maps from individual DNA molecules that were assembled into two circular overlapping maps (2.6 and 0.415 Mb), without gaps. A third smaller chromosome (176 kb) was identified and characterized. Aberrant nonlinear DNA structures that may define chromosome structure and organization, as well as intermediates in DNA repair, were visualized directly by optical mapping techniques after irradiation.

This high-resolution restriction map was used by collaborators at The Institute for Genomic Research to verify sequence-assembly data from D. radiodurans by aligning it with the restriction map predicted from their sequence. Optical mapping of D. radiodurans also rendered insights into the organism's biology by providing a picture of the entire genome's basic organization. The genome was shown to be composed of two chromosomes rather than one, and the presence of other extrachromosomal elements was demonstrated.
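A restriction map predicted from sequence amounts to an ordered list of fragment lengths between successive cut sites. The Python sketch below shows one minimal way to compute such a list; it is not the software used in this work. It assumes the Nhe I recognition sequence GCTAGC with cleavage after the first G, and the function name and parameters are hypothetical.

```python
def digest(seq: str, site: str = "GCTAGC", cut_offset: int = 1,
           circular: bool = False) -> list[int]:
    """Return ordered fragment lengths from cutting seq at every occurrence of site.

    cut_offset is the position of the cut within the site (1 = after the first base).
    For circular molecules, a site spanning the sequence origin is not detected.
    """
    seq = seq.upper()
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = seq.find(site, start + 1)
    if not cuts:
        return [len(seq)]  # uncut molecule
    if circular:
        # Fragments between neighboring cuts, plus the one wrapping the origin.
        frags = [b - a for a, b in zip(cuts, cuts[1:])]
        frags.append(len(seq) - cuts[-1] + cuts[0])
        return frags
    bounds = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:])]
```

The predicted fragment order then can be compared with the measured map; the comparison itself has to tolerate sizing error and missed or spurious cuts, which this sketch does not attempt.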

Whole-genome characterization by optical mapping may facilitate further understanding of the radiation-resistant nature of D. radiodurans, which is being used as a vehicle for bioremediation of toxic organic pollutants within radioactive waste dumps.

Whole-Genome Sequence of Pyrobaculum aerophilum

Melvin I. Simon and Sorel Fitz-Gibbon

Biology Division; California Institute of Technology; 1200 E. California Blvd.; Pasadena, CA 91125
626/395-3944, Fax: /796-7066, simonm@starbase1.caltech.edu
www.tree.caltech.edu

Pyrobaculum aerophilum was chosen as a model organism for the study of hyperthermophiles and archaea. This rod-shaped microbe, isolated from a boiling marine vent, has a maximum growth temperature of 104°C, not far from the 113°C maximum known for all life. Unlike most hyperthermophiles, however, P. aerophilum is able to withstand exposure to oxygen and thus is amenable to experimental manipulations on the laboratory benchtop. In addition to being an ideal model-organism candidate, P. aerophilum warrants further studies because of its phylogenetic position as a member of the crenarchaea-eocytes, which may be the eukaryotes' closest prokaryotic relatives.

The entire P. aerophilum genome has been sequenced using a random shotgun approach (3.5X genomic coverage) followed by oligonucleotide primer-directed sequencing guided by our fosmid map. The genome was assembled and edited using the Phred-Phrap-Consed system. The 2.2-Mb genome codes for about 2500 proteins, 30% of which have been identified by sequence similarities to proteins of known function. We have made extensive use of the MAGPIE software for genome annotation and GeneMark and Glimmer for prediction of coding regions. In completing the "polishing" of the genome, we are nearing our goal of no more than 1 error in 10,000 bases. We also are continuing to annotate the genome and attempting to improve our functional predictions by using information on conserved residues, potential 3-D structure alignments, and gene phylogenies.
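The accuracy goal of 1 error in 10,000 bases can be restated on the Phred quality scale, on which a quality value Q corresponds to an error probability P through Q = −10 log₁₀(P). The Python sketch below shows the conversion and the expected number of errors at that rate over a 2.2-Mb genome; the function names are hypothetical, and the abstract does not state that the goal was tracked in this form.

```python
import math


def phred_quality(error_probability: float) -> float:
    """Phred-scaled quality for a per-base error probability."""
    return -10.0 * math.log10(error_probability)


def error_probability(quality: float) -> float:
    """Per-base error probability for a Phred-scaled quality value."""
    return 10.0 ** (-quality / 10.0)


def expected_errors(genome_size: int, per_base_error: float) -> float:
    """Expected number of erroneous bases at a uniform per-base error rate."""
    return genome_size * per_base_error


goal = 1 / 10_000
print(phred_quality(goal))               # quality corresponding to the stated goal
print(expected_errors(2_200_000, goal))  # expected errors across a 2.2-Mb genome
```

Under these assumptions the goal corresponds to a quality of about 40 and to roughly 220 erroneous bases across the whole genome.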

In our publications early in 1999, we discussed in detail the results of the annotation process. One interesting set of results pertains to genes involved in DNA repair. Two major mechanisms for avoiding mutations during DNA replication are the DNA polymerase's immediate editing of the growing strand and the mismatch-repair system's detection and correction of mismatches soon after replication. Homologs of the Escherichia coli proteins involved in mismatch repair have been found in humans, and damage to them has been implicated in hereditary nonpolyposis colon cancer. However, homologs to mismatch-repair proteins have not been detected in the P. aerophilum genome nor in any of the other three completed archaeal genomes. It remains to be seen whether mismatch-repair activities can be detected in these organisms, and, if so, whether different enzymes have been recruited for these functions or the archaeal homologs have diverged too much to be recognized by simple sequence comparisons.

Having the entire genome sequence is an extraordinary tool for research on this organism, and numerous downstream projects already are in progress. The genome sequence has been invaluable in guiding work to develop a laboratory research system that would allow such E. coli-like experiments as gene knockouts and homologous overexpression of archaeal proteins. The P. aerophilum genome-proteome also is being used by several laboratories worldwide to develop methods for high-throughput 3-D structure determination. Proteins from thermophiles appear to be more stable than their mesophilic homologs and may have higher rates of successful crystallization, thus simplifying the development of high-throughput "structural proteomics."

Completion of microbial genome sequences not only provides a wealth of information on individual species but also allows implementation of new methods for deciphering genomes. For example, it is now possible to predict functionally linked proteins simply by looking for proteins that share similar patterns of presence or absence across completed genomes. With perhaps half the proteins in microbial genomes having no clear functional assignments, a good deal of exciting work remains to be done.
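The presence-or-absence comparison described above can be sketched as follows. This Python example is a simplified illustration, not the published method: it groups only proteins whose profiles are identical, whereas a practical analysis would allow near matches and would discount profiles expected by chance. The protein and genome names are placeholders.

```python
from collections import defaultdict


def phylogenetic_profiles(presence: dict[str, set[str]],
                          genomes: list[str]) -> dict[str, tuple[bool, ...]]:
    """Map each protein to a presence/absence vector over the listed genomes."""
    return {protein: tuple(g in found for g in genomes)
            for protein, found in presence.items()}


def linked_groups(presence: dict[str, set[str]],
                  genomes: list[str]) -> list[list[str]]:
    """Group proteins that share an identical profile (candidate functional links)."""
    groups = defaultdict(list)
    for protein, profile in phylogenetic_profiles(presence, genomes).items():
        groups[profile].append(protein)
    return [sorted(members) for members in groups.values() if len(members) > 1]


# Placeholder data: which genomes contain a homolog of each protein.
genomes = ["genome1", "genome2", "genome3", "genome4"]
presence = {
    "protA": {"genome1", "genome3"},
    "protB": {"genome1", "genome3"},
    "protC": {"genome2"},
    "protD": {"genome1", "genome2", "genome3", "genome4"},
}
print(linked_groups(presence, genomes))
```

Here protA and protB are found in exactly the same genomes and are therefore flagged as candidates for a shared pathway or complex.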

This is a completed project.

Whole-Genome Shotgun Sequencing

Douglas Smith

Genome and Technology Development; Genome Therapeutics Corp.; 100 Beaver St.; Waltham, MA 02154-8440
781/398-2378 or /893-5007 (ext. 219), Fax: /893-9535 or /642-0310, doug.smith@genomecorp.com
www.genomecorp.com

The information in the chromosome of a bacterium (or any other organism) is encoded in the specific sequence of four chemical building blocks called nucleotides. Millions of these nucleotides are polymerized into long strands that stick together in pairs to form the DNA double helix. Genes are encoded in the DNA by specific sequences of nucleotides, much as the words in this paragraph are encoded by sequences of letters. Bacterial chromosomes typically contain 1 to 7 million nucleotide pairs (abbreviated Mb).

Current biochemical methods for determining DNA sequences generate "reads" of about 500 to 700 nucleotides. To sequence an entire bacterial genome, therefore, a method is needed for accurately piecing together lots of individual reads. To accomplish this, we use a "whole-genome shotgun" approach in which thousands of sequence reads (enough to span a whole genome 7 to 8 times) are generated from random locations in the genome. Using powerful computer programs, investigators then assemble these sequences into overlapping sets that, together with additional information, can be joined to reassemble the entire chromosome.

Methanobacterium thermoautotrophicum

This organism is a member of the archaea, one of the three major kingdoms into which all living things can be classified (the other two are bacteria, which include most of the familiar disease-causing organisms; and eucarya, which include protozoa, fungi, plants, animals, and humans). Archaea are interesting because many of their cellular processes are similar to those of eucarya, while others are more closely related to bacteria.

M. thermoautotrophicum, originally isolated from sewage sludge, also is found in the manure of farm animals. In combination with other organisms, M. thermoautotrophicum can be used to produce methane from such materials. The organism prefers growth temperatures of about 65°C and is capable of growing and producing methane in the presence of only hydrogen, carbon dioxide, and a few salts. The complete genome sequence provides information that could be used to reengineer the organism to grow more rapidly and to produce larger amounts of methane with fewer by-products. The thermostable proteins may be useful in the chemical industry as reagents for bioconversion or biocatalysis.

Using the whole-genome shotgun approach, we completed the sequence of the entire 1.75-Mb genome of M. thermoautotrophicum during 1997. In the shotgun phase, we generated over 36,000 sequence reads (about 13 Mb, or 7.5-fold genome coverage). The reads were assembled, and the resulting sets of overlapping fragments were joined together by using a "primer-walking" technique to generate new sequences extending from the ends of the contigs. Additional biochemical tools and computer programs were used to identify and fix misassembled regions and to confirm the links between the assembled sequences, allowing us to reconstruct the entire circular chromosome.
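The coverage figure quoted above follows from dividing the total bases read by the genome size. The Python sketch below reproduces that arithmetic and adds the idealized Lander-Waterman estimate that, for uniformly random reads at coverage c, a fraction of about e^(−c) of the genome is expected to remain unsampled. The function names are hypothetical, and the estimate is a textbook approximation rather than a result reported in this abstract.

```python
import math


def fold_coverage(total_read_bases: float, genome_size: float) -> float:
    """Average number of times each base of the genome is sampled."""
    return total_read_bases / genome_size


def expected_uncovered_fraction(coverage: float) -> float:
    """Idealized fraction of the genome hit by no read (uniform random reads)."""
    return math.exp(-coverage)


def reads_needed(genome_size: float, coverage: float, mean_read_length: float) -> int:
    """Number of reads required to reach a target coverage."""
    return math.ceil(genome_size * coverage / mean_read_length)


c = fold_coverage(13e6, 1.75e6)        # about 13 Mb of reads over a 1.75-Mb genome
print(round(c, 1))                     # close to the 7.5-fold figure quoted above
print(expected_uncovered_fraction(c))  # well under 0.1% under the idealized model
```

Real shotgun libraries are not perfectly random, so gaps in practice are more numerous than this model suggests, which is why a directed closure phase such as primer walking follows the random phase.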

The resulting sequence was analyzed to identify the encoded genes. Many M. thermoautotrophicum genes encode proteins that are more closely related to eucaryal proteins (from higher plants and animals) than to bacterial ones. This is especially true of components involved in transcription and translation, processes by which gene sequences are "expressed" to produce protein products in the cell. Comparisons to the genome of Methanococcus jannaschii (another archaeon) revealed many similarities but also many differences. Both organisms contain a significant number of unique genes that are unrelated to any other known genes. This finding underscores the high degree of complexity and genetic diversity present in the biological universe.

Clostridium acetobutylicum

Continued Microbial Genome Program work in our laboratory focused on the Gram-positive, spore-forming bacterium C. acetobutylicum ATCC 824. Its 4.1-Mb genome, reflecting its more complex life processes and metabolism, is more than twice the size of that of Methanobacterium. The organism is related to the pathogenic species C. botulinum, C. tetani, and C. perfringens, which cause the diseases botulism, tetanus, and gangrene, respectively.

Isolates of C. acetobutylicum were identified before the First World War when rubber shortages stimulated a search for microbes that could produce butanol for synthetic rubber production. Chaim Weizmann (who later became the first president of Israel) developed a process for ABE fermentation (to produce acetone, butanol, and ethanol) using C. acetobutylicum and plant starch that was later pursued commercially. Demand for acetone during the Second World War led to the establishment of a molasses-based ABE process, but increases in the cost of molasses, together with advances in the petrochemical industry, led to its eventual abandonment.

Since that time, scientific interest in the solvent-producing Clostridia has continued. A great deal of work has been done to elucidate the metabolic pathways by which solvents are produced. Many solvent-overproducing derivatives (strains) have been identified, and it is now possible to pursue a rational approach to develop modified strains with industrially useful properties. Experimental research systems have been developed that allow genes to be manipulated in these organisms, and strains have been altered to grow on cellulose constituents that will not support the growth of natural strains. The complete genome sequence will be immensely useful in further development of these organisms as natural bioconversion factories for the chemical and fuel industries.

C. acetobutylicum ATCC 824 was sequenced by the whole-genome shotgun approach, essentially as described above but including several technological advances. The finishing phase involved exhaustive gap closure and quality enhancement using a variety of biochemical methods and computational tools. Only a few gaps remain, and a publication describing the work is expected during 1999.

The genome sequences of M. thermoautotrophicum and C. acetobutylicum are freely available in public databases, enabling research scientists throughout the world to access the information to expedite the development of useful derivatives of these and other organisms.

This is a completed project.

The Complete Genome of the Hyperthermophilic Bacterium Aquifex aeolicus

Ronald Swanson

Diversa Corporation; 10665 Sorrento Valley Road; San Diego, CA 92121
619/623-5156, Fax: -5120, rswanson@diversa.com
www.diversa.com

Diversa Corporation has completed the genome sequence of the most heat-tolerant bacterium currently known. This organism, Aquifex aeolicus, is capable of growing at up to 95°C (203°F). Isolated and described only recently, Aquifex is related to filamentous bacteria first observed at the turn of the century, growing at 89°C in the outflow of hot springs in Yellowstone National Park. Observation of these macroscopic assemblages would later be instrumental in the drive to culture hyperthermophilic organisms.

Aquifex is able to grow on hydrogen, oxygen, carbon dioxide, and simple mineral salts. The complex metabolic machinery necessary to function as a hyperthermophilic chemolithoautotroph is encoded within a 1,551,335-bp genome only one-third the size of that of Escherichia coli; this small size appears to limit metabolic flexibility. The use of oxygen as an electron acceptor is enabled by the presence of a complex respiratory apparatus. Despite the fact that this organism grows at the extreme thermal limit for bacteria, only a few specific indications of thermophily are apparent from the genome.

One of the most exciting results of sequence analysis is the lack of coherence in the apparent phylogenies of different genes. It was widely anticipated that, because of the small subunit ribosomal RNA gene's branching position near the bacterial lineage's root, Aquifex gene analysis would shed light on the phenotype of bacteria's last common ancestor, including the bacterial domain's hypothesized thermophilic origin. However, in many cases protein-based phylogenies do not support the original rRNA-based placement, and they show no consistent picture of the organism's phylogeny. This result has fundamental implications for our understanding of the evolutionary mode.

The sequencing strategy used to assemble the complete genome was based on the whole-genome shotgun approach. Shotgun sequencing is characterized by two phases: an initial, completely random phase in which most data are collected, and a closure phase in which directed techniques are used to close gaps and complete the assembly. By pursuing a strategy in which only 97% coverage was achieved initially, we were able to limit the number of random-phase sequences to only 10,500. Sequence fragments were assembled on an Apple Macintosh computer using Sequencher, a commercially available assembly and editing program. Sequences were obtained from both ends of clones randomly chosen from a fosmid library; using Sequencher, we assembled these sequences with consensus sequences derived from the contigs of random-phase sequences. Gaps between contigs were closed by direct sequencing on fosmids not wholly contained within a contig. The final assembly comprises 13,785 sequences with an average edited read length of 557 bp.
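The relationship between the number of random-phase reads and the fraction of the genome covered can be illustrated with the idealized Lander-Waterman model, sketched below in Python. This is not described in the abstract as the authors' planning method; it is a standard textbook approximation. The sketch assumes that the 557-bp average edited read length reported for the final assembly also applies to the random-phase reads, and it ignores the minimum overlap needed to join two reads. All names are illustrative.

```python
import math

GENOME_SIZE = 1_551_335     # bp, from the abstract
RANDOM_READS = 10_500       # random-phase sequences, from the abstract
ASSUMED_READ_LENGTH = 557   # bp; assumption: final-assembly average applies here


def redundancy(reads: int, read_length: float, genome_size: int) -> float:
    """Sequence redundancy c = N * L / G."""
    return reads * read_length / genome_size


def expected_fraction_covered(c: float) -> float:
    """Expected fraction of the genome hit by at least one read: 1 - e^(-c)."""
    return 1.0 - math.exp(-c)


def expected_gap_count(reads: int, c: float) -> float:
    """Expected number of gaps (equivalently contigs), ignoring overlap: N * e^(-c)."""
    return reads * math.exp(-c)


c = redundancy(RANDOM_READS, ASSUMED_READ_LENGTH, GENOME_SIZE)
fraction = expected_fraction_covered(c)
gaps = expected_gap_count(RANDOM_READS, c)

print(f"redundancy: {c:.2f}x")
print(f"expected fraction covered: {fraction:.1%}")
print(f"expected gaps before closure: {gaps:.0f}")
```

Under these assumptions the model gives a redundancy of about 3.8-fold and an expected covered fraction close to the 97% quoted in the text, leaving on the order of a few hundred gaps for the directed closure phase.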

More than half of Aquifex's 1512 open reading frames were assigned a putative function based on similarity to known sequences. The extreme thermostability of Aquifex proteins, coupled with their bacterial origins, makes them ideal candidates for overexpression, nuclear magnetic resonance imaging, and X-ray crystallographic studies. Consequently, large numbers of researchers are pursuing structures of the thermostable Aquifex proteins, and several heterologously expressed proteins are being evaluated in commercial applications.

This is a completed project.

The Genome Sequence of the Hyperthermophilic Archaeon Pyrococcus furiosus

Robert B. Weiss, Frank Robb,¹ and James R. Brown²
Human Genetics Dept.; Eccles Institute of Human Genetics; 20 South 2030 East, Room 308 BPRB; University of Utah; Salt Lake City, UT 84112-5330
801/585-3435 or -5606, Fax: -7177, bob.weiss@genetics.utah.edu or bob@watneys.med.utah.edu

¹Center of Marine Biotechnology; University of Maryland
²Microbial Bioinformatics Group; SmithKline Beecham Pharmaceuticals

http://www-genetics.med.utah.edu/

Pyrococcus furiosus is the best-studied member of the unusual class of organisms known as extreme hyperthermophiles because they live at extremes of temperature and pressure. Isolated from geothermally heated marine sediment in the shallow waters off Vulcano Island, Italy, P. furiosus grows optimally at 100°C and derives its energy by fermentation of protein, peptide, and sugar mixtures found in its geothermal environment. The organism is fast growing and capable of dividing every 40 min.

Extreme hyperthermophiles play an important role in advancing the fundamental understanding of protein biochemistry, RNA and DNA metabolism, and protein interactions. How has a cell's macromolecular machinery adapted to function at 100°C? Proteins from organisms living at moderate temperatures unfold or denature when heated, but proteins from hyperthermophiles maintain their three-dimensional shapes. The genome sequence provides a resource for beginning to understand why this happens.

Extremely stable proteins have potential biotechnological uses as rugged industrial catalysts. The diverse metabolism of P. furiosus provides a wide variety of biocatalysts that are potentially useful as environmentally safe reagents in transforming biomass to derive energy and specialty chemicals and in degrading organic compounds for environmental detoxification.

The P. furiosus genome sequence was completed recently. Its circular chromosome is 1,908,253 bp long with a G-C content of 40.8%. The sequencing strategy tested a variant of whole-genome shotgun sequencing with a new sequencing vector that allows the genome to be subcloned as larger pieces. The genome was pieced together from fewer than 2500 subclones, compared to the more typical number of 20,000. These medium-insert sequencing vectors may help to assemble the larger human and mouse genomes.

Genome analysis and annotation are ongoing. Recently, the complete sequence of the distantly related P. horikoshii, which was isolated from a hydrothermal vent at a depth of 1395 m in the Sea of Japan, was determined by a group in Japan. P. furiosus and P. horikoshii diverged over 100 million years ago, and comparisons between them are providing unique insight into processes that result in changes to genes and genomes by revealing complex gene rearrangements and changes in gene content.

The sequence was completed in late November 1998, and the annotation phase was completed early in 1999. The sequence is available for searching and downloading from the Web. Library construction, sequencing, assembly, and the production of finished sequence were done at the University of Utah. Dr. Frank Robb's group provided the organism and has assisted in the finishing and annotation stages. Dr. Brown's group is assisting in the gene-finding and annotation stages of the project.

This is a completed project.

The online presentation of this 2000 publication is a special feature of the Human Genome Project Information Web site.

