
Genomes to Life Contractor-Grantee Workshop II
February 29-March 2, 2004, Washington, D.C.

Genomics:GTL Program Projects


Institute for Biological Energy Alternatives

26

Estimation of the Minimal Mycoplasma Gene Set Using Global Transposon Mutagenesis and Comparative Genomics

John I. Glass¹, Nina Alperovich¹, Nacyra Assad-Garcia¹, Holly Baden-Tillson¹, Hoda Khouri², Matt Lewis³, William C. Nierman², William C. Nelson², Cynthia Pfannkoch¹, Karin Remington¹, Shibu Yooseph¹, Hamilton O. Smith¹, and J. Craig Venter¹ (jcventer@tcag.org)

¹Institute for Biological Energy Alternatives, Rockville, MD; ²The Institute for Genomic Research, Rockville, MD; and ³The J. Craig Venter Science Foundation Joint Technology Center, Rockville, MD

IBEA aspires to make bacteria with specific metabolic capabilities encoded by artificial genomes. To achieve this, we must develop technologies and strategies for creating bacterial cells from constituent parts of either biological or synthetic origin. Determining the minimal gene set needed for a functioning bacterial genome in a defined laboratory environment is a necessary step towards our goal. For our initial rationally designed cell we plan to synthesize a genome based on a mycoplasma blueprint (mycoplasma being the common name for the class Mollicutes). We chose this bacterial taxon because its members already have small, near-minimal genomes that encode limited metabolic capacity and complexity. We are using two mycoplasma species as platforms to develop methods for construction of a minimal cell. Mycoplasma genitalium is a slow-growing human urogenital pathogen that has the smallest known genome of any free-living cell at 580 kb. It has already been used to make a preliminary estimate of the minimal gene set. Global transposon mutagenesis identified 130 of the 480 M. genitalium protein-coding genes as not essential for cell growth under laboratory conditions. That study also predicted that as many as 85 other M. genitalium genes may be similarly non-essential. Mycoplasma capricolum subsp. capricolum, an organism endemic in goats, was chosen as the second platform because of its rapid growth rate and reported genetic malleability. To facilitate work with this species we sequenced and annotated its 1,010,023 bp genome. In anticipation of eventually synthesizing artificial genomes containing a minimal set of genes necessary to sustain a viable replicating bacterial cell, we took two approaches to determine the composition of that gene set.
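The M. genitalium figures quoted above bound the size of the essential gene set. The short sketch below is only arithmetic on those figures. The variable names are illustrative, and treating the 85 predicted genes as an upper bound on further dispensable genes is an assumption based on the phrase "as many as 85".

```python
# Numbers quoted in the abstract for M. genitalium.
protein_coding_genes = 480
observed_nonessential = 130         # found by global transposon mutagenesis
predicted_extra_nonessential = 85   # "as many as 85" more (assumed upper bound)

# If none of the predicted genes turn out to be dispensable:
essential_upper = protein_coding_genes - observed_nonessential
# If all of them do:
essential_lower = essential_upper - predicted_extra_nonessential

print(f"Essential protein-coding genes: {essential_lower} to {essential_upper}")
```

Under these assumptions the preliminary estimate corresponds to a range of 265 to 350 essential protein-coding genes.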

In one approach we used global transposon mutagenesis to identify non-essential genes in both of our two platform mycoplasma species. We created, isolated, and expanded clonal populations of sets of random mutants. Transposon insertion sites were determined by sequencing directly from mycoplasma genomic DNA. This effort has already expanded the previously determined list of non-essential M. genitalium genes, and in this study, because we isolated and propagated each mutant, we can characterize the phenotypic effects of the mutations on growth rate and colony morphology. Additionally, identification of non-essential genes in our two distantly related mycoplasma species permits a better estimate of the essential mycoplasma gene set.
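The step from sequenced insertion sites to a list of candidate non-essential genes can be sketched as a coordinate lookup. This is a minimal illustration, not the pipeline used in this work: the `Gene` type, the `disrupted_genes` function, and all coordinates are hypothetical, and the sketch assumes genes do not overlap.

```python
from bisect import bisect_right
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def disrupted_genes(genes: list[Gene], insertion_sites: list[int]) -> set[str]:
    """Return names of genes that contain at least one insertion site.

    Assumes genes do not overlap, so sorting by start and a binary
    search are enough to find the only gene that could contain a site.
    """
    ordered = sorted(genes, key=lambda g: g.start)
    starts = [g.start for g in ordered]
    hits: set[str] = set()
    for site in insertion_sites:
        i = bisect_right(starts, site) - 1  # last gene starting at or before site
        if i >= 0 and site <= ordered[i].end:
            hits.add(ordered[i].name)
    return hits


# Hypothetical coordinates, for illustration only.
genes = [Gene("geneA", 100, 400), Gene("geneB", 450, 900), Gene("geneC", 1000, 1300)]
sites = [120, 430, 1250, 1299]   # 430 falls in an intergenic gap
candidates = disrupted_genes(genes, sites)
print(sorted(candidates))
```

A gene disrupted in a viable mutant is only a candidate: an insertion close to the 3' end of a gene, for example, may leave a functional product, so each hit still needs the kind of phenotypic follow-up described above.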

In our other approach, we analyzed 11 complete and 3 partially sequenced mycoplasma genomes to define a consensus mycoplasma gene set. Previous similar computational comparisons of genomes across diverse phyla of the eubacteria are of limited value. Because of non-orthologous gene displacement, pan-bacterial comparisons identified fewer than 100 genes common to all bacteria; however, determination of conserved genes within the narrow mycoplasma taxon is much more instructive. The combination of comparative genomics with reports of specific enzymatic activities in different mycoplasma species enabled us to predict what elements are critical for this bacterial taxon. In addition to determining the consensus set of genes involved in different cellular functions, we identified 10 hypothetical genes conserved in almost all the genomes, and paralogous gene families likely involved in antigenic variation that comprise significant fractions of each genome and are presumably unnecessary for cell viability under laboratory conditions.
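The idea of a consensus gene set, including genes conserved in "almost all" genomes, can be illustrated with a presence count per gene family. The sketch assumes that genes have already been grouped into ortholog families, which is the hard part and is not shown. The function name, the family labels, and the 75% threshold are hypothetical.

```python
from collections import Counter


def consensus_genes(genomes: dict[str, set[str]], min_fraction: float = 1.0) -> set[str]:
    """Return gene families present in at least min_fraction of the genomes."""
    counts = Counter(fam for families in genomes.values() for fam in families)
    needed = min_fraction * len(genomes)
    return {fam for fam, n in counts.items() if n >= needed}


# Toy presence data with made-up family labels, for illustration only.
genomes = {
    "species_1": {"famA", "famB", "famC", "famD"},
    "species_2": {"famA", "famB", "famC"},
    "species_3": {"famA", "famB", "famD"},
    "species_4": {"famA", "famB", "famC", "famD"},
}

core = consensus_genes(genomes)              # present in every genome
near_core = consensus_genes(genomes, 0.75)   # present in at least 3 of 4

print(sorted(core))
print(sorted(near_core))
```

Relaxing the threshold below 1.0 is one simple way to tolerate partially sequenced genomes, where a gene may be absent from the data without being absent from the organism.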

27

Whole Genome Assembly of Infectious φX174 Bacteriophage from Synthetic Oligonucleotides

Hamilton O. Smith¹, Clyde A. Hutchison III², Cynthia Pfannkoch¹, and J. Craig Venter¹ (jcventer@tcag.org)

¹Institute for Biological Energy Alternatives, Rockville, MD and ²Department of Microbiology and Immunology, University of North Carolina, Chapel Hill, NC

We have improved upon the methodology and dramatically shortened the time required for accurate assembly of 5 to 6 kb segments of DNA from synthetic oligonucleotides. As a test of this methodology we have established conditions for the rapid (14 days) assembly of the complete infectious genome of bacteriophage φX174 (5,386 bp) from a single pool of chemically synthesized oligonucleotides. The procedure involves three key steps: 1) gel purification of pooled oligonucleotides to reduce contamination with molecules of incorrect chain length, 2) ligation of the oligonucleotides under stringent annealing conditions (55 °C) to select against annealing of molecules with incorrect sequences, and 3) assembly of ligation products into full-length genomes by polymerase cycling assembly (PCA), a non-exponential reaction in which each terminal oligonucleotide can be extended only once to produce a full-length molecule. We observed a discrete band of full-length assemblies upon gel analysis of the PCA product, without any PCR amplification. PCR amplification was then used to obtain larger amounts of pure full-length genomes for circularization and infectivity measurements. The synthetic DNA had a lower infectivity than natural DNA, indicating approximately one lethal error per 500 bp. However, fully infectious φX174 virions were recovered following electroporation into E. coli. Sequence analysis of several infectious isolates verified the accuracy of these synthetic genomes. One such isolate had exactly the intended sequence. We propose to assemble larger genomes by joining separately assembled 5 to 6 kb segments; approximately 60 such segments would be required for a minimal cellular genome. Fig. 1 is a schematic diagram of the steps in the global synthesis of infectious φX174 bacteriophage from synthetic oligonucleotides.

The power of the above global assembly method will be fully realized when methods to remove errors from the final product are developed. Further experiments are underway to increase the efficiency of error correction.


Fig. 1. Schematic diagram of the steps in the global synthesis of infectious φX174 bacteriophage from synthetic oligonucleotides.
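The error rate quoted in this abstract explains why error removal matters so much for the approach. The sketch below assumes that lethal errors occur independently and uniformly along the genome, so that the number per molecule follows a Poisson distribution. The abstract does not state this model, and the function name is illustrative.

```python
import math

GENOME_LENGTH_BP = 5386      # φX174 genome length, from the abstract
BP_PER_LETHAL_ERROR = 500    # approximate figure from the abstract


def error_free_fraction(length_bp: float, bp_per_error: float) -> float:
    """Poisson probability that a molecule of this length carries no error."""
    expected_errors = length_bp / bp_per_error
    return math.exp(-expected_errors)


viable = error_free_fraction(GENOME_LENGTH_BP, BP_PER_LETHAL_ERROR)
print(f"Expected lethal errors per genome: {GENOME_LENGTH_BP / BP_PER_LETHAL_ERROR:.1f}")
print(f"Fraction of molecules with none:   {viable:.1e}")

# Scale of the proposed next step: about 60 segments of 5 to 6 kb each.
low_kb, high_kb = 60 * 5, 60 * 6
print(f"60 segments span roughly {low_kb} to {high_kb} kb")
```

With about eleven expected lethal errors per molecule, only a few molecules in every hundred thousand would be error-free under this model, which is consistent with recovering infectious virions only after biological selection in E. coli.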

28

Development of a Deinococcus radiodurans Homologous Recombination System

Sanjay Vashee, Ray-Yuan Chuang, Christian Barnes, Hamilton O. Smith, and J. Craig Venter (jcventer@tcag.org)

Institute for Biological Energy Alternatives, Rockville, MD

A major goal of our Institute is to rationally design synthetic microorganisms that are capable of carrying out the required functions. One of the requirements for this effort entails the packaging of the designed pathways into a cohesive genome. Our approach to this problem is to develop an efficient in vitro homologous recombination system based upon Deinococcus radiodurans (Dr). This bacterium was selected because it has the remarkable ability to survive 15,000 Gy of ionizing radiation. In contrast, doses below 10 Gy are lethal to almost all other organisms. Although hundreds of double-strand breaks are created, Dr is able to accurately restore its genome without evidence of mutation within a few hours after exposure, suggesting that the bacterium has a very efficient repair mechanism. The major repair pathway is thought to be homologous recombination, mainly because Dr strains containing mutations in recA, the bacterial recombinase, are sensitive to ionizing radiation.

Since the mechanism of homologous recombination is not yet well understood in Dr, we have undertaken two general approaches to study this phenomenon. First, we are establishing an endogenous extract that contains homologous recombination activity. This extract can then be fractionated to isolate and purify all proteins that perform homologous recombination. We are also utilizing information from the sequenced genome. For example, homologues of E. coli homologous recombination proteins, such as recD and ruvA, are present in Dr. Thus, another approach is to assemble the homologous recombination activity by purifying and characterizing the analogous recombinant proteins. However, not all genes that play a major role in homologous recombination have been identified by annotation.

As a case in point, there are two candidates for the single-stranded DNA binding protein, Ssb (Dr0099 and Dr0100). To determine which of the two is the real Ssb, we first resequenced the Ssb region. We discovered two single-base deletions that, when corrected, give rise to a contiguous gene that contains two Ssb OB fold domains. We have purified the recombinant protein almost to homogeneity and characterized its DNA binding and strand-exchange properties. Our results suggest that despite some minor differences, the Deinococcus Ssb is very similar to the E. coli protein. In addition, using antibodies we have raised against DrSsb, we have determined that the amount of DrSsb protein, like that of RecA, increases when the cell is exposed to a DNA-damaging agent.

29

Environmental Genome Shotgun Sequencing of the Sargasso Sea

J. Craig Venter (jcventer@tcag.org), Karin Remington, Jeff Hoffman, Holly Baden-Tillson, Cynthia Pfannkoch, and Hamilton O. Smith

Institute for Biological Energy Alternatives, Rockville, MD

We have applied whole genome shotgun sequencing to pooled environmental DNA samples in this study to test whether new genomic approaches can be effectively applied to gene and species discovery and to overall environmental characterization. To help ensure a tractable pilot study, we sampled in the Sargasso Sea, a nutrient-limited, open ocean environment. Further, we concentrated on the genetic material captured on filters sized to isolate primarily microbial inhabitants of the environment, leaving detailed analysis of dissolved DNA and viral particles on one end of the size spectrum, and eukaryotic inhabitants on the other, for subsequent studies.

Surface water samples were collected from three sites off the coast of Bermuda in February 2003. Additional samples were collected from a neighboring fourth site in May 2003. Genomic DNA was extracted from filters of 0.1 to 3.0 microns, and genomic libraries with insert sizes ranging from 2 to 6 kb were made and sequenced from both ends. The 1.66 million sequences from the February samples were pooled and assembled to provide a single master assembly for comparative purposes. An additional 325,608 reads from the May samples were also analyzed. The assembly generated 64,398 scaffolds ranging in size from 826 bp to 2.1 Mbp, containing 256 Mbp of unique sequence and spanning 400 Mbp. Evidence-based gene finding revealed 1,214,207 genes within this dataset, including 1,412 distinct small subunit rRNA genes. With this set of rRNA genes, using a 97% sequence similarity cut-off to distinguish unique phylotypes, we identified 148 novel phylotypes in our sample when compared against the RDP II database [2]. Because the copy number of rRNA genes varies greatly between taxa (more than an order of magnitude among prokaryotes), rRNA-based phylogeny studies can be misleading. Therefore, we constructed phylogenetic trees using various other represented phylogenetic markers found in our dataset. Assignment to phylogenetic groups shows a broad consensus among the different phylogenetic markers.

Just as phylogenetic classification is strengthened by a more comprehensive marker set, so too is the estimation of species richness. In this analysis, we define “genomic” species as a clustering of assemblies or unassembled reads more than 94% identical on the nucleotide level. This cut-off, adjusted for the protein-coding marker genes, is roughly comparable to the 97% cut-off traditionally used for rRNA. Thus defined, the mean number of species at the point of deepest coverage was 451; this serves as the most conservative estimate of species richness. However, in most of the samples we observed an average maximum abundance of only 3.3%. This is a level of diversity akin to what has been observed in terrestrial samples [3].
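How an identity cut-off turns sequences into countable "species" can be sketched with a greedy, representative-based clustering. The abstract does not describe the clustering procedure actually used, so this is an assumed stand-in. It also assumes the sequences are already aligned to equal length, and the toy sequences and function names are hypothetical.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions in two aligned, equal-length sequences."""
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs: list[str], threshold: float) -> list[list[str]]:
    """Put each sequence in the first cluster whose representative (its first
    member) it matches at more than `threshold` identity; otherwise start a
    new cluster."""
    clusters: list[list[str]] = []
    for s in seqs:
        for members in clusters:
            if identity(s, members[0]) > threshold:
                members.append(s)
                break
        else:
            clusters.append([s])
    return clusters


# Toy aligned sequences, for illustration only.
seqs = [
    "ACGTACGTACGTACGTACGT",
    "ACGTACGTACGTACGTACGA",   # 19 of 20 positions match the first sequence
    "TTTTACGTACGTACGTAAAA",   # much more divergent
]

at_94 = greedy_cluster(seqs, threshold=0.94)
at_97 = greedy_cluster(seqs, threshold=0.97)
print(len(at_94), len(at_97))
```

The same three sequences give two clusters at the 94% cut-off and three at 97%, which shows why the cut-off has to be matched to the marker being compared before species counts from different markers can be put side by side.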

While counts of observed species in a sample are directly obtainable, the true number of distinct species within a sample is almost certainly greater than that which can be observed by finite sequence sampling. Modeling based on assembly depth of coverage indicates that there are at least 1,800 species in the combined sample, and that a minimum of 12-fold deeper sampling would be required to obtain 95% of the unique sequence. Further, the depth of coverage modeling is consistent with as much as 80% of the assembled sequence being contributed by organisms at very low individual abundance, compatible with total diversity orders of magnitude greater than the lower bound just given. The assembly coverage data also imply that more than 100 Mbp of genome (i.e., probably more than 50 species) is present at coverage high enough to permit assembly of a complete or nearly complete genome were we to sequence to 5- to 10-fold greater sampling depth.
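The link between sampling depth and recoverable sequence can be illustrated with the standard Lander-Waterman idealization, in which reads land uniformly at random on a genome. This is not the depth-of-coverage model used in the study, which is not specified here. The read length, genome size, and abundances below are assumptions chosen for illustration; only the February read count comes from the abstract.

```python
import math


def expected_coverage(n_reads: float, read_len_bp: float,
                      abundance: float, genome_size_bp: float) -> float:
    """Mean depth of coverage for one organism in a mixed sample.

    `abundance` is that organism's share of the sequenced DNA.
    """
    return n_reads * read_len_bp * abundance / genome_size_bp


def fraction_covered(depth: float) -> float:
    """Lander-Waterman expectation: share of bases hit by at least one read."""
    return 1.0 - math.exp(-depth)


def depth_for_fraction(fraction: float) -> float:
    """Depth at which the expected covered share reaches `fraction`."""
    return -math.log(1.0 - fraction)


N_READS = 1_660_000        # February reads, from the abstract
READ_LEN_BP = 800          # ASSUMPTION: typical read length, not from the source
GENOME_SIZE_BP = 2_000_000 # ASSUMPTION: hypothetical genome size

for abundance in (0.01, 0.001, 0.0001):   # hypothetical shares of the DNA
    depth = expected_coverage(N_READS, READ_LEN_BP, abundance, GENOME_SIZE_BP)
    print(f"abundance {abundance:.4f}: depth {depth:.2f}x, "
          f"covered {fraction_covered(depth):.1%}")

print(f"Depth needed to cover 95% of a genome: {depth_for_fraction(0.95):.2f}x")
```

Under this idealization an organism needs about 3-fold depth before 95% of its genome is expected to be sampled, so organisms at low abundance contribute mostly fragmentary sequence unless total sampling is increased many-fold.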

We demonstrate the utility of such a dataset with a study of genes relevant to photobiology within the Sargasso Sea. The recent discovery of a homolog of bacteriorhodopsin in an uncultured γ-proteobacterium from Monterey Bay revealed the basis of a novel form of phototrophy in marine systems [4] that was observed previously by oceanographers [5, 6]. Environmental culture-independent gene surveys with PCR have since shown that proteorhodopsin is not limited to a single oceanographic location, and revealed some 67 additional closely related proteorhodopsin homologs [7]. More than 782 rhodopsin homologs were identified within our dataset, increasing the total number of identified proteorhodopsins by almost an order of magnitude. In total, we have identified 13 distinct subfamilies of rhodopsin-like genes. These include four families of proteins known from cultured organisms (halorhodopsin, bacteriorhodopsin, sensory opsins, and fungal opsin), and 9 families from uncultured species, of which 7 are known only from the Sargasso Sea populations.

While we are a long way from a full understanding of the biology of the organisms sampled here, even this relatively small study demonstrates areas where important insights may be gained from the comprehensive nature of this approach. Our assembly results demonstrate one can apply whole-genome assembly algorithms successfully in an environmental context, with the only real limitation being the sequencing cost.

References

  1. The authors acknowledge the significant contributions of their collaborators on this project: J. Heidelberg, J. A. Eisen, D. Wu, I. Paulsen, K. E. Nelson, W. Nelson, D. E. Fouts, O. White, and J. Peterson at The Institute for Genomic Research; A. L. Halpern, D. Rusch, and S. Levy at The Center for the Advancement of Genomics; A. H. Knap, M. W. Lomas, and R. Parsons at the Bermuda Biological Station for Research; Y. Rogers at the JCVSF Joint Technology Center; and K. Nealson at the University of Southern California.
  2. J. R. Cole et al., Nucleic Acids Research 31, 442 (Jan 1, 2003).
  3. T. P. Curtis, W. T. Sloan, J. W. Scannell, Proceedings of the National Academy of Sciences of the United States of America 99, 10494 (Aug 6, 2002).
  4. O. Beja et al., Science 289, 1902 (Sep 15, 2000).
  5. Z. S. Kolber, C. L. Van Dover, R. A. Niederman, P. G. Falkowski, Nature 407, 177 (Sep 14, 2000).
  6. Z. S. Kolber et al., Science 292, 2492 (Jun 29, 2001).
  7. G. Sabehi et al., Environ Microbiol 5, 842 (Oct, 2003).