IMG: Integrated Microbial Genomes

Background

A metagenome (environmental genome, community genome) is a sample of the aggregate genome of all community members, obtained directly from the natural environment without a preliminary cultivation step.

Two alternative metagenome sequencing strategies are generally followed:

  1. directed sequencing, i.e. sequencing of long-insert libraries after screening for the presence of certain phylogenetic (e.g. 16S rRNA genes) or functional (e.g. a certain enzymatic activity) markers;
  2. shotgun sequencing, i.e. either Sanger sequencing of random clones generated from aggregate DNA, or pyrosequencing of aggregate DNA without cloning.

Metagenome data analysis aims at addressing at least one of the following three questions:

  1. Diversity and abundance of community members ("who is there");
  2. Metabolic potential of the community and its members ("what they are doing");
  3. Ecological relations between members of the community ("why they are there").

Definitions

Metagenomic sample - usually equivalent to the isolated aggregate DNA obtained from a certain environment; ideally it should be accompanied by comprehensive metadata describing how the sample was obtained (e.g. location, type of environment, host, isolation protocol). The DNA is used to create a metagenomic clone library, which is then sequenced to produce reads. DNA isolation and cloning may introduce biases, so the representation of each species in the metagenomic sequence may differ from that in the environment.

Read - a Sanger sequencing read; if left unassembled, it becomes a single-read contig in the metagenomic dataset. Single-read contigs almost never appear in isolate genomes (even at the draft stage), but some metagenomic datasets consist almost entirely of unassembled reads (so-called shrapnel).

Contig - the result of assembly of reads based on nucleotide sequence identity; the sequence of the contig is a consensus sequence of multiple reads generated by a genome assembler (such as JAZZ, Celera Assembler or PHRAP [1-3]) and may not be identical to any particular read.
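The idea of a consensus sequence can be illustrated with a minimal sketch. It assumes the reads have already been aligned and padded with '-' to equal length, and it uses a simple per-column majority vote; real assemblers such as those cited above use quality scores and far more elaborate logic, so this is not a description of any of them. The function name and the sample reads are hypothetical.

```python
from collections import Counter


def consensus(aligned_reads):
    """Majority-vote consensus over pre-aligned reads of equal length.

    '-' marks positions that a read does not cover (an assumption of
    this sketch, not a convention of any particular assembler).
    """
    if not aligned_reads:
        raise ValueError("at least one read is required")
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("aligned reads must have equal length")

    result = []
    for column in zip(*aligned_reads):
        counts = Counter(base for base in column if base != "-")
        # Emit 'N' when no read covers this column.
        result.append(counts.most_common(1)[0][0] if counts else "N")
    return "".join(result)


# Three overlapping reads; the second one disagrees at one position.
reads = [
    "ACGTAC----",
    "--GTTCGA--",
    "----ACGATT",
]
contig = consensus(reads)
```

Here the second read is outvoted at the position where it disagrees, so its sequence does not occur verbatim in the resulting contig.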

Scaffold - the result of assembly of contigs joined by N-bridged gaps based on read mate-pair information; both scaffold and contig sequences are further used to identify genes.
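As a rough illustration of what "N-bridged gaps" means in practice, the sketch below joins ordered contigs with runs of N and splits a scaffold back into its contigs. It assumes the contig order and the gap sizes have already been estimated (for example from mate-pair information); the function names and inputs are hypothetical.

```python
import re


def build_scaffold(contigs, gap_sizes):
    """Join ordered contigs with runs of 'N' of the given sizes."""
    if not contigs:
        raise ValueError("at least one contig is required")
    if len(gap_sizes) != len(contigs) - 1:
        raise ValueError("need exactly one gap size between adjacent contigs")

    parts = [contigs[0]]
    for gap, contig in zip(gap_sizes, contigs[1:]):
        parts.append("N" * gap)  # gap of estimated length, sequence unknown
        parts.append(contig)
    return "".join(parts)


def split_scaffold(scaffold):
    """Recover the contig sequences by splitting on runs of 'N'."""
    return [piece for piece in re.split(r"N+", scaffold) if piece]


scaffold = build_scaffold(["ACGT", "GGCC", "TTAA"], [3, 5])
```

Note that splitting on N runs only recovers the original contigs if the contigs themselves contain no ambiguous N bases.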

RNA-coding genes: 16S and 23S rRNAs are usually identified by BLASTn against the corresponding sequences in isolate genomes; 5S rRNA is identified by BLASTn or using the Rfam/INFERNAL approach [4]. tRNA-coding genes are identified using tRNAscan-SE [5]. Other stable RNAs, including RNase P, SsrS RNA, SRP and riboswitches, are rarely predicted in metagenomic datasets. CDSs (protein-coding genes) are usually identified automatically by ab initio gene finding software, such as fgenesb, Glimmer or GeneMark [6-8]; alternatively, they can be predicted by running BLASTx against protein databases.

Functional annotations (protein product descriptions) are usually assigned automatically using RPS-BLAST hits against the Conserved Domain Database (CDD) [9], which combines information from COG (Clusters of Orthologous Groups) and Pfam with several other minor sources; this approach is complemented by BLASTp against protein databases.

Bins are sets of metagenomic sequence fragments originating from one phylogenetic group, preferably from the same strain (or phylotype - see the figure below). Scaffolds, contigs and reads are assigned to bins by a binning tool, which can use either the oligonucleotide composition of DNA fragments (TETRA, PhyloPythia [10-11]) or the phylogenetic affiliations of protein-coding genes (MEGAN [12] is not a binning tool as such, but its analysis is similar to what a phylogenetic binning tool would do).

[Figure: Phylotypes]
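To make "oligonucleotide composition" more concrete, the following sketch computes a tetranucleotide frequency profile for a DNA fragment and compares two profiles by Pearson correlation. This is a deliberately simplified illustration, not the TETRA or PhyloPythia algorithm: it uses raw frequencies on a single strand, skips any window containing a non-ACGT character, and all function names and sequences are hypothetical.

```python
from collections import Counter
from itertools import product
from math import sqrt

# All 256 tetranucleotides in a fixed order.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetra_profile(seq):
    """Relative frequency of each tetranucleotide in seq (single strand)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    # Windows containing ambiguous bases (e.g. 'N') are not in KMERS
    # and are therefore ignored.
    total = sum(counts[k] for k in KMERS)
    if total == 0:
        return [0.0] * len(KMERS)
    return [counts[k] / total for k in KMERS]


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if undefined)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


fragment_a = "ACGTACGTTGACCAGTACGGATCCATGACGT"
fragment_b = "TTTTAAAATTTTAAAATTTTAAAATTTTAAA"
similarity = pearson(tetra_profile(fragment_a), tetra_profile(fragment_b))
```

In a composition-based approach, fragments whose profiles correlate strongly would be candidates for the same bin; short fragments give noisy profiles, which is one reason reads and small contigs are hard to bin.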

Data Processing

Processing of metagenomic datasets, especially those derived from high-complexity microbiomes, is characterized by a significantly higher error rate than processing of isolate genomes [13]. The problems include assembly of chimeric contigs (i.e. assembly of reads originating from different taxonomic groups), under-assembly (i.e. reads that should have been assembled remain as single-read contigs), a higher rate of false-positive and false-negative gene predictions (mostly due to gene fragmentation), and low sensitivity of binning (i.e. a relatively small portion of scaffolds and contigs is assigned to bins, bins correspond to taxonomic groups larger than a species, etc.). Therefore the importance of manual inspection of the data and validation of the results of any analysis cannot be overestimated.
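Two of the problems above can be summarized with simple numbers when inspecting a dataset. The sketch below assumes that, for each contig, the number of assembled reads, the sequence length and an optional bin label are available; these inputs and the function names are hypothetical and do not correspond to any particular file format or tool.

```python
def single_read_fraction(reads_per_contig):
    """Fraction of contigs built from exactly one read (unassembled reads)."""
    if not reads_per_contig:
        return 0.0
    singles = sum(1 for n in reads_per_contig if n == 1)
    return singles / len(reads_per_contig)


def binned_fraction(lengths, bin_labels):
    """Fraction of total sequence length that was assigned to any bin.

    bin_labels[i] is the bin of sequence i, or None if it was left unbinned.
    """
    if len(lengths) != len(bin_labels):
        raise ValueError("lengths and bin_labels must be parallel lists")
    total = sum(lengths)
    if total == 0:
        return 0.0
    binned = sum(n for n, label in zip(lengths, bin_labels) if label is not None)
    return binned / total


# Hypothetical dataset: four contigs, two of them single-read.
under_assembly = single_read_fraction([1, 1, 5, 12])
coverage = binned_fraction([1000, 3000, 1000], ["binA", None, "binA"])
```

A high single-read fraction or a low binned fraction does not by itself indicate an error, but both are reasonable signals that the results need closer manual inspection.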

References

1. JAZZ assembler: Aparicio, S., et al. 2002. Whole genome shotgun assembly and analysis of the genome of Fugu rubripes. Science 297: 1301-1310.

2. Celera assembler: Myers, E.W., et al., 2000. A whole-genome assembly of Drosophila. Science, 287(5461), 2196-2204.

3. PHRAP assembler: http://www.phrap.org/phredphrapconsed.html

4. Rfam and Infernal RNA prediction: Griffiths-Jones, S., et al. 2003. Rfam: an RNA family database. Nucleic Acids Res., 31(1), 439-441.

5. tRNAscan-SE tRNA prediction: Lowe, T. M., & Eddy, S. R. 1997. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25(5), 955-964.

6. Fgenesb gene prediction: http://www.softberry.com/berry.phtml?topic=fgenesb&group=programs&subgroup=gfindb

7. Glimmer gene prediction: Delcher, A.L., et al. 1999. Improved Microbial Gene Identification with Glimmer, Nucleic Acids Res., 27(23), 4636-4641.

8. GeneMark gene prediction: Besemer, J., et al. 2001. GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic Acids Res. 29(12), 2607-2618.

9. CDD database domain analysis: Marchler-Bauer, A., et al. 2007. CDD: a conserved domain database for interactive domain family analysis. Nucleic Acids Res. 35, D237-D240.

10. TETRA binning tool: Teeling, H. et al. 2004. TETRA: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences. BMC Bioinformatics. 5, 163.

11. PhyloPythia binning tool: McHardy, A. C., et al. 2007. Accurate phylogenetic classification of variable-length DNA fragments. Nat. Methods. 4(1), 63-72.

12. MEGAN tool for phylogenetic analysis: Huson, D. H., et al. 2007, MEGAN analysis of metagenomic data. Genome Res. 17(3), 377-386.

13. Problems of metagenome data processing: Mavromatis, K. et al. On the fidelity of processing metagenomic sequences using simulated datasets. Nat. Methods, in press. See also http://fames.jgi-psf.org/