IMG: Integrated Microbial Genomes

How IMG Computes Biological Concepts

Homologs

Using precomputed BLAST data, the IMG system finds unidirectional hits with an E-value below 10^-2 across all IMG genomes and identifies them as likely homologs of a given gene of interest. You can set a minimum percent identity for homologs by changing the "Min. Homolog Percent Identity" value on the Preferences page. Additional filtering by percent identity, bit score, and E-value is applied as needed to keep result sets manageable.
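The filtering described above can be sketched as follows. This is a minimal illustration, not IMG's actual code: the Hit record, the likely_homologs function, and the sample hits are hypothetical names and data, and only the 10^-2 E-value cutoff and the user-set minimum percent identity come from the text.

```python
from typing import NamedTuple

class Hit(NamedTuple):
    """One precomputed BLAST hit (hypothetical record layout)."""
    query: str               # gene of interest
    subject: str             # gene that was hit in some genome
    evalue: float
    percent_identity: float
    bit_score: float

MAX_EVALUE = 1e-2  # E-value cutoff stated in the text

def likely_homologs(hits, query, min_percent_identity=0.0):
    """Return unidirectional hits for `query` that pass the cutoffs,
    ordered from highest to lowest bit score."""
    kept = [
        h for h in hits
        if h.query == query
        and h.evalue < MAX_EVALUE
        and h.percent_identity >= min_percent_identity
    ]
    return sorted(kept, key=lambda h: h.bit_score, reverse=True)

# Made-up example hits for illustration only.
hits = [
    Hit("geneA", "geneB", 1e-30, 62.0, 210.0),
    Hit("geneA", "geneC", 5e-3, 28.0, 41.0),
    Hit("geneA", "geneD", 0.5, 25.0, 30.0),   # fails the E-value cutoff
]

default_subjects = [h.subject for h in likely_homologs(hits, "geneA")]
strict_subjects = [h.subject for h in likely_homologs(hits, "geneA", 30.0)]
```

Raising min_percent_identity plays the role of the "Min. Homolog Percent Identity" preference: here it drops geneC, which passes the E-value cutoff but has only 28% identity.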

Orthologs

Orthologs are defined in IMG as genes in different genomes that are identified as homologs and are also bidirectional best hits (that is, top reciprocal BLAST hits). In addition to the E-value below 10^-2 required of any homolog, the bidirectional best hit requirement provides a more conservative criterion for inclusion.
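A bidirectional best hit test can be sketched as below. This is an illustrative sketch under assumptions: hits are given as (query, subject, evalue, bit_score) tuples, "best" is taken to mean highest bit score per target genome, and the function names and the genome_of mapping are hypothetical rather than part of IMG.

```python
def best_hits(hits, genome_of, max_evalue=1e-2):
    """Map (query gene, target genome) to the top-scoring subject gene.
    Assumes 'best' means highest bit score among hits below the cutoff."""
    best = {}
    for query, subject, evalue, bit_score in hits:
        if evalue >= max_evalue:
            continue
        key = (query, genome_of[subject])
        if key not in best or bit_score > best[key][1]:
            best[key] = (subject, bit_score)
    return {key: subject for key, (subject, _) in best.items()}

def bidirectional_best_hits(hits, genome_of):
    """Return unordered gene pairs from different genomes that are
    each other's best hit."""
    best = best_hits(hits, genome_of)
    pairs = set()
    for (query, target_genome), subject in best.items():
        query_genome = genome_of[query]
        if query_genome == target_genome:
            continue  # orthologs must lie in different genomes
        if best.get((subject, query_genome)) == query:
            pairs.add(tuple(sorted((query, subject))))
    return pairs

# Made-up example: two genomes with two genes each.
genome_of = {"a1": "G1", "a2": "G1", "b1": "G2", "b2": "G2"}
hits = [
    ("a1", "b1", 1e-50, 300.0), ("b1", "a1", 1e-50, 300.0),
    ("a1", "b2", 1e-10, 90.0),  ("b2", "a1", 1e-10, 90.0),
    ("a2", "b2", 1e-20, 120.0), ("b2", "a2", 1e-20, 120.0),
]
orthologs = bidirectional_best_hits(hits, genome_of)
```

In this example a1 also hits b2, but b2's best hit in G1 is a2, so only the pairs (a1, b1) and (a2, b2) are reported.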

Ortholog Groups

Ortholog groups provide a general idea of the structure of ortholog relationships across genomes. The Markov Cluster Algorithm (MCL) is used to cluster bidirectional best hit relationships into groups of genes related by similarity. A conservation score is calculated to normalize the strength of similarity. It is essentially the bit score between two sequences divided by the bit score of a sequence when BLASTed against itself (the self bit score). More precisely, it is

    cons_score(x,y) = bit_score(x,y) / max( bit_score(x,x), bit_score(y,y) )

where x and y are two separate sequences. The mcl tool is run with default parameters. These groupings do not represent protein families in any rigorous fashion, but provide an initial, automatically generated view of how genes group together.
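As a worked illustration of the conservation score, the sketch below turns pairwise and self bit scores into weighted edges of the kind a clustering step would consume. The function names, the input layout, and the numbers are assumptions made for the example; the clustering itself is left to the mcl tool and is not reimplemented here.

```python
def conservation_score(bit_xy, bit_xx, bit_yy):
    """Bit score of x against y, normalized by the larger self bit score."""
    return bit_xy / max(bit_xx, bit_yy)

def weighted_edges(pair_scores, self_scores):
    """Turn {(x, y): bit score} into a sorted list of (x, y, conservation score)."""
    return sorted(
        (x, y, conservation_score(bits, self_scores[x], self_scores[y]))
        for (x, y), bits in pair_scores.items()
    )

# Made-up bit scores for illustration only.
self_scores = {"a1": 300.0, "b1": 250.0, "c1": 400.0}
pair_scores = {("a1", "b1"): 150.0, ("a1", "c1"): 100.0}

edges = weighted_edges(pair_scores, self_scores)
# ("a1", "b1") scores 150 / max(300, 250) = 0.5
# ("a1", "c1") scores 100 / max(300, 400) = 0.25
```

Dividing by the larger self score keeps the value between 0 and 1 whenever no pairwise bit score exceeds the self scores, so long and short proteins can be compared on one scale.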

Paralogs

Paralogs are defined in IMG as genes in the same genome of interest that are identified as homologs and are also reciprocal hits (BLAST hits to each other, but not necessarily best hits). The criteria for inclusion as a paralog are an E-value below 10^-5, at least 30% sequence identity, and a bit score greater than or equal to 50.
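The paralog criteria can be sketched as a reciprocal-hit filter. This is a hedged sketch: it assumes all hits come from a single genome, treats the 30% identity figure as a minimum, and requires the thresholds to hold in both directions; the function names and tuple layout are hypothetical.

```python
MAX_EVALUE = 1e-5          # thresholds stated in the text
MIN_PERCENT_IDENTITY = 30.0
MIN_BIT_SCORE = 50.0

def passes_paralog_cutoffs(evalue, percent_identity, bit_score):
    return (evalue < MAX_EVALUE
            and percent_identity >= MIN_PERCENT_IDENTITY
            and bit_score >= MIN_BIT_SCORE)

def paralog_pairs(hits):
    """hits: (query, subject, evalue, percent_identity, bit_score) tuples
    within one genome. Returns unordered pairs that hit each other,
    with both directions passing the cutoffs (an assumption)."""
    passing = {
        (q, s) for q, s, e, pid, bits in hits
        if q != s and passes_paralog_cutoffs(e, pid, bits)
    }
    return {tuple(sorted((q, s))) for q, s in passing if (s, q) in passing}

# Made-up example hits for illustration only.
hits = [
    ("g1", "g2", 1e-40, 55.0, 180.0), ("g2", "g1", 1e-40, 55.0, 180.0),
    ("g1", "g3", 1e-8, 35.0, 60.0),   # no reciprocal hit from g3
    ("g4", "g5", 1e-9, 25.0, 70.0), ("g5", "g4", 1e-9, 25.0, 70.0),  # identity too low
]
pairs = paralog_pairs(hits)
```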

Paralog Groups

Paralog groups provide a general idea of the structure of paralog relationships within a genome. They use the same MCL clustering algorithm as ortholog groups on pairwise relationships.

Positional Clusters

Positional clusters consist of genes in a given genome of interest that lie within 300 base pairs of each other. By themselves, positional clusters do not imply conservation. They are used in computing the Conserved Region Score.
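One way to form such clusters is a single pass over genes sorted by start coordinate. The sketch below rests on assumptions the text does not settle: distance is measured as the gap between the end of the cluster so far and the start of the next gene, a gap of exactly 300 bp still counts, strand is ignored, and all genes lie on one sequence. The function name and input layout are hypothetical.

```python
def positional_clusters(genes, max_gap=300):
    """genes: (gene_id, start, end) tuples on one sequence.
    Chains genes into a cluster while the gap to the next gene
    is at most `max_gap` base pairs."""
    clusters = []
    cluster_end = None
    for gene_id, start, end in sorted(genes, key=lambda g: g[1]):
        if clusters and start - cluster_end <= max_gap:
            clusters[-1].append(gene_id)
            cluster_end = max(cluster_end, end)
        else:
            clusters.append([gene_id])  # gap too large: start a new cluster
            cluster_end = end
    return clusters

# Made-up coordinates for illustration only.
genes = [
    ("g1", 100, 900),
    ("g2", 1000, 1800),   # 100 bp after g1
    ("g3", 2500, 3000),   # 700 bp after g2, so a new cluster starts
    ("g4", 3100, 3500),   # 100 bp after g3
]
clusters = positional_clusters(genes)
```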

Conserved Regions

A Conserved Region Score is assigned to a gene by counting the additional neighboring ortholog pairs within a positional cluster. It measures the strength of region conservation between two neighborhoods. Strong conservation suggests that the genes are functionally related.
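The counting step might look like the sketch below. It is only one plausible reading of the description: for a gene and its ortholog in another genome, it counts the other ortholog pairs that connect the two genes' positional clusters. The function name, the cluster_of mapping, and the sample data are hypothetical.

```python
def conserved_region_score(gene, ortholog, cluster_of, ortholog_pairs):
    """Count additional ortholog pairs linking the positional cluster of
    `gene` to the positional cluster of `ortholog`.

    cluster_of: gene id -> set of gene ids in its positional cluster
    ortholog_pairs: set of frozenset({gene_x, gene_y}) ortholog pairs
    """
    neighbors_x = cluster_of[gene] - {gene}
    neighbors_y = cluster_of[ortholog] - {ortholog}
    return sum(
        1
        for nx in neighbors_x
        for ny in neighbors_y
        if frozenset((nx, ny)) in ortholog_pairs
    )

# Made-up neighborhoods for illustration only.
cluster_of = {
    "a1": {"a1", "a2", "a3"},
    "b1": {"b1", "b2", "b3"},
}
ortholog_pairs = {
    frozenset(("a1", "b1")),   # the pair being scored
    frozenset(("a2", "b2")),   # a neighboring pair inside both clusters
    frozenset(("a3", "b9")),   # b9 lies outside b1's cluster
}
score = conserved_region_score("a1", "b1", cluster_of, ortholog_pairs)
```

Here the score is 1: only the a2/b2 pair has both members in the two neighborhoods.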

Phylogenetic Occurrence Profile

A phylogenetic occurrence profile shows the pattern of occurrence of a specific gene across multiple genomes. The occurrence profiles of multiple genes across these genomes can then be compared visually.

Each gene is assigned a fixed-length ordered vector. The positions in the vector correspond to the list of selected genomes, which are phylogenetically ordered. An "A", "B", or "E" in a given position (Archaea for "A", Bacteria for "B", Eukarya for "E") indicates the presence of the gene itself, if the position corresponds to the genome the gene belongs to, or of an ortholog otherwise. A "." in a given position indicates the absence of the gene or an ortholog in that genome. Orthologs in IMG are implemented as bidirectional best hits (see Orthologs above).
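The profile vector can be illustrated with a short sketch. The function name, the argument layout, and the genome list are assumptions for the example; the letters and the "." convention follow the description above.

```python
def occurrence_profile(gene_genome, genomes, ortholog_genomes):
    """Build the occurrence profile string for one gene.

    gene_genome: id of the genome the gene belongs to
    genomes: phylogenetically ordered list of (genome_id, domain_letter),
             with domain_letter one of "A", "B", "E"
    ortholog_genomes: set of genome ids in which the gene has an ortholog
    """
    return "".join(
        letter if (genome_id == gene_genome or genome_id in ortholog_genomes) else "."
        for genome_id, letter in genomes
    )

# Made-up genome selection for illustration only.
genomes = [("G1", "A"), ("G2", "B"), ("G3", "B"), ("G4", "E")]

# A gene from G2 with orthologs in G1 and G4, and none in G3.
profile = occurrence_profile("G2", genomes, {"G1", "G4"})
# profile == "AB.E"
```

Because every profile has one position per selected genome in the same order, the profiles of several genes can be lined up and compared column by column.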

Note that the tools based on occurrence profiles, the Phylogenetic Occurrence Profile Viewer and the Similar Phylogenetic Occurrence Profile Search Tool, should not be confused with the Phylogenetic Profiler. The latter is used to select genes in a genome of interest that have identical occurrence profiles, where the profiles are based on gene similarities computed using unidirectional hits. For the other tools, occurrence profile similarity is based on a less restrictive, flexible matching of profiles and uses gene similarities computed using bidirectional best hits.


 
