How to Annotate a Subsystem

Introduction

One of the objectives in constructing the SEED was to provide a framework in which annotations of genes could be studied and improved. The ability to annotate a specific subsystem (such as glycolysis, the ribosome, or histidine biosynthesis) is a key component of our strategy to "Annotate a 1000 Genomes". In this project, we plan on cooperatively developing detailed annotations of several hundred subsystems, using hundreds of genomes in the annotation of each subsystem. Once we have the detailed encoding of a subsystem based on several hundred organisms, we believe that this encoding can be easily used to add any number of new organisms. This will allow us to automatically, and accurately, annotate the thousands of genomes that will become available during the coming years.

There are two basic styles in annotating a subsystem: the style used by a novice to produce a reasonably accurate encoding, and the style used by an expert to produce as accurate an analysis as possible. Ultimately we are, of course, seeking expert encodings for each of the cellular subsystems. However, we also believe that producing basically accurate encodings using less skilled participants is a worthy and useful objective. In fact, we consider annotation of subsystems to be a basic activity of a practicing biologist and believe that most biologists should learn how to do it well. Certainly students should spend some time learning both how to do it and how to exploit the results.

We will begin by conveying the goals of, and approaches to, developing "reasonably accurate" encodings quickly. After covering this topic, we will move on to discuss what more is needed to develop expert annotations.

Developing a Reasonably Accurate Subsystem Annotation Quickly

The steps involved in constructing a subsystem annotation quickly are as follows (we cover the details for each step below):

Select a subsystem and specify the precise roles that make up the subsystem.
Carefully annotate a few key organisms that are known to include the subsystem. For each gene included in the subsystem, make sure the annotations are made precisely as you desire, and project these exact formulations of function to as many orthologous genes as you can.
Add the well-annotated genomes to the spreadsheet, fixing any errors that show up.
Add a somewhat larger group of organisms, again fixing errors that show up.
Add all of the remaining organisms.
Remove organisms which appear not to have the subsystem.
Make a pass to fill in as many as possible of the cells that are missing genes.
Finally, mark the genomes that appear to be completely and properly annotated as "checked".

This is a sketch of how to make a subsystem annotation (that will contain errors). It should be viewed as a starting point for an expert annotation. Let us now go through these steps in detail.

Getting Started

The first task is to initiate a new subsystem and fill in the roles. To do this, you first have to get to the Subsystems Page. You get to this from the initial SEED search page, using a section near the bottom that says "Work on Subsystems". You will need to fill in a user ID (something like "master:JohnD"), and then click on the button that says Work on Subsystems.

Once you get to the Subsystems Page, you should see a table describing the existing subsystems (if any) and a spot where you can initiate a new subsystem (look for the words To Start a New Subsystem Annotation). You will need to type in a name for the subsystem. Make the name descriptive (something like Histidine Biosynthesis or Chemotaxis).

Once you click on start new subsystem annotation you should see a blank spreadsheet that you will be filling in. It begins with slots to write Functional Roles. You have room for five functional roles, but whenever you click on update spreadsheet it will update the form, making sure that you have room to add more functional roles.

You begin by filling in the functional roles you wish to annotate, worded exactly as you want them to appear in the annotations. Genes will be annotated as coding for proteins that have these roles, or occasionally several of these roles (in the case of multifunctional proteins). Here are some minimal guidelines that we suggest people follow in assigning text strings for each role:

Although genes can be multifunctional, functional roles are not. You should have a separate role for each "catalytic domain" or, in the case of nonmetabolic subsystems, for each function encoded by a peptide.
We tend to use Swiss-Prot or UniProt descriptions as functional roles. These are usually expertly curated, carefully chosen wordings. On the other hand, it is really your choice.
When you include an EC number, include it in the form (EC x.x.x.x) (e.g., (EC 2.7.1.11) for phosphofructokinase).
You may be encoding a piece of metabolism in which alternatives exist. You can make a list of roles that includes all alternatives, understanding that most organisms will not contain genes corresponding to every role. Alternatively, you can encode two subsystems (each containing one alternative). We have a utility that can be used to glue two subsystem spreadsheets together into a single spreadsheet. Sometimes it is much more convenient to work on separate small pieces and then just combine them.
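
The formatting guidelines above lend themselves to a quick mechanical check before you type the roles into the spreadsheet. What follows is a minimal sketch in Python, not part of the SEED; the function name check_role and the two patterns are hypothetical, and we assume that a dash may stand in for an unassigned EC field (as in "(EC 2.7.1.-)").

```python
import re

# A well-formed EC tag such as "(EC 2.7.1.11)". Assumption: a dash may
# replace an unassigned trailing field, e.g. "(EC 2.7.1.-)".
EC_TAG = re.compile(r"\(EC \d+(?:\.(?:\d+|-)){3}\)")
# Anything that looks like an EC reference, well-formed or not.
EC_LOOSE = re.compile(r"\bEC[ :]?\d+(?:\.(?:\d+|-)){0,3}", re.IGNORECASE)


def check_role(role):
    """Return a list of human-readable problems found in a role string."""
    problems = []
    if role != role.strip():
        problems.append("leading or trailing whitespace")
    if "  " in role:
        problems.append("repeated spaces")
    well_formed = [m.span() for m in EC_TAG.finditer(role)]
    for m in EC_LOOSE.finditer(role):
        # A loose hit is acceptable only if it sits inside a well-formed tag.
        inside = any(s <= m.start() and m.end() <= e for s, e in well_formed)
        if not inside:
            problems.append(f"EC number not in (EC x.x.x.x) form: {m.group(0)!r}")
    return problems


if __name__ == "__main__":
    for role in ["6-phosphofructokinase (EC 2.7.1.11)",
                 "6-phosphofructokinase EC 2.7.1.11 "]:
        print(repr(role), check_role(role))
```

An empty list means the role string passed these particular checks; it says nothing about whether the wording itself is a good choice.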

Once you have typed in the roles, click on update spreadsheet and proceed to the next step.

Annotate a Few Key Organisms

Usually, there is at least one organism in which the subsystem is well-annotated, in the sense that what is actually known has been captured correctly (often this is E. coli or B. subtilis). We suggest that you begin by looking at the annotations in these key organisms, and make sure they agree (exact match -- no changes in case, spacing, punctuation, etc.) with the descriptions you used for the roles when you began the subsystem spreadsheet. As you make sure they agree, for each gene look at the closest 100 FIG Ids (i.e., look at similarities with maxN set to 100, Max Expand set to 100, and Just FID Ids checked). Check the assignments made to the similar genes, and when you can do it both accurately and quickly, make them consistent with the descriptions you are using for the roles. This may take anywhere from 10 minutes to several hours, but it will start having a dramatic effect on the consistency of annotations for this very limited set of genes.

Add the Well-Annotated Organisms to the Spreadsheet

Now, you can update the spreadsheet by selecting the set of organisms that you believe are well-annotated (we suggest starting with just one or two) and clicking on update spreadsheet. The spreadsheet should get updated, and rows should appear corresponding to the organisms you selected. The FIG Ids in each cell should correspond to the genes having the roles. Frequently, some cells will turn up as empty, although you know for sure they should be filled. This is normally due to mismatches in the role descriptions and the functions assigned to the genes (the last time this happened, there was an extra space in one of the role descriptions, but there are many ways you can make seemingly identical strings mismatch). One easy way to track down what is going on is to check the box show missing and update the spreadsheet. This will cause links to be generated for each cell that has no genes in it. By clicking on one of these links you will cause the SEED to look for candidate genes. You should pursue these links, trying to correct whatever errors led to the failure to fill in the cell. Once you have corrected all of the errors, check the box fill and update the spreadsheet. The entries that match will be added to the previously empty cells. Continue until the cells appear to be filled in correctly.
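
Because the match between a role and a gene's assigned function must be exact, near-misses (an extra space, a change in case, a stray punctuation mark) are a common reason for empty cells. The sketch below shows one rough way to spot them outside the SEED. It is an illustration only: the function names are hypothetical, and it assumes you already have the role strings and the assigned function strings available as plain Python lists.

```python
import re


def normalize(text):
    """Lowercase the text and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def near_misses(roles, assigned_functions):
    """Pair each assigned function with a role it almost, but not exactly, matches."""
    by_key = {normalize(role): role for role in roles}
    hits = []
    for function in assigned_functions:
        role = by_key.get(normalize(function))
        if role is not None and role != function:
            hits.append((function, role))
    return hits


if __name__ == "__main__":
    roles = ["Histidinol dehydrogenase (EC 1.1.1.23)"]
    functions = [
        "Histidinol  dehydrogenase (EC 1.1.1.23)",   # extra space
        "histidinol dehydrogenase (EC 1.1.1.23)",    # different case
        "Histidinol dehydrogenase (EC 1.1.1.23)",    # exact match, not reported
    ]
    for function, role in near_misses(roles, functions):
        print(f"{function!r} should probably read {role!r}")
```

The normalization is deliberately crude (it also discards the dots in EC numbers), so each reported pair should be treated as a candidate to inspect by hand rather than as a confirmed error.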

Add a Larger Set of Organisms

Now, we suggest that you add four or five somewhat diverse organisms, and see if the spreadsheet fills in as you anticipated. If not, use show missing to set up links to pursue missing entries. You can also use show duplicates to look at cases where multiple genes are included in a cell. These are often legitimate, but if you know what you are doing, it might be useful to look for clear misannotations. If you are not familiar with the details of the subsystem, leave duplicate checking to experts.
You can use Add Genomes with Solid Hits to add all genomes for which all of the cells can be filled in. This is sometimes useful, but often specific roles are optional, and when some are missing those genomes will not automatically get selected.
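
The selection rule just described (every cell can be filled) can be written down in a few lines, which also makes it plain why a genome lacking one optional role is skipped. This is a sketch of the rule as we have described it, not the SEED's actual code; the function name, the data layout, and the genome and gene identifiers are made up for illustration.

```python
def genomes_with_solid_hits(roles, functions_by_genome):
    """Return the genomes in which every role is matched by at least one gene.

    functions_by_genome maps a genome ID to a dict of {gene ID: function string}.
    """
    wanted = set(roles)
    solid = []
    for genome, functions in functions_by_genome.items():
        found = wanted.intersection(functions.values())
        if found == wanted:
            solid.append(genome)
    return solid


if __name__ == "__main__":
    roles = ["Role A", "Role B", "Role C (optional in practice)"]
    functions_by_genome = {
        "genome_1": {"g1.peg.1": "Role A", "g1.peg.2": "Role B",
                     "g1.peg.3": "Role C (optional in practice)"},
        # genome_2 lacks the third role, so it is not a "solid hit".
        "genome_2": {"g2.peg.1": "Role A", "g2.peg.2": "Role B"},
    }
    print(genomes_with_solid_hits(roles, functions_by_genome))
```

Genomes skipped for this reason can still be added by hand and examined with show missing.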

Add All of the Remaining Organisms

You can easily select all of the remaining organisms and add them in a single shot. This will usually lead to many potential errors -- cells that are empty, duplicates, and cells that are filled for organisms that clearly do not have the subsystem. In each case, these may or may not be real errors. In most cases, they are things that should be examined and thought about.
As you correct annotations and update the spreadsheet, empty cells will fill in. However, if there should be two entries in a cell, but only one appears, the second will never be automatically added; only empty cells get filled in by the fill or aggressively fill operations. You need to fill in these entries manually, or erase the existing entry before updating the spreadsheet (which will cause all matching entries to get re-added if you specify fill).
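
The behavior of fill described above can be summarized in a short sketch. The code is a model of the rule as stated here (only empty cells are filled), under an assumed data layout; it is not taken from the SEED, and the names fill, row, and candidates are hypothetical.

```python
def fill(row, candidates):
    """Return a copy of one spreadsheet row with only its empty cells filled.

    row maps each role to the list of gene IDs currently in that cell;
    candidates maps each role to the gene IDs whose function matches the role.
    """
    updated = {}
    for role, genes in row.items():
        if genes:
            # A non-empty cell is left alone, even if more genes now match.
            updated[role] = list(genes)
        else:
            updated[role] = list(candidates.get(role, []))
    return updated


if __name__ == "__main__":
    row = {"Role A": ["peg.1"], "Role B": []}
    candidates = {"Role A": ["peg.1", "peg.7"], "Role B": ["peg.3"]}

    # "Role B" is filled, but "peg.7" is not added to the occupied "Role A" cell.
    print(fill(row, candidates))

    # Erasing the existing entry first lets both matching genes be re-added.
    row["Role A"] = []
    print(fill(row, candidates))
```

The second call illustrates the workaround mentioned above: once the cell has been emptied, every matching entry is added back.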

Remove Organisms that Do Not Have Functioning Versions of the Subsystem

Now, you should make a pass to erase genomes that you believe do not have versions of the subsystem. To erase a genome, just erase the genome number; this will cause the whole row to go away when the spreadsheet is updated.
The basic idea of the spreadsheet is to document variants of the subsystem as they are understood. When a row is present with missing entries, it can simply reflect alternatives, uncertainty, or a strong belief that the subsystem is present but that some gene has not yet been properly characterized and annotated. You should remove only those rows that represent organisms that you believe do not have the subsystem.

Make One Last Pass to Check Missing Genes

For cells that you believe represent roles that are not optional, you should check for missing entries one last time. If you cannot find the gene, you should make a note that it represents a serious difficulty. Possible explanations include: the gene may not have been called (but is there on the chromosome); it may have a form that was never characterized (these are the "gems" we are looking for; the clue in this case is when several genomes are all missing the same role); or it might be in a frameshifted gene that was misannotated. Make one last pass, and keep detailed notes (in the notes section, so others may benefit from your observations).
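
The clue mentioned above, several genomes all missing the same role, is easy to look for once the spreadsheet is available as data. The sketch below counts empty cells per role. It assumes the spreadsheet has been exported into a plain Python dict; that layout and the function name are our own illustration, not a SEED feature.

```python
from collections import Counter


def missing_role_counts(roles, spreadsheet):
    """Count, for each role, how many genomes have an empty cell for it.

    spreadsheet maps a genome ID to a dict of {role: list of gene IDs}.
    Returns (role, count) pairs, most frequently missing role first.
    """
    counts = Counter()
    for cells in spreadsheet.values():
        for role in roles:
            if not cells.get(role):
                counts[role] += 1
    return counts.most_common()


if __name__ == "__main__":
    roles = ["Role A", "Role B"]
    spreadsheet = {
        "genome_1": {"Role A": ["g1.peg.1"], "Role B": []},
        "genome_2": {"Role A": ["g2.peg.4"], "Role B": []},
        "genome_3": {"Role A": [], "Role B": ["g3.peg.9"]},
    }
    for role, count in missing_role_counts(roles, spreadsheet):
        print(f"{role}: missing in {count} genome(s)")
```

A role that is missing across many otherwise complete rows is a good candidate for a closer look and for an entry in the notes section.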

Mark the Prototypical Genomes

The key use of annotated subsystems will be to annotate new genomes. To make this work, we need to have a detailed record of the "acceptable variants" of the subsystem. As you go through the list of genomes in the spreadsheet, when you find one that is correctly annotated, mark it as checked. These entries may have empty cells for essential roles (indicating a missing gene), but you should not check a genome that has genes that were just not called or for which frameshifts prevented accurate recognition of the gene. When you are done, you have two classes of genomes: those that can be used to describe the set of variants of the subsystem (i.e., of working versions of the subsystem) and those that you believe have working versions, but also have uncorrected errors.
Once this step has been completed, you have finished a crude version of the basic spreadsheet. It represents a very valuable contribution for the following reasons:

The annotations will be far more consistent.
Known variants of the subsystem have been characterized (albeit imperfectly in many cases).
The existence of missing genes (new forms of enzymes) is usually made vivid.
We have data that can be treated as relatively reliable (this is often essential for creating "learning sets" or "evaluation sets").
It forms a starting point for an expert who wishes to do the really difficult analysis required to accurately characterize the subsystem.

The Second Phase: Expert Annotation

The second phase of subsystem annotation requires extensive expertise from years of research, coupled with the capability of doing experimental verification of conjectures. It represents orders of magnitude more effort. The goal of the expert annotation should ultimately be to clarify the evolutionary history of the genes in the protein families that include those implementing the functional roles in the spreadsheet. That is, many roles will be implemented by genes that have paralogs that implement closely related functions (which are not included in the subsystem). It is necessary to clarify the precise differences in function. Further, a truly complete analysis will clarify the detailed evolutionary history: where each duplication, horizontal transfer, cluster breakup, and so forth occurred.
There is broad disagreement about whether or not characterization of the evolutionary history is essential. It is certainly true that accurate identification of function can be accomplished at far less effort than working out the details of the evolutionary history. And, effort that is spent working out the evolutionary events (which is extremely difficult and demanding work in many cases) could certainly be expended working out the functional characterizations of other subsystems. It is not our place to assign relative values to these different activities (but, in the few cases in which the evolutionary history has largely been pieced together, one gets glimpses of a new level of understanding).
We will attempt to add features to the SEED to support expert annotation of subsystems. We believe that in many cases experts will use the SEED as a framework for curating the data that will be included in a progression of review articles. That is how it is intended to work.