Application Build: 246 Database Build: 2008-04
High-Throughput GoMiner Detailed Output Descriptions
A subdirectory is created within the data directory, and its name is based upon
that of the total gene file. It contains two classes of objects:
- Several files that summarize the overall results from all of the changed gene files:
- Simple summary of the results from each experiment (i.e., changed gene file) [TOTALGENESFILE.report]. Each line contains the name of the changed gene file and the number of categories whose FDR is less than or equal to a user-defined threshold. Lines are sorted in ascending order by this number of categories, so that the user can immediately ascertain which experiments are of interest for further analysis. Separate results are given for underexpressed, overexpressed, and changed genes for each changed gene file that was input in the two-column format.
- Tab-delimited files containing a matrix whose rows are categories and whose columns are names of changed gene files [total.txt.change.series.CIM]. The rows are the union of categories whose FDR meets a user-defined threshold T for at least one of the changed gene files. Each entry is given as T - 0.9*FDR (or as 0 if T - FDR is negative). Since the matrix data are intended to be viewed in clustered image map (CIM) programs, or in Excel using the 3D columns visualization format, this transformation makes it easier to view the important categories (i.e., those with low FDR), since the eye is drawn to the higher 3D columns. For example, each experiment might represent a different time point in a time series. In this case, the Excel visualization would show the coming and going of categories that were important at different times during the experiment. The same tab-delimited files can be used as input to any one of several publicly available CIM programs in order to view the co-clustering of experiments and categories. The transformation of the FDR values again draws the eye to the important categories in the CIM representation. Categories are clustered together if their sets of changed and unchanged genes are similar.
- Tab-delimited text files in gene category export (tvt) format that compare the total file against itself instead of against the changed file [TOTALGENESFILE.total.tvt]. Each row contains one pair of GO category and gene, for each gene in the total file that has a mapping.
- A subdirectory [CHANGEDGENESFILE.dir] corresponding to each of the changed gene files (for each file type mentioned, there is one each for the underexpressed, overexpressed, and changed genes if the changed gene file was in the two-column format):
- Files
- Tab-delimited text files in summary export (se) format [CHANGEDGENESFILE.change]. Each row contains a statistical summary about a category. The information, which is explained in detail, is sorted in ascending order by the one-sided Fisher exact p value. For the statistically important categories, this order is essentially identical with that for the FDR, which is given in the last column of each row.
- Tab-delimited text files in gene category export (gce) format [CHANGEDGENESFILE.change.gce]. Each row contains one pair of GO category and gene, as well as the (highly redundant) statistical information from the summary export line for the corresponding category.
- Excel format versions of the summary and gene category export format files [CHANGEDGENESFILE.change.xls], [CHANGEDGENESFILE.change.gce.xls]. The Excel versions contain an additional header row that is hyperlinked to detailed explanations. The GO category name is hyperlinked to the AmiGO browser and the gene name is hyperlinked to NCBI Entrez.
- Tab-delimited text files containing a tabulation whose rows are categories and whose columns are genes. Each entry is either 0, 1, or -1, indicating absence (0) or presence (overexpressed = 1; underexpressed = -1) of that changed gene in that category [CHANGEDGENESFILE.change.gce.CIM, CHANGEDGENESFILE.under.gce.CIM, CHANGEDGENESFILE.over.gce.CIM, CHANGEDGENESFILE.change.gce.updown.CIM]. Only the first of these files is generated when the changed-genes input file is in one-column format, in which case the entries are either 1 or 0. These files can be used as input to any one of several publicly available clustered image map programs in order to view the co-clustering of categories and changed genes. This visualization allows the user to determine which categories achieved statistical significance by virtue of containing essentially the same set of changed genes, or conversely, which significant categories are essentially "orthogonal" with respect to the set of changed genes. The user can also determine which sets of genes tend to have a high degree of synteny in the significant categories. If the minimum requirements for creating the CIM are present, the images will be generated and placed in the corresponding directory [CHANGEDGENESFILE.change.gce.CIM.dir, CHANGEDGENESFILE.under.gce.CIM.dir, CHANGEDGENESFILE.over.gce.CIM.dir, CHANGEDGENESFILE.change.gce.updown.CIM.dir].
- Subdirectories
- Transcription factor binding sites [CHANGEDGENESFILE.change.gce.TF] files. These binding sites are calculated by the ABCC GRID Promoter Analysis tool.
- A set of files for each statistically significant category [CATEGORYNAME.TF] whose columns are the changed genes in the category and whose rows are transcription factor binding site|motif. Each entry indicates either absence (0) or presence (1 for overexpressed, -1 for underexpressed) of the transcription factor binding site in the promoter region of the gene.
- A file integrating all of the statistically significant categories [TF.FDR] whose columns are statistically significant categories and whose rows are the union of the transcription factor binding site|motif for all the genes in these categories. Each entry is the FDR for the category. Missing values are indicated by a dash.
- A transformed version of the above file [TF.FDR.CIM] whose entries are given as T - 0.9*FDR (or as 0 if T - FDR is negative). Missing values are indicated by a dash. The transformation of the FDR values draws the eye to the important categories in the CIM representation.
- A second transformed version of the above file [TF.FDR.CIMminer.txt] whose entries are given as T - 0.9*FDR (or as 0 if T - FDR is negative). Missing values are indicated by a zero (0). The transformation of the FDR values draws the eye to the important categories in the CIM representation. Representing missing values by '0' rather than by '-' allows compatibility with CIMminer at the minor cost of ambiguity between missing values and statistically insignificant categories.
- A file integrating all of the statistically significant categories [TFcat.CIM] whose columns are statistically significant categories and whose rows are the union of the transcription factor binding site|motif for all the genes in these categories. Each entry is the number of genes in the category whose promoter region contains the transcription factor binding site.
- A file integrating all of the statistically significant categories [TFavg.CIM] whose columns are statistically significant categories and whose rows are the union of the transcription factor binding site|motif for all the genes in these categories. Each entry is the average occurrence of the transcription factor binding site in the category, i.e., the fraction of genes in the category whose promoter region contains the site. The average is computed by dividing the count in the previous file [TFcat.CIM] by the number of genes in the category.
- Subdirectories for integration of all changed gene files [CHANGEALL.dir]. These allow the user to determine the specific genes that are in the categories of interest, integrated across all of the changed-gene files, without needing to go back and search through a large number of individual gce files. They also permit analysis of this relationship by clustering.
- Files
- A tab-delimited file [CHANGEALL1] containing a matrix whose rows are category|changed gene pairs and whose columns are names of changed gene files. The categories are restricted to members of the union of categories whose FDR meets a user-defined threshold T for at least one of the changed gene files. Each entry is the 'signed FDR' of the category for its changed gene file column. The 'signed FDR' is preceded by a plus sign if the changed gene was overexpressed in the changed gene file for that column, and by a minus sign if the changed gene was underexpressed in the changed gene file for that column. Missing values (i.e., the gene in the row was not changed in the changed gene file for that column) are indicated by a dash.
- A transformed version [CHANGEALL2] of the above file, in which the entries are given as T - 0.9*FDR preceded by the sign of the 'signed FDR' (or as 0 if T - FDR is negative). Missing values are indicated by a dash. The transformation of the FDR values draws the eye to the important categories in the CIM representation.
- Subdirectories
- [CHANGEALL1.dir] CHANGEALL1 is "decomposed" into separate files for each of its constituent categories.
- [CHANGEALL2.dir] CHANGEALL2 is "decomposed" into separate files for each of its constituent categories.
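
The T - 0.9*FDR transformation used in the CIM files above, together with its signed variant in CHANGEALL2, can be illustrated with a short sketch. This is a minimal Python illustration of the stated formula, not code taken from High-Throughput GoMiner; the function name transform_fdr, its parameters, and the threshold value are hypothetical, and the handling of missing values (dash or zero) is omitted.

```python
def transform_fdr(fdr, threshold, sign=1):
    """Map an FDR to the display value T - 0.9*FDR, or 0 if T - FDR < 0.

    sign is +1 (overexpressed) or -1 (underexpressed). It only matters for
    the signed variant and defaults to +1 for the unsigned files.
    """
    if threshold - fdr < 0:
        # Category does not meet the threshold: shown as 0.
        return 0.0
    return sign * (threshold - 0.9 * fdr)


T = 0.10  # hypothetical user-defined FDR threshold
for fdr in (0.0, 0.05, 0.10, 0.25):
    print(fdr, round(transform_fdr(fdr, T), 4))
```

Because of the 0.9 factor, a category whose FDR is exactly T still receives a small positive value (0.1*T) rather than 0, so it remains distinguishable from categories that fail the threshold.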
Debug Information
If you used the debug option, then additional information about the execution
of High-Throughput GoMiner is retained. The contents of the DEBUG file depend on
the DEBUG parameter in the config file. The most reasonable setting is 2.
If the setting is 1, then there is an excess of information that is probably not
too useful. Only the most confident user should set 0, as then error conditions
might go unnoticed.
The first thing to do after completion of a High-Throughput GoMiner run is to
look at the bottom of the DEBUG file. If it says "NORMAL TERMINATION",
you can be pretty sure everything went well. You might want to scroll through the
whole file quickly and look for things that went to STDERR but that did not generate
a problem that was picked up by the shell.
On the other hand, if there was a detectable problem, then the end of the
DEBUG file is where that will be reported, since everything stops when a problem
is detected. Usually, the error will be reported with an error code.
You can work your way backwards (or upwards) to get some hints of
where the problem specifically occurred, and either correct your dataset or
config parameters, or send us a bug report.
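
The check of the DEBUG file described above can be automated in a few lines. The following is a minimal Python sketch, assuming that the termination message appears verbatim within the last few non-blank lines of the DEBUG file; the function name check_debug_log, the tail_lines parameter, and the sample log lines are hypothetical.

```python
def check_debug_log(text, tail_lines=20):
    """Return (ok, tail) for the contents of a DEBUG file.

    ok   -- True if "NORMAL TERMINATION" appears in the last non-blank lines
    tail -- those last non-blank lines, for working backwards from an error
    """
    lines = [line.rstrip() for line in text.splitlines() if line.strip()]
    tail = lines[-tail_lines:]
    ok = any("NORMAL TERMINATION" in line for line in tail)
    return ok, tail


# Example with an in-memory log (the log lines are made up for illustration).
sample = "reading config\nrunning analysis\nNORMAL TERMINATION\n"
ok, tail = check_debug_log(sample)
if ok:
    print("run completed normally")
else:
    print("problem detected; inspect the end of the DEBUG file:")
    for line in tail:
        print("  " + line)
```

For a real run, the text would be read from the DEBUG file produced by that run. Even when the check passes, it can still be worth scrolling through the whole file for messages that went to STDERR.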