Here we present ISOLA, an Italian SOLAnaceae genomics resource available at [4]. ISOLA is designed as a multi-level computational environment and meets the need to collect, integrate and explore high-throughput and heterogeneous biological data, with the aim of enhancing the quality of the data gathered.
ISOLA is currently organized into two main levels: the genome and expression levels (Figure 1). The cornerstone of the genome level is represented by the tomato genome draft sequences. The basic elements of the expression level are the Solanaceae EST collections and the oligonucleotide probe-sets of the tomato expression micro-arrays [5,6].
| Figure 1 Representation of the multilevel structure in ISOLA. Data sources, tools and methods of the platform are indicated. |
‘Basic’ tools are designed and included in the multi-level environment to enhance data quality and increase the information content of the data. ‘Subsidiary’ tools are layered over the existing multi-level environment and exploit the synergy between the levels.
Each level can be independently accessed through specific Web applications, which allow user-driven data investigation and permit overall as well as detailed views of specific information (Figure 2). Continuous cross-talk between the genome and the expression levels is based on data source sharing and on tools which accomplish data integration and convergence.
| Figure 2 Snapshot of the Web-based application for navigating ISOLA. ISOLA is accessible through two different gateways. |
Herein, we describe the organization and the maintenance of the platform which has been designed in order to be extended, through pre-defined entry points, to the proteome and metabolome levels.
Genome level
BAC sequences retrieval
An automated pipeline has been implemented in order to ensure a daily retrieval of new S. lycopersicum BAC sequences from the GenBank repository, which are used to feed the genome annotation process. The current collection (May 2007) comprises 129 BAC sequences.
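A daily query of this kind could be sketched as follows. This is only an illustration and not the ISOLA pipeline itself: it assumes that the NCBI E-utilities `esearch` endpoint is used, and the search term, the one-day window and the function name `build_daily_bac_query` are hypothetical choices.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (an assumption: the retrieval mechanism
# actually used by the pipeline is not described in the text).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_daily_bac_query(days_back: int = 1, retmax: int = 500) -> str:
    """Build an esearch URL for tomato BAC records modified in the last days."""
    params = {
        "db": "nuccore",
        # Hypothetical search term; the real filter may differ.
        "term": '"Solanum lycopersicum"[Organism] AND BAC[Title]',
        "reldate": days_back,  # restrict to records from the last N days
        "datetype": "mdat",    # apply the window to the modification date
        "retmax": retmax,      # maximum number of identifiers returned
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"
```

Scheduling the call once per day (for example with a cron job) and fetching only accessions not yet stored locally would keep the BAC collection up to date.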
BAC annotation
The BAC annotation process aims to identify coding regions and other genetic elements along the S. lycopersicum genome sequences.
The protein-coding ‘gene finding’ process is based exclusively on spliced alignments of ESTs to the genome sequences. To accomplish this task, ESTs from different plant sources (Solanaceae and Rubiaceae species) and the corresponding tentative consensus sequences (TCs), which have been generated by assembling the ESTs in a cluster [7], are used. The available data are described in Table 1, where the two rightmost columns (ESTs/TCs mapped) report the number of ESTs and TCs per species aligned to the 129 BAC sequences. ESTs of non-native origin (i.e. EST data compiled in the local PotatEST and SOLEST databases) are included in the analysis so as to improve the detection of coding regions which lack source-native EST evidence and to support comparative approaches.
| Table 1 Statistics on the EST collections. |
Non-coding RNAs (ncRNAs) from the Rfam collection [8] are also aligned to the genome sequences. We identified 105 RNA matches which correspond to 48 different gene loci. They represent 10 distinct RNA types whose occurrence and distribution are reported in Table 2.
| Table 2 Occurrence and distribution of non-coding RNA families in the tomato genome draft sequences. |
The TIGR Solanaceae Repeats database [9] is the resource selected for the identification of repetitive sequences in the S. lycopersicum genome.
The repeats identified on the 129 BAC sequences are listed in Additional file 1 according to the TIGR Plant Repeat Database classification schema. We identified 264 matches corresponding to 71 different genome loci. All the genomic regions identified, except the one detected on BAC AC171733 (26388::27727) and labelled as unclassified, correspond to the transposable element (TE) superclass. Among these, 66 are retrotransposons, while the remaining 4 are members of the transposon class. Within the retrotransposon class, 6 are unclassified, 20 are Ty1-copia and 40 are Ty3-gypsy. As is usual in plant genomes [10], transposable elements are ubiquitous and heterogeneous in tomato as well. No other types of repeats have been aligned to the BACs because the tomato genome sequencing is preliminarily focused on the euchromatic regions [3], which are considered more gene-rich.
The tomato Genome Browser Database
The BAC sequences collected in ISOLA are annotated and released to the scientific community through the Gbrowse [11] Web application at [4]. Tracks showing annotations and other features are displayed and cross-linked to other local or external databases which can be explored through Web interfaces (Figure 2).
Aligning Arabidopsis thaliana RNA sequences
The availability of the full genome sequence of Arabidopsis thaliana is a cornerstone for plant biology [12,13]. We aligned all the RNA sequences from the model plant Arabidopsis to the tomato genome in order to identify genes that are conserved between the two species. However, only 326 out of 31,249 RNA sequences were mapped onto the S. lycopersicum BACs.
The majority of the RNA sequences (324) do not overlap any genome region covered by Solanaceae ESTs. These RNA sequences are annotated as tRNAs and identify 24 distinct gene loci. Only the AT3G08520 sequence, annotated as “a structural constituent of ribosome”, overlaps S. lycopersicum ESTs, on two different BACs assigned to chromosome 7. This indicates that, given the large phylogenetic distance between tomato and Arabidopsis, mRNA sequences are hardly identified when the RNA-to-genome alignments are filtered using thresholds of 90% identity and 80% coverage (see Methods).
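The effect of the two thresholds can be illustrated with a minimal sketch. The definitions used here are assumptions, since the Methods are not reproduced in this section: identity is taken as identical bases over aligned bases, and coverage as aligned bases over the full query length. The `Alignment` record and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    query_id: str
    query_length: int    # full length of the aligned RNA or EST sequence
    aligned_length: int  # number of query bases included in the alignment
    matches: int         # number of identical aligned bases


def passes_thresholds(aln: Alignment,
                      min_identity: float = 0.90,
                      min_coverage: float = 0.80) -> bool:
    """Keep an alignment only if it reaches both thresholds."""
    if aln.aligned_length == 0 or aln.query_length == 0:
        return False
    identity = aln.matches / aln.aligned_length
    coverage = aln.aligned_length / aln.query_length
    return identity >= min_identity and coverage >= min_coverage


def filter_alignments(alignments):
    """Return the alignments that satisfy both thresholds."""
    return [a for a in alignments if passes_thresholds(a)]
```

Under these assumed definitions, a divergent Arabidopsis mRNA aligned over most of its length but with fewer than 90% identical bases would be discarded, which is in line with the low number of mapped sequences reported above.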
Aligning Affymetrix Tomato Genome Array probe-sets
The Tomato Genome Array is designed specifically to monitor gene expression in tomato and other Solanaceae species [14]. The comprehensive array consists of over 10,000 probe sets to interrogate over 9,200 S. lycopersicum transcripts [5].
To date, 4,445 out of 112,528 probes are mapped to the tomato genome. Because some probes are aligned to the BAC sequences more than once, the number of matches identified (5,827) is higher than the number of distinct probes aligned.
In particular, 680 oligonucleotide probes do not overlap any S. lycopersicum EST. Of these, 197 probes lie in genomic regions where non-native ESTs have been aligned: S. tuberosum (164 probes), S. habrochaites (12 probes), S. chacoense (12 probes) and both S. habrochaites and S. tuberosum (9 probes). The remaining 483 oligonucleotide probes do not overlap any EST.
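The grouping of probes by the origin of the overlapping ESTs can be sketched as a simple interval comparison. This is a minimal illustration, assuming that probe and EST alignments are available as 1-based inclusive coordinates on a BAC; the `Feature` record and the function names are hypothetical, and the pairwise scan is chosen for clarity rather than speed.

```python
from collections import Counter
from typing import NamedTuple


class Feature(NamedTuple):
    bac: str    # BAC accession the feature is aligned to
    start: int  # 1-based inclusive start
    end: int    # 1-based inclusive end
    label: str  # probe identifier, or source species for an EST


def overlaps(a: Feature, b: Feature) -> bool:
    """True if two features share at least one base on the same BAC."""
    return a.bac == b.bac and a.start <= b.end and b.start <= a.end


def classify_probes(probes, ests):
    """Map each probe match to the set of species whose ESTs overlap it."""
    return {
        probe: frozenset(e.label for e in ests if overlaps(probe, e))
        for probe in probes
    }


def tally(classification):
    """Count probe matches per combination of overlapping species."""
    return Counter(classification.values())
```

An empty set in the result marks a probe match that overlaps no EST at all, while a set containing two species corresponds to a category such as "both S. habrochaites and S. tuberosum".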
Expression level
EST data processing
We collected all the EST data from Solanaceae species available in dbEST [15] (Table 1). In particular, the data are from six species of the genus Solanum, four of which belong to the subgenus Lycopersicon; five species of the genus Nicotiana; two species of the genus Capsicum; and one species of the genus Petunia. We also considered ESTs from two species of the genus Coffea (family Rubiaceae), because coffee and tomato share common gene repertoires, as revealed in [16].
A specific basic tool has been designed to remove over-represented EST sequences from each of the 16 collections, clipping the original datasets to produce non-redundant EST collections (Table 1, column 6). These EST collections are independently processed by the ParPEST pipeline [7] in order to i) group ESTs that tag the same gene and generate one tentative consensus sequence (TC) per putative transcript and ii) determine a preliminary functional annotation of both ESTs and TCs.
EST sequence databases
The EST database architecture is relational. The database stores raw EST sequences and library details, clustering information, and all the features which describe EST alignments within clusters. The EST set in a cluster can be assembled into multiple TCs [17], so the number of TCs is usually larger than the number of clusters (Table 1). The total set of putative transcripts is created by combining the TCs and the singleton EST sequences (sESTs). The putative transcripts are annotated according to similarity versus protein and RNA family databases. A standard classification is provided using the Gene Ontology vocabulary [18] and the Enzyme Commission numbers [19]. Specific Web applications allow transcripts to be dynamically organized into enzyme classes and to be mapped on the fly onto the KEGG metabolic pathways [20]. Statistics of the different sequence categories per species are reported in Table 1. In Figure 3 we report information on the functional annotation of the most representative transcript collections, i.e. S. lycopersicum and S. tuberosum.
| Figure 3 The figure enumerates statistics on the functional annotation versus the UniProt, the Rfam, the Enzyme and the Gene Ontology databases. The number of enzymes involved in known metabolic pathways is also reported. |
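The relation between clusters, TCs and putative transcripts described in this section can be made concrete with a small sketch. The `Cluster` record and the function name are hypothetical and do not reflect the actual database schema.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    cluster_id: str
    # Identifiers of the consensus sequences assembled from this cluster;
    # one cluster may yield more than one TC.
    tcs: list = field(default_factory=list)


def putative_transcripts(clusters, singleton_ests):
    """Combine all TCs with the singleton ESTs (sESTs)."""
    tcs = [tc for cluster in clusters for tc in cluster.tcs]
    return tcs + list(singleton_ests)
```

Because a cluster can contribute several TCs, the number of TCs returned is at least the number of non-empty clusters, and the putative transcript set is their union with the sESTs.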
Comparing Tomato Genome Array probe-sets to EST sequences
ESTs from S. lycopersicum are compared to i) the Affymetrix Tomato Genome Array probe-sets [5] and ii) the EST dataset and the cDNA clones which have been used to build the TOM1 array [6].
Of the 112,528 probes from the Affymetrix Tomato Genome Array, 101,743 have at least one match with an EST sequence. In total, the matches account for 735,124 hits.
Of the 17,015 TOM1 cDNA clones, 9,696 align to at least one EST sequence. The total number of hits is 1,057,491.
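The two figures reported for each comparison (queries with at least one match, and total hits) can be derived from a list of pairwise matches as in the following sketch. The input format, a sequence of (query identifier, EST identifier) pairs, and the function name are assumptions made for illustration.

```python
from collections import Counter


def summarize_hits(hits):
    """Return (number of distinct queries matched, total number of hits).

    `hits` is an iterable of (query_id, est_id) pairs, one per match.
    """
    per_query = Counter(query_id for query_id, _ in hits)
    return len(per_query), sum(per_query.values())
```

The total exceeds the number of matched queries whenever a probe or clone aligns to several ESTs, which is expected when many ESTs tag the same transcript.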
Value-added data-gathering
The higher-quality collections obtained by the local data processing are organized into dedicated repositories and are provided to the community as value-added data through both Web applications and FTP services. ISOLA provides: i) the collections of non-redundant Solanaceae ESTs and the corresponding computationally defined transcripts, used to sample the species-specific transcriptome space; ii) the definition of the Solanaceae proteomes, including functional annotations and open reading frame detection, enriching the still poor collection of Solanaceae proteins available from general databases; iii) gene models from tomato, necessary to increase the meagre gene information available to build a training set for gene predictors [20-22]. In order to obtain a reliable and number-consistent collection of gene models, we implemented the GeneModelEST software [24], a subsidiary tool in ISOLA.
GeneModelEST requires the genome coordinates of the spliced alignments of ESTs and TCs, which have been independently aligned along the tomato genome draft sequences. From these, GeneModelEST detects non-overlapping TC sequences which are consistently supported by EST alignments and are therefore evidence of expressed genome regions. These ‘expressed’ loci are used to select highly confident gene models. Furthermore, the software exploits the TC functional annotations stored in the EST databases to check whether the ‘expressed’ loci represent full-length products. Overlapping TCs are explored, as they could represent alternative transcripts, but are excluded from the selection of gene models because they provide ambiguous information.
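The selection logic described above can be outlined in a few lines. This is a simplified sketch and not the GeneModelEST implementation: it treats each alignment as a single span on a BAC, ignoring exon structure, and it assumes that "supported" means overlapped by at least `min_support` EST alignments. The `Span` record and the function names are hypothetical.

```python
from typing import NamedTuple


class Span(NamedTuple):
    name: str
    bac: str
    start: int  # 1-based inclusive
    end: int    # 1-based inclusive


def overlaps(a: Span, b: Span) -> bool:
    """True if two spans share at least one base on the same BAC."""
    return a.bac == b.bac and a.start <= b.end and b.start <= a.end


def select_expressed_loci(tcs, ests, min_support=1):
    """Return TCs that overlap no other TC and have enough EST support."""
    selected = []
    for i, tc in enumerate(tcs):
        # Overlapping TCs may be alternative transcripts; they are set aside.
        if any(j != i and overlaps(tc, other) for j, other in enumerate(tcs)):
            continue
        support = sum(1 for est in ests if overlaps(tc, est))
        if support >= min_support:
            selected.append(tc)
    return selected
```

A fuller version would compare exon boundaries between each TC and its supporting ESTs, and would then use the stored functional annotation to check for full-length products.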
In the current update of ISOLA, 339 S. lycopersicum TCs have been selected because they are consistently supported by EST evidence [24], and they are displayed as additional tracks in Gbrowse. Among these TCs, 50 cover at least 95% of the length of the most similar protein sequence; 96 cover at least 50%; 145 cover less than 50% of the matching protein; and the remaining 48 show no significant similarity to any known protein. If the TCs from other tomato species are considered, a further 59 loci are located. The number accordingly increases to 262 loci if the potato TC sequences are also evaluated.
Only the TCs covering at least 95% of the length of the matching protein are selected as reliable gene models for the training of gene predictors. They account for a total of 111 gene models.
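The coverage classes used to rank the selected TCs can be expressed as a small classification function. This is a sketch under the assumption that coverage is computed as the number of protein residues covered by the TC divided by the protein length; the function name, the argument names and the class labels are hypothetical.

```python
from collections import Counter


def coverage_class(covered_residues, protein_length):
    """Classify a TC by the fraction of its best protein hit that it covers.

    Pass None for both arguments when the TC has no significant protein hit.
    """
    if covered_residues is None or not protein_length:
        return "no significant similarity"
    fraction = covered_residues / protein_length
    if fraction >= 0.95:
        return ">=95%"
    if fraction >= 0.50:
        return "50-95%"
    return "<50%"


def tally_classes(records):
    """Count TCs per class; `records` holds (covered_residues, protein_length)."""
    return Counter(coverage_class(c, p) for c, p in records)
```

Only the TCs falling in the first class would be retained as training gene models. The four class counts reported above for S. lycopersicum (50, 96, 145 and 48) sum to the 339 selected TCs.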