Adaptation of the MapMan ontology to biotic stress responses: application in solanaceous species

Background
The results of transcriptome microarray analysis are usually presented as a list of differentially expressed genes. As these lists can be long, it is hard to interpret the desired experimental treatment effect on the physiology of analysed tissue, e.g. via selected metabolic or other pathways. For some organisms, gene ontologies and data visualization software have been implemented to overcome this problem, whereas for others, software adaptation is yet to be done.

Results
We present the classification of tentative potato contigs from the potato gene index (StGI) available from Dana-Farber Cancer Institute (DFCI) into the MapMan ontology to enable the application of the MapMan family of tools to potato microarrays. Special attention has been focused on mapping genes that could not be annotated based on similarity to Arabidopsis genes alone, thus possibly representing genes unique for potato. 97 such genes were classified into functional BINs (i.e. functional classes) after manual annotation. A new pathway, focusing on biotic stress responses, has been added and can be used for all other organisms for which mappings have been done. The BIN representation on the potato 10 k cDNA microarray, in comparison with all putative potato gene sequences, has been tested. The functionality of the prepared potato mapping was validated with experimental data on plant response to viral infection. In total 43,408 unigenes were mapped into 35 corresponding BINs.

Conclusion
The potato mappings can be used to visualize up-to-date, publicly available, expressed sequence tags (ESTs) and other sequences from GenBank, in combination with metabolic pathways. Further expert work on potato annotations will be needed with the ongoing EST and genome sequencing of potato.
The current MapMan application for potato is directly applicable to the analysis of data obtained with the potato 10 k cDNA microarray from TIGR (The Institute for Genomic Research), but can also be used by researchers working on other potato gene sets. The potato mapping file and the stress mapping diagram are available from the MapMan website [1].


Background
The output of microarray statistical data analysis is often in the form of a list (or lists) of differentially expressed genes. Depending on the null hypothesis that is being tested and the experimental treatments that have been carried out, these lists can vary in length but often enough these are too long for a rigorous manual inspection. This poses a problem of complexity of interpretation. Lists can be condensed by organising them according to their known or suspected function, but this requires gene ontologies, which cannot be automatically extended from one organism to another, especially when they are not closely related. In addition, transcriptome data analysis is being combined with proteome (or metabolome) data analysis which only adds to the complexity of interpretation. Development of new, more reliable methods of data analysis and visualization will enable easier interpretation of results and thus a greater contribution to explaining the biological problem. Various visualization tools that help the data analysts and the biologists are available today, from GenMapp, Pathway Processor and GeneXpress (reviewed in [2]) to KaPPA-View [3] and VANTED [4]. Using them, it is possible to find trends that would be less apparent from using only lists of genes [5]. While they are suitable for some organisms, their usefulness for plant organisms can be restricted because they have often been developed for microbial or animal systems and thus have categories that are irrelevant for plant systems and lack plant-specific pathways and processes [5,6].

MapMan organization
A plant-specific visualization tool, MapMan, has been developed to overcome this problem. Initially it was developed for genes from Arabidopsis thaliana that are present on the Affymetrix 22 K array [6]. Its main purpose is to organize and display experimental data sets or results onto diagrams of the users' choice [5,6]. It consists of two modules, (i) Scavenger module and (ii) ImageAnnotator.
The Scavenger module is a gene ontology, in which genes are assigned based on their annotation into largely nonredundant and hierarchically organised BINs. Each BIN consists of items of similar biological function and can be further split into subBINs, corresponding to submodes of the biological function [7]. The original BIN assignments for A. thaliana were based on publicly available gene annotations from TIGR (The Institute for Genomic Research) using a process which involved alternation between automatic recruitment and manual correction [6]. The resulting BINs are shown in Table 1; these are broken down by current versions into > 1200 subBINs.
The ImageAnnotator module uses the classifications from the Scavenger module in the form of mapping files in order to display data on various diagrams of the user's choice [5]. A mapping file for an organism includes, but is not limited to, these categories: BIN code; BIN name; identifier; description. The identifier includes gene names or clone names with their descriptions, i.e. the names which link the mapping file with the experiment file.
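To make the mapping file layout described above more concrete, the following Python sketch reads a small tab-separated table with the four listed categories and groups identifiers by BIN code. The tab delimiter, the header names, the column order and the example rows are assumptions made for illustration only; the actual MapMan mapping file format may differ.

```python
import csv
import io
from collections import defaultdict

# Hypothetical example content. Header names, column order and rows are
# illustrative assumptions, not the actual MapMan file format.
EXAMPLE_MAPPING = (
    "BINCODE\tNAME\tIDENTIFIER\tDESCRIPTION\n"
    "20.1\tstress.biotic\tclone_A\texample description A\n"
    "20.1\tstress.biotic\tclone_B\texample description B\n"
    "35.2\texample BIN name\tclone_C\texample description C\n"
)


def read_mapping(handle):
    """Return {BIN code: [identifiers]} from a tab-separated mapping table."""
    bins = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        # The identifier is the key that links the mapping file
        # to the rows of an experiment file.
        bins[row["BINCODE"]].append(row["IDENTIFIER"])
    return dict(bins)


bin_to_ids = read_mapping(io.StringIO(EXAMPLE_MAPPING))
```

Grouping by BIN code in this way is what allows expression values from an experiment file to be looked up by identifier and displayed per BIN.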
The ImageAnnotator also uses diagrams for data display. They are obtained with the MapMan software [1] or can be self-made and then included in MapMan as described in [5].

Potato annotation
The potato genome has yet to be fully sequenced. Thus, at present only an approximation of the potato transcriptome is available in the form of the potato gene index that contains expressed sequence tags (ESTs) from diverse tissues. The potato gene index (StGI [8]) clusters ESTs into tentative contigs (TCs) after removing low quality ESTs [9]. The information in StGI is constantly being updated, with 13 releases since July 2000. Not all gene functions within gene indices are well characterized. The StGI database contains clone sequence data (in GenBank) and the corresponding tentative consensus sequence (TC) of the clones that theoretically represent one gene.
The TIGR potato 10 k array is the only microarray platform that is currently publicly available for researchers studying potato. Together with several control clones, the microarray contains 15,264 clones. Each cDNA clone is spotted in two spatially separated replicates thus summing up to 30,528 spots on each microarray. For a general purpose microarray design it is expected that clones from every BIN should be present on the microarray.
We considered a MapMan implementation for potato beneficial because it facilitates biological interpretation, supports ontology-based statistical data analysis (see e.g. [10]) and provides users with a global overview of the results. We have also added a new mapping of the plant's response to biotic stress in order to facilitate studies of biotic interactions.

Results and Discussion
Given the usability of potato microarrays in various potato experimental systems [11][12][13], we have implemented MapMan to visualize potato transcriptomic data. The sequences and the annotations of the potato gene index (Solanum tuberosum gene index, StGI) version 10 were BLASTed against Arabidopsis proteins (release TAIR 6). In this way, every potato clone was assigned up to ten best matching Arabidopsis proteins.
The Arabidopsis proteome was chosen for comparison because (i) the original MapMan Scavenger module was constructed for Arabidopsis, (ii) the Arabidopsis genome has been fully sequenced and (iii) potato is phylogenetically most similar to Arabidopsis among plant species with known genomes. The BLAST results were written to a file, which was then modified to contain the potato unigene name and its annotation from StGI, the protein domain description from StGI, the best matching Arabidopsis gene names and the corresponding E value, information on the presence of the clone on the TIGR potato 10 k microarray, and the BIN assignment from Arabidopsis in order to match the original BIN assignment for every gene name. Potato clone annotations were checked manually against the matching Arabidopsis entry; if they differed and the E value was reasonably low (threshold 10^-15), the BIN assignment was left as for Arabidopsis. When E values were higher than the chosen threshold (between 10^-10 and 10^-15), clones were assigned manually to corresponding BINs, on the basis of sequence and literature searches.
Some of the problems encountered in converting MapMan for a potato clone set were due to species-specific differences in metabolism. Further, potato is a model organism for plant-pathogen interactions and for physiological processes like tuberization, dormancy and sprouting [14]. Therefore expert input is needed for more detailed BIN structuring and for preparing schemes for such specific processes.
The final mapping file had 43,408 entries that represent the 38,239 different sequences in the potato gene index; of these 15,817 (around 36%) are present on the TIGR potato 10 k microarray. The percentage of clones in a BIN compared to all StGI clones, as well as the numbers of clones from a BIN that are present on the 10 k microarray, are shown in Table 1.

Classifying biotic stress responses in potato
In order to enable easier data visualization and interpretation of gene expression in potato-virus and other biotic interactions, the genes which had been classified as potentially being involved in biotic stress (BIN 20.1) were further subdivided into respiratory burst, receptors, signalling, kinases, regulation of transcription, heat shock proteins, pathogenesis-related (PR) proteins, secondary metabolism and miscellaneous functions. A sub-subBIN comprising proteinase inhibitors was added to a subBIN of PR proteins. New subBINs reflect the pathogen's signal transduction pathway and the plant's response to infection (Figure 1). Since some genes that are involved in response to stress, e.g. proteinase inhibitors, are involved in constitutive processes as well [15], the classification was based on the level of involvement of the gene (e.g. recognition, signalling etc.) rather than on its molecular function. 509 unigenes were mapped into BIN 20.1. Of those, slightly over 50% belong to receptors subBIN, con-

From the 1582 clones assigned to BIN 35.1, that are represented on the potato 10 k microarray, 52 clones were assigned to corresponding BINs in the case where the potato annotation (together with high homology and coverage) was more specific than that for Arabidopsis. Of these 52 clones, eight were annotated to more than one subBIN. The majority of newly assigned clones belonged to BIN 20, which is not surprising since, in the current work, the emphasis was put on this group of genes.

• 1705 were similar to Arabidopsis sequences coding for proteins annotated only as "expressed_proteins", and that could not be mapped into other existing bins (group II)

97 clones from subBIN 35.2 were mapped to more appropriate BINs. The majority of the newly mapped clones were potato or other Solanaceae-specific (Table 3).
Given our special focus on genes involved in biotic stress, the largest group of newly annotated clones (35) was mapped to BIN 20.1 and its subBINs. Examples of Solanaceae-specific transcripts include genes for metallocarboxypeptidase inhibitor, a potato PR-10a protein [19], several potato cysteine proteinase inhibitors [15] and potato polyphenol oxidase.

BIN representation on the potato microarray
Since TIGR potato 10 k cDNA microarrays are used by several researchers working on potato and other Solanaceae species, we investigated how various BINs and subBINs are represented on the potato microarray.
Figure 1. Changes in expression during responses of plant samples to pathogens. The plant's reaction to biotic stress involves a few steps: after the initial signal input from the pathogen which is recognized by the related receptors (putative R genes), transcription of the cascade of the plant defence mechanism is triggered, including oxidative stress changes. Inside the cell, signals are transmitted to lead to the production of defence molecules (PR-proteins, heat shock proteins and secondary metabolites). Genes with experimental indication of involvement in the biotic stress are gathered on the main panel (coloured with dark grey), while genes and pathways that are putatively involved in biotic stress pathway are shown on the left and right sides (coloured in light grey). a) Potato samples 30 minutes after inoculation with potato virus Y. b) Tobacco samples 24 hours after inoculation with M. sexta. In both cases, the signal after infection is expressed as a ratio relative to the signal in uninfected controls, converted to a log2 scale, and displayed. The scale is shown in the figures.

Globally, around 36% (15,817 out of 43,408) of potato sequences from the StGI database are present on the potato microarray. Twelve BINs out of 33 (see Table 1 in bold) are completely covered (BIN and all the subBINs) on the microarray. In other words, at least one clone from each of these BINs and from each of their subBINs is present on the microarray. There were only minor discrepancies such as (i) very small subBINs with few entries that had no 10 k microarray clone representative (e.g. 26.09 misc.glutathione S transferases with 1 entry), or (ii) half of a sub-subBIN sometimes being missing on the microarray (e.g. subBIN 17.3.1.01 hormone metabolism.brassinosteroid.synthesis-degradation.reductase). This kind of discrepancy, where some subBINs are completely missing on the microarray while others are fully present, is usually found in sub-subBINs, representing single enzymatic functions, and not at the higher hierarchical BIN splits. Consequently, we can say that, in general, BINs are well represented on the TIGR microarray, enabling investigation of various physiological issues. Some improvements in this aspect could be incorporated in the next versions of TIGR potato microarrays.
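The coverage check described above can be sketched in a few lines of Python. This is a minimal illustration, assuming each mapped sequence is available as a (BIN code, present-on-array) pair; the function names and the toy input are hypothetical, and only the global totals (15,817 of 43,408) come from the text.

```python
def coverage_by_bin(entries):
    """Summarise array coverage per BIN code.

    entries: iterable of (bin_code, on_array) pairs, one per mapped sequence.
    Returns {bin_code: (n_on_array, n_total, fraction_on_array)}.
    """
    totals = {}
    present = {}
    for bin_code, on_array in entries:
        totals[bin_code] = totals.get(bin_code, 0) + 1
        present[bin_code] = present.get(bin_code, 0) + int(bool(on_array))
    return {
        code: (present[code], totals[code], present[code] / totals[code])
        for code in totals
    }


def uncovered_bins(coverage):
    """Return the BIN codes that have no representative clone on the array."""
    return sorted(code for code, (n_on, _, _) in coverage.items() if n_on == 0)


# Hypothetical toy input: the BIN codes appear in the text, the counts do not.
toy_entries = [
    ("20.1", True), ("20.1", False), ("20.1", True),
    ("26.09", False),
]
toy_coverage = coverage_by_bin(toy_entries)

# Global figure reported in the text: 15,817 of 43,408 mapped sequences.
global_percent = round(100 * 15817 / 43408, 1)
```

Running the same summary at each level of the BIN hierarchy would show whether gaps occur at higher BIN splits or only in small sub-subBINs.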

Experimental data
To demonstrate the functionality of the potato mapping for MapMan, we examined two experiments that assessed the complexity of the plant-pathogen interaction. Since plant responses to pathogen attack are still far from being completely understood [20], these experiments provide useful insight into the plant's reaction to pathogen infection.
A simple comparative design was used. In the first experiment, a potato cultivar resistant to potato virus Y^NTN (PVY^NTN) was tested. To show the versatility of the biotic stress pathway, a second, previously published experiment [21] was re-examined, in which coyote tobacco (Nicotiana attenuata) plants were tested for their reaction to a herbivorous insect, Manduca sexta. The aim of both experiments was to investigate the plant's biotic stress response at the transcriptome level. In both experiments, plants were divided into two groups: one group was virus-inoculated (potato) or insect-treated (tobacco), and the other was mock-inoculated (potato) or left untreated (tobacco). Potato and tobacco plants were harvested 30 minutes and 24 hours post infection, respectively. While the potato experiment was performed in-house, the raw data for the tobacco experiment were downloaded from the SGED database, in which around 50 Solanaceae studies are currently deposited [22]. Three biological replicates each of virus-infected and mock-infected samples were analysed using the TIGR potato 10 k microarray. Data were analyzed using R statistical software with the limma package [23]. The output of the analysis, a list of putatively differentially expressed genes (p < 0.05) with their respective M values, was visualized using MapMan (Figure 1a and 1b). Instead of having to analyze each gene separately, the difference in expression can now be visualized for a whole BIN. Biological hypotheses can thus be supported or rejected, since a larger picture that includes all the differentially expressed genes is seen.
For the potato PVY^NTN experiment, 575 clones were differentially expressed, of which 16 belong to the biotic stress response pathway (Figure 1a). It is easily seen that the clones belonging to the beginning (putative R genes) and the end (PR-proteins) of the pathway are mostly downregulated after viral infection. Moreover, several other genes putatively involved in biotic stress (such as heat shock proteins and transcription factors), together with genes involved in abiotic stress, have mostly been downregulated. This intriguing result becomes immediately obvious after inspecting the provided figure or by looking at changed categories. These subBINs had a low p-value for the Wilcoxon rank sum test, which confirms that the average response of a BIN is different from the response of all the other BINs [5]. It seems that 30 minutes post infection is too early for a plant to establish a real defence response. At this time, genes that are involved in other processes are upregulated (data not shown). This result is in concordance with previously published studies of cucumber mosaic virus infection of Arabidopsis thaliana, where at early stages of the response most of the significantly expressed genes were downregulated [24]. Additional experiments should be performed in order to confirm this hypothesis.
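The BIN-level comparison mentioned above can be illustrated with a small, dependency-free Python sketch of a two-sided Wilcoxon rank sum test. It uses the normal approximation with tie-averaged ranks and applies neither a tie correction nor a continuity correction, so it is a simplified stand-in and not necessarily what MapMan itself computes. The function names and the example log2 ratios are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks of values, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def rank_sum_test(in_bin, others):
    """Compare one BIN against all other genes.

    Returns (W, z, p): the rank sum of the BIN, its z score and the
    two-sided p-value under the normal approximation.
    """
    n1, n2 = len(in_bin), len(others)
    ranks = average_ranks(list(in_bin) + list(others))
    w = sum(ranks[:n1])
    mean_w = n1 * (n1 + n2 + 1) / 2
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean_w) / sd_w
    p = math.erfc(abs(z) / math.sqrt(2))
    return w, z, p


# Hypothetical log2 ratios: a mostly downregulated BIN versus all other genes.
bin_values = [-2.0, -1.5, -1.0]
other_values = [0.5, 1.0, 1.5, 2.0, 0.0]
w, z, p = rank_sum_test(bin_values, other_values)
```

A negative z score here indicates that the BIN ranks lower than the remaining genes, i.e. a tendency towards downregulation; for small BINs an exact test would be more appropriate than the normal approximation.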
A list of 875 differentially expressed clones was obtained after M. sexta attack on N. attenuata. Their involvement in the biotic stress pathway is presented in Figure 1b. Here, in contrast to the potato experiment, genes that are represented on our biotic stress pathway (10 clones) are almost all upregulated. Moreover, several other genes that are deeply involved in plant-pest interaction, namely the jasmonic acid pathway and genes involved in the synthesis of secondary metabolites, were upregulated, as discussed in [21]. However, whereas previously candidate genes had to be examined separately and functional classes had to be assigned based on the annotation, now this can be done with a few mouse-clicks by simply pasting lists [10,25]. As we see a quantitatively similar response as previously described, the adaptation of MapMan will help in classifying genes from potato and thus speed up array analysis significantly, especially in the area of pathogen interaction. A more thorough examination of the results could lead to the formation of interesting biological hypotheses on the mechanism of resistance/reaction to virus/insect infection. The observed difference in the two responses might be attributed to the different experimental systems: different plants, stressors and, most importantly, time after treatment. It seems that in the early response to a pathogen, plant defences have not yet been activated.

Conclusion
Tools that integrate microarray data with biological processes in which the genes are involved have been developed mostly for human and microorganism microarrays. Consequently, they do not include processes and metabolic pathways specific for plants (e.g. photosynthesis) and they cannot be used to visualize data obtained by plant microarrays. Additionally, plants are remarkably diverse and therefore additional pathways may need to be implemented for each species. Tools that are specific for plant metabolism have been developed [2][3][4] but many of them have been Arabidopsis-specific. MapMan is different as it is more flexible, enabling any organism to be mapped to the existing MapMan ontologies. It has already been adapted for use with tomato [7] and Medicago [26]. With the implementation of potato mappings to MapMan, visualization and interpretation of more complex biological data from experiments performed on potato will be easier. As current trends in molecular biology are focused on connecting results from different levels of '-omics' data analysis (e.g. transcriptomic with metabolomic data analysis), this will be possible with potato experiments too, since MapMan enables the visualization of transcriptomic and metabolomic data simultaneously [5,6]. Conversely, it will be possible to use our improved dissection of biotic stress in BIN 20.1 and the schematic representation of the plant's response to infection for MapMan mappings in other species (Arabidopsis, tomato, Medicago). Thus the results of complex analysis can be interpreted on an additional level, and biotic stress responses in different experimental systems can be compared. This can increase the understanding of plant response to pathogen or pest attack.

BLAST
The sequences and the annotations of the potato gene index (Solanum tuberosum gene index, StGI) version 10 were downloaded from the StGI database (38,239 unique potato unigene identifiers). The unigene sequences were then BLASTed (BLASTx, version 2.2.14) against Arabidopsis proteins release TAIR 6 (file TAIR6_pep_20051108, available from [27]) under default settings. In this way, every potato clone was assigned up to ten best matching Arabidopsis genes. The complete mapping file had 43,408 entries.
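As an illustration of keeping up to ten best matches per unigene, the Python sketch below parses BLAST results in the standard 12-column tabular layout (query, subject, percent identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, E value, bit score). That the results were available in this layout, and that "best" means lowest E value, are assumptions; the identifiers in the example are hypothetical.

```python
from collections import defaultdict


def top_hits(lines, max_hits=10):
    """Return {query: [(subject, evalue), ...]} with the lowest E values first.

    Assumes tab-separated BLAST tabular output, where column 1 is the query,
    column 2 the subject and column 11 the E value.
    """
    best = defaultdict(dict)  # query -> {subject: lowest E value seen}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        previous = best[query].get(subject)
        # A subject can appear several times (multiple alignments); keep its best.
        if previous is None or evalue < previous:
            best[query][subject] = evalue
    return {
        query: sorted(subjects.items(), key=lambda kv: (kv[1], kv[0]))[:max_hits]
        for query, subjects in best.items()
    }


# Hypothetical example rows; the identifiers are placeholders, not real hits.
example_rows = [
    "TC_example1\tAT1G00001.1\t80.0\t100\t20\t0\t1\t300\t1\t100\t1e-40\t150",
    "TC_example1\tAT1G00002.1\t60.0\t100\t40\t0\t1\t300\t1\t100\t1e-12\t70",
    "TC_example1\tAT1G00001.1\t50.0\t50\t25\t0\t301\t450\t101\t150\t1e-05\t40",
]
example_hits = top_hits(example_rows)
```

Because a unigene can keep several Arabidopsis matches, a mapping built this way can contain more entries than there are unigenes.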

Annotations
Following BLASTing, the file was modified in order to contain only the potato unigene name and its annotation from StGI, the protein domain description from StGI, if any, the best matching Arabidopsis gene names and the corresponding E value. Information on the presence of the clone on the TIGR potato 10 k microarray was included by adding an extra column to the final output, showing the potato clone name if present on the microarray. The BIN assignment from Arabidopsis was also included in order to match the original BIN assignment for every gene name.

Manual annotation evaluation
Every potato clone annotation, which was derived from its corresponding tentative contig, was checked manually in order to compare it with the description of the matching Arabidopsis entry. If it differed from the Arabidopsis description, the E value was checked. If the E value was reasonably low (threshold was chosen at 10^-15), the BIN assignment was left as for Arabidopsis. When E values were higher than the chosen threshold (between 10^-10 and 10^-15), clones were assigned manually to corresponding BINs, on the basis of sequence (NCBI BLASTx, nr database), protein domain information (PFAM [28], SMART [29]) and TIGR annotation. An additional literature survey was performed when needed. Where appropriate, BIN assignments were changed for original Arabidopsis mappings as well.
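The decision rule described above can be summarised as a short Python sketch. The thresholds come from the text; whether the boundaries are inclusive, how E values above 10^-10 were handled, and the function and label names are assumptions, so the last branch is deliberately left as "not specified".

```python
KEEP_THRESHOLD = 1e-15  # at or below this E value the Arabidopsis BIN is kept
MANUAL_UPPER = 1e-10    # upper end of the range that was curated manually


def bin_decision(annotations_differ, evalue):
    """Return the curation step suggested by the rule in the text.

    annotations_differ: True if the potato annotation differs from the
    description of the matching Arabidopsis entry.
    """
    if not annotations_differ:
        # Assumption: matching annotations need no further check.
        return "keep Arabidopsis BIN"
    if evalue <= KEEP_THRESHOLD:
        return "keep Arabidopsis BIN"
    if evalue <= MANUAL_UPPER:
        return "assign manually"
    # The text does not describe this case.
    return "not specified in the text"


decisions = [
    bin_decision(True, 1e-30),
    bin_decision(True, 1e-12),
    bin_decision(True, 1e-3),
]
```

In practice the manual step draws on sequence searches, protein domain information and literature, which a rule like this can only flag and not replace.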

Array analysis
A simple comparison design was applied (virus infected/insect treated versus mock inoculated/untreated samples). All calculations were done in R with the limma package [23,30]. Two normalizations were performed, loess [31] and vsn [32]. The intersection of the genes found to be differentially expressed after both loess and vsn normalization was used for further analysis.
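The final filtering step amounts to a set intersection. A minimal Python sketch, assuming the differentially expressed clones from each normalization are available as lists of identifiers (the function name and the identifiers are hypothetical):

```python
def consensus_hits(loess_hits, vsn_hits):
    """Return the clones called differentially expressed under both normalizations."""
    return sorted(set(loess_hits) & set(vsn_hits))


# Hypothetical clone identifiers, for illustration only.
loess_hits = ["clone_1", "clone_2", "clone_3"]
vsn_hits = ["clone_2", "clone_3", "clone_4"]
shared = consensus_hits(loess_hits, vsn_hits)
```

Taking the intersection is a conservative choice: a clone is retained only if the call does not depend on which normalization was applied.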