- Open Access
A flexible representation of omic knowledge for thorough analysis of microarray data
Plant Methods volume 2, Article number: 5 (2006)
In order to understand microarray data reasonably in the context of other existing biological knowledge, it is necessary to conduct a thorough examination of the data utilizing every aspect of available omic knowledge libraries. So far, a number of bioinformatics tools have been developed. However, each of them is restricted to deal with one type of omic knowledge, e.g., pathways, interactions or gene ontology. Now that the varieties of omic knowledge are expanding, analysis tools need a way to deal with any type of omic knowledge. Hence, we have designed the Omic Space Markup Language (OSML) that can represent a wide range of omic knowledge, and also, we have developed a tool named GSCope3, which can statistically analyze microarray data in comparison with the OSML-formatted omic knowledge data.
In order to test the applicability of OSML to represent a variety of omic knowledge specifically useful for analysis of Arabidopsis thaliana microarray data, we have constructed a Biological Knowledge Library (BiKLi) by converting eight different types of omic knowledge into OSML-formatted datasets. We applied GSCope3 and BiKLi to previously reported A. thaliana microarray data, so as to extract any additional insights from the data. As a result, we have discovered a new insight that lignin formation resists drought stress and activates transcription of many water channel genes to oppose drought stress; and most of the 20S proteasome subunit genes show similar expression profiles under drought stress. In addition to this novel discovery, similar findings previously reported were also quickly confirmed using GSCope3 and BiKLi.
GSCope3 can statistically analyze microarray data in the context of any OSML-represented omic knowledge. OSML is not restricted to a specific data type structure, but it can represent a wide range of omic knowledge. It allows us to convert new types of omic knowledge into datasets that can be used for microarray data analysis with GSCope3. In addition to BiKLi, by collecting various types of omic knowledge as OSML libraries, it becomes possible for us to conduct detailed thorough analysis from various biological viewpoints. GSCope3 and BiKLi are available for academic users at our web site http://omicspace.riken.jp.
Since microarray analysis was first developed as a technique for analyzing gene expression simultaneously [1, 2], functional investigation of genes has been actively carried out using microarrays and novel findings have been obtained. However, there is always a possibility that some gene functions to be discovered are overlooked by biologists analyzing the microarray data, because the amount of gene expression information detected by microarray is so vast that it is difficult to analyze the obtained data fully. Therefore, various methods and tools for analyzing microarray data have been developed, especially comparing microarray data with biological knowledge [3–6]. The importance of gene expression in biological networks (for example, metabolic pathways) is noted [7, 8]. Dahlquist et al.  have developed a tool which can display the gene expression profiles of microarray data on biological networks.
Regarding conceptually structured ontology of gene functions, the Gene Ontology Consortium  is providing a set of structured vocabularies for specific biological domains, which can be used to describe gene products in any organism. Doniger et al.  have developed a tool which can display the gene expression profile of microarray data on a directed acyclic graph of Gene Ontology (GO). On the other hand, Thimm et al.  have developed a tool which can display the gene expression of microarray data on metabolic pathways and other biological processes. GeneSpring (Silicon Genetics, Redwood City, CA, USA) can display microarray data on the figure of a gene positioned on a genome. However, these tools give priority to the display of a certain type of data and cannot analyze microarray data from multiple view points. It is desired that various form of biological knowledge are represented by a flexible language and can be used for microarray analyses by a single universal tool.
A number of bioinformatics tools have been developed. However, they are restricted to deal with only a few types of omic knowledge, e.g., pathways, interactions or gene ontology. Now that the varieties of omic knowledge are expanding, analysis tools need a way to handle any type of omic knowledge. Hence, we have designed the Omic Space Markup Language (OSML) format which is able to convey various elements of biological knowledge . OSML format is highly flexible, can describe a wide range of biological knowledge, and is designed to allow users to prepare their own data in the OSML format . In order to show, through examples, that OSML format data can express the metabolic pathway, directed acyclic graphs of GO terms, gene positions on genome, and protein-protein interaction, we have constructed a Biological Knowledge Library (BiKLi) by converting various biological knowledge and information sources into OSML format (Figure 1, Figure 2). BiKLi is available with GSCope3 for academic and non-profit users http://omicspace.riken.jp. To confirm the effectiveness of BiKLi and GSCope3, we have analyzed the microarray data of A. thaliana under drought stress treatments by applying BiKLi and GSCope3 . The statistical ranking function provided in GSCope3 is so useful that the metabolic pathway, genome area, or GO relating to a specific gene expression pattern can be easily discovered. Moreover, users can perform various analyses by using GSCope3 using their own original data represented by OSML.
Plant growth, viability, and fertilization are greatly affected by environmental stresses, such as drought. Plants respond and adapt to these stresses in order to survive. These stresses induce various biochemical and physiological responses in plants, which are followed by a change of gene expression. For this reason, analysis of gene expression under stress is important in plant molecular biology, biochemistry, and physiology. Hence, some biologists have studied gene expression in plants under these stressful conditions [12–15]. Moreover, studies on expression profiling under stress conditions using microarray technology have been published [16–18]. As mentioned above, the analysis of gene expression under stress is very important and many biologists are interested in this study. Hence, we analyzed the microarray data of A. thaliana under stress.
In this paper, we report and present the results of 7 K RIKEN Arabidopsis Full-Length (RAFL) cDNA microarray analysis under drought stress treatments  by using BiKLi and GSCope3, and further discuss the utility of the BiKLi for microarray data analysis.
Results and discussion
KEGG type metabolic pathway
Additional File 1 shows a list of significant correlations between the "functional Class" and clusters formed by batch-learning self-organizing maps (BL-SOM)  of the expression profile of microarray probes under drought conditions). In this table, header means as follows: A is the number of probes in the selected pathway ("functional Class") and the selected cluster, B is the number of probes out of the selected pathway and in the selected cluster, C is the number of probes in the selected pathway and out of the selected cluster, D is the number of probes out of the selected pathway and the selected cluster, P is the probability function, P' is Bonferroni Corrected P, and N is number of pathway ("functional Class"). The colour of expression means as follows: red is up-regulation while green is down-regulation. A higher colour chroma denotes higher value. The "functional Class" shows a sub-pathway in the metabolic pathway.
A strong correlation was discovered between the "carbon fixation pathway" and gene expression profile which reduced the transcription after some time passed (cluster ID is 0-1 and 0-0) (Additional File 1), especially the genes classified in cluster 0-1 and became concentrated at the "Calvin cycle"(Figure 3). This signifies an existing relationship between the "carbon fixation pathway" and down-regulated genes under drought stress conditions. Indeed, Seki et al.  have found out that many photosynthesis-related genes are down-regulated by drought stress.
A significant correlation between "starch and sucrose metabolism" and the gene expression profile which enhanced the transcription after a period of time was revealed (cluster ID is 10-2) (Additional File 1). Common genes between this "functional Class" and this cluster are polygalacturonase, glycosyl hydrolase family 32 (beta-fructosidase) (EC:126.96.36.199, At1g62660), sucrose synthase (EC:188.8.131.52, At3g43190), and trehalose-6-phosphate phosphatase (EC:184.108.40.206, At4g12430). Cushman and Bohnert  have suggested that disaccharides (e.g. sucrose, raffinose, or trehalose) probably function as osmolytes in protecting cells from dehydration. Seki et al.  have noted that galactinol synthase, raffinose synthase, sucrose synthase, and trehalose-6-phosphate synthase genes are up-regulated genes under drought stress conditions. Moreover, Seki et al.  have discussed the correlation between the data and the findings by Cushman and Bohnert .
In this study, we discovered that BiKLi and GSCope3 determined in a fast and efficient way the same findings which Seki et al. had obtained .
AraCyc type metabolic pathway
The KEGG type and AraCyc type metabolic pathways are biochemically different from each other. KEGG type metabolic pathway is focused on all organisms. On the other hand, AraCyc type metabolic pathway is focused on Arabidopsis thaliana only. Moreover, metabolic pathway can divide or branch into various patterns. Therefore, the two type pathways are not combined in this study.
Additional File 2 shows a list of significant correlations between the "functional Class" and the clusters formed by BL-SOM of the microarray probes' expression profile under drought conditions. Similar to the analysis using the KEGG type metabolic pathway dataset, a strong correlation was found between the "Calvin cycle" and the gene expression profile which later caused down-regulation of transcription (cluster ID is 0-1 and 0-0) (Additional File 2).
Another significant correlation was reported between the "serine-isocitrate lyase pathway" and the gene expression profile, which showed a weak peak two hours after the drought condition started (cluster ID is 3-0) (Additional File 2). Each gene encodes four enzymes: alanine-glyoxylate aminotransferase (EC:220.127.116.11, At2g13360), malate dehydrogenase (EC:18.104.22.168, At1g04410, At3g47520), phosphoenolpyruvate carboxylase (EC:22.214.171.124, At2g42600), and glycine hydroxymethyltransferase (EC 126.96.36.199, At4g32520). The enzymes act in the "serine-isocitrate lyase pathway" and are classified in the same cluster. However, the "serine-isocitrate lyase pathway" utilizes one carbon compound, such as formaldehyde in bacteria . In addition, these detected enzymes do not act as regulated enzymes, such as hydroxypyruvate reductase (EC:188.8.131.52), glycerate kinase (EC:184.108.40.206), malate-CoA ligase (EC:220.127.116.11), or malyl-CoA lyase (EC:18.104.22.168) in the "serine-isocitrate lyase pathway". Therefore, the correlation between "serine-isocitrate lyase pathway" and the genes with weak expression profile may not be important. However, these detected enzymes produce malate from L-glycine. Perhaps, the reason these genes are up-regulated after drought conditions started is because the drought stress induces malate biosynthesis from glycine, not the induction of "serine-isocitrate lyase pathway".
Another significant correlation was revealed between the "lignin biosynthesis" or the "suberin biosynthesis" and the gene expression profile which peaked two hours after introducing drought conditions (cluster ID is 6-2) (Additional File 2). Lignification of the cell wall and secondary wall formation make the cell strong . Therefore, it has been suggested that transcription of these genes is induced during drought stress because lignin synthesis can cause resistance to drought stress. In fact, lignin formation is observed at sites of wounding or pathogen attack, apparently in an effort to strengthen the wall at these sites of damage . Perhaps, lignin formation takes place in A. thaliana to resist drought stress.
In this study, we discovered that BiKLi and GSCope3 efficiently determined similar findings which Seki et al. had obtained . Moreover, the detection of novel information using BiKLi and GSCope3 revealed lignin formation in A. thaliana in order to resist drought stress.
Directed acyclic graphs of GO terms
The Gene Ontology (GO) Consortium provides directed acyclic graphs of GO terms. However, these graphs are very large, so we attempted to use the Plant GO slim data for the microarray data analysis. Additional File 3 shows a list of significant correlations between the "functional Class" and the clusters formed by BL-SOM of the microarray probes' expression profile during drought conditions. A strong correlation was discovered between the "transport (GO:0006810)" (Figure 4) and the gene expression profile which showed a strong peak after two hours of drought conditions (cluster ID is 8-2) (Additional File 3).
In this study, many water channel genes, classified as transports (GO:0006810), are connected to the expression profile, which increases the transcription until two hours after drought conditions started, then subsequently decreases the transcription. These water channel genes are plasma membrane intrinsic protein 2A (At3g53420), transmembrane protein (MIP family) (At4g00430), plasma membrane intrinsic protein 2C (At2g37180), putative plasma membrane aquaporin (At3g54820), plasma membrane intrinsic protein 1A (At3g61430), and plasma membrane intrinsic protein 1c (At1g01620). It is thought that transcription of these genes become active to oppose drought stress. Perhaps, the transcription peak of these genes is two hours after drought stress begins. It is considered that transcription of these genes decreases and returns to the usual state two hours after changing to drought conditions because the necessity for transcription is lost.
Another significant correlation between the "development (GO:0007275)" and gene expression profile, which increased transcription two hours after drought conditions began, was noted (cluster ID is 10-1) (Additional File 3). These genes are late embryogenesis abundant (LEA) protein family gene (At4g02380), a late embryogenesis abundant protein LEA-like gene (At5g06760), and non-apical meristem (NAM) family protein gene (At4g27410). Seki et al.  found that the LEA protein genes are induced by drought stress, which is supported by the fact that LEA proteins are involved in protecting macromolecules .
In this study, using BiKLi and GSCope3, we discovered the same findings which Seki et al. had obtained . The result was the activation of LEA protein genes by drought stress. Moreover, the novel information detected was the induced transcription of many water channel genes to oppose drought stress.
Gene position on genome
The analysis whether genes with similar expression profiles localizes at neighbourhoods on the genome or not is possible by using gene position data on the genome. Additional File 4 shows a list of significant correlations between the "functional Class" and the clusters formed by BL-SOM of the microarray probes' expression profile during drought conditions. A strong correlation was found between the area of 15,390,001–15,420,000 bp in chromosome 5 and the gene expression profile which decreased transcription after some time passed (cluster ID is 0-0) (Additional File 4). In this area, three ribulose 1,5-bisphosphate carboxylase/oxygenase (RuBisCO) small subunit genes exist . It is likely that these RuBisCO genes are co-regulated in the neighbourhood area because they have similar expression profiles and form the same enzymes (Figure 5).
The correlation between the area of 22,680,001–22,710,000 bp in chromosome 5 and the gene expression profile, which moderately increased the transcription after a period of time, was considered significant (cluster ID is 5-0) (Additional File 4). Heat shock protein 81.4 (hsp81.4) (At5g56000), heat shock protein 90 (perhaps hsp81.3) (At5g56010), and heat shock protein 81.2 (hsp81.2) (At5g56030) existed in this area and they were classified in the same cluster (Figure 6). It has been suggested that these genes have similar expression profiles and there is a possibility of a similar control affecting these genes (Figure 7), although Milioni and Hatzopoulos  already revealed that these genes are clustered within the 15 kb genomic region on chromosome 5.
Protein-protein interaction and A. thaliana gene expression
The information in protein-protein interaction is useful for examining the relationships between units of protein, and between ligand and receptor. In this study, only the relationship of the expressions between 20S proteasome genes was elucidated, specifically evident in A. thaliana. However, by using the information on protein-protein interaction in other organisms, the association of the expressions between the unclear genes in A. thaliana can be clarified. In this study, the details on protein-protein interaction on Caenorhabditis elegans, as well as the protein-protein interaction on Saccharomyces cerevisiae were built into BiKLi. It indicates that GSCope3 is capable of organizing the information on protein-protein interaction. In this study, we present the result on C. elegans protein-protein interaction (worm PPI) dataset.
Additional File 5 shows a list of significant correlations between the "functional Class" and the clusters formed by BL-SOM of the microarray probes' expression profile during drought conditions. A significant correlation was found between the "functional Class" which centred on the peptidase (T20F5.2, Y38A8.2) or proteasome component (C15H11.7, D1054.2) and the gene expression profile which showed a weak peak after two hours of drought conditions (cluster ID is 4-1). All probes included in these functional Classes and cluster 4-1 were cDNA clones of 20S proteasome subunit genes. Furthermore, most cDNA clones of 20S proteasome genes were included in cluster 6-1 or 4-1 (Figure 8, 9), which suggested that genes of 20S proteasome subunits have similar expression profiles and receive similar control. Exceptionally, the expression profiles of PAD1 (At3g51260), PBF1 (At3g60820), and PBG1 (At1g56450), which encoded for the α4, β6, and β7 subunits, respectively showed expression profiles different from the majority. It has been reported that the 20S proteasome α4 subunit of A. thaliana complements the 20S proteasome α4 subunit and α3 subunit of yeast , which suggests a complementary working relationship of α4 subunit with other subunits. Therefore, PAD1 possibly receives a different control, although most 20S proteasome subunit genes receive a similar control. However, further investigation is necessary to determine if the expression patterns of PAD1, PBF1, and PBG1 are different from the expression patterns of other 20S proteasome genes.
In this study, BiKLi and GSCope3 detected the novel information on 20S proteasome subunit genes showing similar expression profiles under drought stress.
Comparison between gene expression of A. thaliana and Z. elegans
A comparison between different microarray data can be facilitated by using GSCope3. In addition, if the genes between different organisms relate, a comparison of the microarray data between different organisms becomes possible.
Demura et al.  tried to compare gene expressions between Z. elegans and Poplar tremula × P. tremuloides clone T89, where the A. thaliana gene corresponded to the Z. elegans gene. Demura et al.  also examined the change in gene expression in the transdifferentiation of mesophyll cells into xylem cells in Z. elegans.
This experiment is quite difficult to perform on A. thaliana. In order to investigate the change in gene expression in the transdifferentiation of mesophyll cells into xylem cells in A. thaliana, comparison between Z. elegans and A. thaliana is practical. In this study, the comparison was made in order to confirm the relationship between gene expressions in the transdifferentiation of mesophyll cells into xylem cells and the response against drought stress. Furthermore, it presents an example of a feasible comparison of the microarray experiment using different organisms.
The gene expression profiles were divided into 24 subgroups. We compared the classification of the genes by BL-SOM using the microarray data of Seki et al.  with the genes based on the microarray data of Demura et al. .
Additional File 6 shows a list of the significant correlations between the "functional Class" and the clusters formed by BL-SOM of the microarray probes' expression profile during drought conditions. A significant correlation was discovered between the gene expression profiles with decreased transcription two hours after the drought condition started (cluster ID is 1-2) and the subclass with up-regulated genes during stage one (Additional File 6). Stage one corresponds to the functional dedifferentiation process during which mesophyll cells lose their photosynthetic capacity and acquire a new multidifferentiation potency . The genes included in the category are 5-methyltetrahydropteroyltriglutamate-homocysteine S-methyltransferase (At5g17920), hydroxymethyltransferase (At4g13930), and putative WD-40 repeat auxin-dependent protein ARCA (At1g48630). These genes may be key genes between drought stress response and transdifferentiation. Unfortunately, there are no novel discoveries because the A. thaliana gene and Z. elegans gene exhibit little association. If more A. thaliana genes associate to Z. elegans genes, then likeness and difference of gene expression mechanisms between transdifferentiation and stress response could be analyzed in detail.
It is proven in this study, through microarray experiment, that the relationship between gene expression and biological phenomenon is easily discovered by using the BiKLi and GSCope3. In order to speculate any relationship between a specific gene's expression pattern and the specific biological phenomenon, some researchers compare an individual gene with the biological phenomenon. However, this kind of comparison is not enough because one biological phenomenon is almost always caused by a combination of two or more genes.
In the analysis using the BiKLi and GSCope3, we use cross-tabulation in conducting statistical process. We divide the genes related to a specific biological phenomenon and other genes, and genes that have a specific expression profile and other genes. Thus, we can statistically detect the significance of a specific biological phenomenon and genes which have specific expression profiles. We therefore conclude that this analysis is more accurate.
Although we have re-analyzed the microarray data used in other analysis, we hope that when the new microarray data is assessed, many new findings may be revealed. As we look to a wider analysis in the future, we plan to enhance the content of the BiKLi.
Additionally, we are developing Genome-Phenome Superhighway (GPS)  and TraitMap system  which can use the BiKLi. The GPS has a biological network data and offers an environment where gene information can be retrieved. This network data contains biomolecular relationship data which is generated from co-citation frequencies of gene names and from arbitrary key-concept terms in literature. We plan to add the data in the BiKLi to the GPS in order to strengthen the GPS and TraitMap system in the future.
Construction of the biological knowledge library
KEGG type metabolic pathway dataset
We made a KEGG type metabolic pathway Omic Space Markup Language (OSML) file using Knowledge editor , which is referred to the KEGG database (Release 26.0)  and LIGAND database (Release 29.0) . The metabolic pathway data, especially the enzyme reactions, was constructed according to Michaels's expression . The KEGG type metabolic pathway OSML data was constructed by "functional Class" ("functional Class" is one of the elements in the OSML) , which illustrated each sub-pathway wherein the metabolic pathway is constructed by various sub-pathways. The enzyme and the metabolic compound were described as the "omicElement" ("omicElement" is one of the elements in OSML)  in the "functional Class". The enzyme code (EC) was used to show the enzyme in the metabolic pathway. For the purpose of connecting a gene and an enzyme, the use of an appendix file , wherein a gene corresponds to an enzyme in GSCope3, was acceptable. Therefore, we made an appendix file where the RAFL clone's code corresponded to EC in order to analyze A. thaliana microarray data. We made this appendix file by following this procedure: The information regarding the correspondence between the open reading frame (orf) of A. thaliana, which was shown as Arabidopsis Genome Initiative (AGI) code , and EC were acquired from the LIGAND database and AraCyc (version 2.1) . Then, the information regarding correspondence between the RAFL clone and AGI code was acquired from the RIKEN Arabidopsis Genome Encyclopaedia (RARGE) . An appendix file that showed the microarray probe corresponding to the enzyme was completed using this information. Moreover, we prepared an appendix file for the association between the AGI code and EC. The KEGG type metabolic pathway data set was composed of a KEGG type metabolic pathways OSML file (KEGG_pathway.xml), an appendix file which showed the RAFL clone's code referring to EC (RAFL_EC.apdx), and an appendix file which illustrated the AGI code corresponding to EC (AGI_EC.apdx) (Figure 1).
If revisions of the data are necessary in BiKLi, users can rewrite the OSML data of BiKLi. Moreover, we are planning to improve GSCope3 so that data of BiKLi can be modified by using Graphical User Interface (GUI).
AraCyc type metabolic pathway dataset
We made an AraCyc type metabolic pathway OSML file which referred to the AraCyc data (version 2.1) in the TAIR database . The enzymes and A. thaliana genes of each metabolic pathway were merely enumerated in the AraCyc data. The AraCyc type metabolic pathway OSML file was constructed by some "functional Classes" as the KEGG type metabolic pathway OSML file which demonstrated each sub-pathway. Therefore, the AraCyc type metabolic pathway OSML file was made so that the enzyme would be included in the "functional Class" as "omicElement". The AraCyc type metabolic pathway dataset was composed of an AraCyc type metabolic pathways OSML file (AraCyc_pathway.xml), an appendix file showing the RAFL clone's code matched to EC (RAFL_EC.apdx), and an appendix file that illustrating the AGI code corresponding to EC (AGI_EC.apdx) (Figure 1).
Directed acyclic graphs of GO terms dataset
Directed acyclic graphs of GO terms OSML file were made by referring to the flat files of gene ontology (GO), which were made by the GO consortium . The graphs were composed of terms from molecular function ontology, biological process ontology, and cellular component ontology. Moreover, compact directed acyclic graphs of GO terms ("GO slim") made by some research laboratories were transferred to directed acyclic graphs of GO terms data. We made an appendix file wherein the RAFL clone corresponded to a GO term by following these procedures: Information regarding the correspondence between AGI code and GO term were acquired from the gene ontology annotations data in the TAIR . An appendix file, which showed the microarray probe consistent with the GO term, was made using this information. Additionally, we prepared an appendix file to show the association between the AGI code and GO term. The directed acyclic graphs of GO terms dataset was composed of directed acyclic graphs of GO terms OSML files (Component.xml, Function.xml, Process.xml, Goslim_plant_2003.xml, etc.), an appendix file which presented the RAFL clone's code relating to the GO term (RAFL_GO.apdx), and an appendix file which showed the AGI code relating to the GO term (AGI_GO.apdx) (Figure 1).
Gene position on A. thaliana genome dataset
The information about the gene position on the A. thaliana genome was acquired from the MIPS Arabidopsis thaliana database (MAtDB) (version 110204) . The gene position OSML file has five "datasets" (a "dataset" is one of the elements in OSML)  corresponding to each of the five chromosomes. Each "dataset" was divided by some "functional Classes" corresponding to the area of 3,000,000 base pair (bp), 300,000 bp, and 30,000 bp. The "functional Class" demonstrated by the narrow area was described by the succession implication "part of" included in the "functional Class" pertaining to the large area. The gene position appendix file was made by describing the relation between the gene and the 30,000 bp area on the genome. The gene's position on the A. thaliana genome dataset was composed of an Arabidopsis genome OSML file (Arabidopsis_Genome.xml), an appendix file showing that the RAFL clone's code pointed to the gene position on the genome divided by 30,000 bp (RAFL_position.apdx) and an appendix file presenting the AGI code corresponding to the gene position on the genome divided by 30,000 bp (AGI_position.apdx) (Figure 1).
Yeast PPI dataset
Information about the Saccharomyces cerevisiae (yeast) protein-protein interaction (PPI) data was acquired from the paper "A comprehensive two-hybrid analysis to explore the yeast protein interactome" . The PPI information in the yeast PPI OSML was described as follows: It was assumed that the protein name was synonymous with the gene name, so the protein name was shown by the systematic orf name, e.g. YLR163C etc. One protein referred to one "functional Class" and was described as an "omicElement" in that "functional Class". Other proteins that interact with the protein were described as other "omicElements", which showed whether they are members in the "functional Class". The information regarding the association between a gene of yeast and A. thaliana was acquired from the Homologene database at the National Center for Biotechnology Information (NCBI) . The yeast gene was described with gene symbols offered in the Homologene database. Therefore, the Saccharomyces genome database (SGD)  was used to connect the gene symbol and the systematic orf name. The yeast PPI dataset was composed of a yeast PPI OSML file (yeast_PPI.xml), an appendix file which illustrated the RAFL clone's code matched to the yeast systematic orf name (RAFL_yeast_gene.apdx) and an appendix file which presented the AGI code corresponds matched to the yeast systematic orf name (AGI_yeast_gene.apdx) (Figure 1).
Worm PPI dataset
The Caenorhabditis elegans (worm) PPI data was acquired from the paper "A map of the interactome network of the metazoan C. elegans" . The PPI information was placed in a worm PPI OSML by the following procedure: It was assumed that the protein name was synonymous with the gene name, so the protein was shown by the cosmid based gene names, e.g. F07D10.1, etc. One protein corresponded to one "functional Class". The other proteins that interact with the protein were described as an "omicElement", which demonstrated whether they are members in the "functional Class". The information regarding the correspondence between a worm gene and A. thaliana was acquired from the Homologene database. The worm gene was described with gene symbols in data presented by the Homologene. Therefore, the WormBase  explained the correspondence between the gene symbol and the cosmid based gene name. The worm PPI dataset was composed of a worm PPI OSML file (Worm_PPI.xml), an appendix file matching the RAFL clone's code with the worm cosmid based gene name (RAFL_worm_gene.apdx), and an appendix file about the AGI code matched to worm cosmid based gene name (AGI_worm_gene.apdx) (Figure 1).
A. thaliana transcription dataset
An A. thaliana transcription OSML file was made according to the following procedure: The A. thaliana microarray data under drought stress was clustered by batch-learning self-organizing maps (BL-SOM)  (Figure 10). The first principal ingredient was set at eleven. An OSML file was made so that each of the 33 clusters corresponded to each "functional Class" and RAFL cDNA clones included in the same cluster were included in the same "functional Class" as the "omicElement". The genes with similar expression profiles in drought stress and in other conditions became possible by using this dataset. An A. thaliana transcription dataset was composed of an A. thaliana transcription during drought stress OSML file (Arabidopsis_drought_stress.xml) (Figure 1). In the future, we would intend to make more OSML files from other microarray data to enhance this dataset.
Z. elegans transcription dataset
Demura et al.  reported on the clustering data of a Zinnia elegans gene during transdifferentiation of mesophyll cells into xylem cells and the correspondence between data of Z. elegans genes and A. thaliana genes (AGI code). The Z. elegans genes were divided into 24 subgroups in the clustering. An OSML file was made so that these 24 subgroups could be set to "functional Class" and the AGI code corresponding to the Z. elegans gene included in this subgroup was set to "omicElement", which was contained in this "functional Class". The Z. elegans transcription dataset was composed of a Z. elegans transcription during transdifferentiation OSML file (Zinnia_transdifferentiation.xml) and an appendix file of the RAFL clone's code corresponding to AGI code (RAFL_AGI.apdx) (Figure 1).
We used the microarray data generated by Seki et al. . In the investigation that analyzed gene expression profiles, mRNAs were isolated from A. thaliana, which were subjected to dehydration, cold, or high-salinity stress treatments as "experimental" group and no treatments for the "reference". We used the data from the dehydration stress treatments experiment. They utilized a Stanford type microarray, which consists of 7000 cDNA sequences representing RIKEN A. thaliana full-length (RAFL) cDNA clones  isolated from full-length cDNA libraries . In this paper, the microarray data that was obtained was processed by the following methods: Background fluorescence was calculated on the basis of the fluorescence signal of negative control genes. Spots showing a signal value under the value of the fluorescence signal of negative control genes (+ standard deviation (SD)) in both the Cy3 ("reference" sample) and Cy5 ("experimental" sample) channels were not considered for analysis. Then each spot value of Cy5/Cy3 was calculated. To normalize the hybridization signals generated from different samples, external controls were used. Lastly, the log2 values of each spot and mean values of each probe were calculated.
Direction of GSCope3
A GSCope3 program produced by Toyoda et al. is used to analyze microarray data . GSCope3 is programmed in JAVA, therefore, it can be used in multiple platforms including Windows and Mac OS. We open this program to the public through a website  and academic users are able to freely download and use this program. GSCope3 can BL-SOM cluster the microarray data which is in cdt format . At present, the manual of GSCope3 is being made, however methods of using GSCope3 and microarray data analysis are described in the online tutorial of GSCope3 . Briefly, one OSML file in the BiKLi dataset is opened by GSCope3. Each sub-network in the dataset is expressed as a "functional Class". For example, each sub-pathway is categorized by "functional Class" in the metabolic pathway dataset and each GO term is classified by "functional Class" in the directed acyclic graphs of the GO terms dataset. Then, a suitable appendix file is opened by GSCope3. Each microarray probe is displayed by shape corresponding to each gene, protein, GO term, or gene position on the dataset by using an appendix file , in which the microarray probes are written and compared. Afterwards, if the microarray data in cdt  format or cdtx format  is opened by GSCope3, the association between the microarray data and microarray probes on the dataset (OSML and appendix file) is analyzed and the data value is displayed corresponding to the probe (Figure 2).
Functional class (network) ranking by GSCope3
After the probes are selected, the correlation of the probes and the "functional Class" can be examined in GSCope3. The ranking of the number of probes, which is analyzed by cross-tabulation using Fisher's test and included in the "functional Class", is classified into four groups: objective, not objective, selected, and not selected.
The "functional Class", which includes the selected probes, is ranked by GSCope3. Moreover, the degree of correlation between the selected probes and "functional Class" is revealed (Additional Files 1, 2, 3, 4, 5, 6). In this paper, we have examined the correlation between the probes of each cluster after BL-SOM clustering and each "functional Class".
Batch-learning self-organizing maps (BL-SOM) clustering by GSCope3 and the procedure for deciding the best classification value of BL-SOM clustering
BL-SOM is an improved method of the original SOM  with regard to the fact that the initial weight vectors are set by principal component analysis. The learning process is designed to be independent of the order of input vectors, hence, the result is reproducible . So, it is applicable to many fields of bioinformatics [51, 52].
The figure 10 shows the appearance of BL-SOM clustering of the microarray data. The first principal ingredient ID of BL-SOM cluster is shown as a number from 0 to 10 from the left horizontally. The second principal ingredient ID is the number from 0 to 3 vertically from the upper part. The cluster ID is described in the order of first principal ingredient ID then -second principal ingredient ID. For example, the section first from the left and second from the top is shown as cluster ID 0-1 (Figure 10).
To decide the best classification value of BL-SOM clustering, the following procedure was done: The microarray data was clustered by GSCope3 using the number of the first principal ingredient, ranging from three to twenty. The mean value of each cluster's average radius, which is the squared distances of the points from the centre of the cluster , and the explained variability  were calculated and plotted (Additional File 7). We decided that the eleven first principal ingredients were the best classification values because the mean value of each cluster's average radius was comparatively small and the explained variability was comparatively large.
Schena M, Shalon D, Davis RW, Brown PO: Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science. 1995, 270: 467-470.
Eisen MB, Brown PO: DNA arrays for analysis of gene expression. Methods Enzymol. 1999, 303: 179-205.
Luyf AC, de Gast J, van Kampen AH: Visualizing metabolic activity on a genome-wide scale. Bioinformatics. 2002, 18: 813-818. 10.1093/bioinformatics/18.6.813.
Dahlquist KD, Salomonis N, Vranizan K, Lawlor SC, Conklin BR: GenMAPP, a new tool for viewing and analyzing microarray data on biological pathways. Nat Genet. 2002, 31: 19-20. 10.1038/ng0502-19.
Doniger SW, Salomonis N, Dahlquist KD, Vranizan K, Lawlor SC, Conklin BR: MAPPFinder: using gene ontology and GenMAPP to create a global gene-expression profile from microarray data. Genome Biol. 2003, 4: R7-10.1186/gb-2003-4-1-r7.
Thimm O, Blasing O, Gibon Y, Nagel A, Meyer S, Kruger P, Selbig J, Muller LA, Rhee SY, Stitt M: MAPMAN: a user-driven tool to display genomics data sets onto diagrams of metabolic pathways and other biological processes. Plant J. 2004, 37: 914-939. 10.1111/j.1365-313X.2004.02016.x.
Karp PD, Krummenacker M, Paley S, Wagg J: Integrated pathway-genome databases and their role in drug discovery. Trends Biotechnol. 1999, 17: 275-281. 10.1016/S0167-7799(99)01316-5.
Nakao M, Bono H, Kawashima S, Kamiya T, Sato K, Goto S, Kanehisa M: Genome-scale gene expression analysis and pathway reconstruction in KEGG. Genome Inform Ser. Workshop Genome Inform. 1999, 10: 94-103.
Gene Ontology Consortium: Creating the gene ontology resource: design and implementation. Genome Res. 2001, 11: 1425-1433. 10.1101/gr.180801.
The Omic Space Markup Language Version 1.0. [http://omicspace.riken.jp/publications/osml.pdf]
GSCope 3 (OSML Editor). [http://omicspace.riken.jp/osml/index.html]
Ingram J, Bartels D: The molecular basis of dehydration tolerance in plants. Annu Rev Plant Physiol Plant Mol Biol. 1996, 47: 377-403. 10.1146/annurev.arplant.47.1.377.
Shinozaki K, Yamaguchi-Shinozaki K: Molecular responses to drought and cold stress. Curr Opin Biotechnol. 1996, 7: 161-167. 10.1016/S0958-1669(96)80007-3.
Shinozaki K, Yamaguchi-Shinozaki K: Gene expression and signal transduction in water-stress response. Plant Physiol. 1997, 115: 327-334. 10.1104/pp.115.2.327.
Bray EA: Plant responses to water deficit. Trends Plant Sci. 1997, 2: 48-54. 10.1016/S1360-1385(97)82562-9.
Fowler S, Thomashow MF: Arabidopsis transcriptome profiling indicates that multiple regulatory pathways are activated during cold acclimation in addition to the CBF cold response pathway. Plant Cell. 2002, 14: 1675-1690. 10.1105/tpc.003483.
Kreps JA, Wu Y, Chang HS, Zhu T, Wang X, Harper JF: Transcriptome changes for Arabidopsis in response to salt, osmotic, and cold stress. Plant Physiol. 2002, 130: 2129-2141. 10.1104/pp.008532.
Seki M, Narusaka M, Ishida J, Nanjo T, Fujita M, Oono Y, Kamiya A, Nakajima M, Enju A, Sakurai T, Satou M, Akiyama K, Taji T, Yamaguchi-Shinozaki K, Carninci P, Kawai J, Hayashizaki Y, Shinozaki K: Monitoring the expression profiles of 7000 Arabidopsis genes under drought, cold and high-salinity stresses using a full-length cDNA microarray. Plant J. 2002, 31: 279-292. 10.1046/j.1365-313X.2002.01359.x.
Kanaya S, Kinouchi M, Abe T, Kudo Y, Yamada Y, Nishi T, Mori H, Ikemura T: Analysis of codon usage diversity of bacterial genes with a self-organizing map (SOM). Gene. 2001, 276: 89-99. 10.1016/S0378-1119(01)00673-4.
Cushman JC, Bohnert HJ: Genomic approaches to plant stress tolerance. Curr Opin Plant Biol. 2000, 3: 117-124. 10.1016/S1369-5266(99)00052-7.
Hanson RS, Hanson TE: Methanotrophic bacteria. Microbiol Rev. 1996, 60: 439-471.
Lewis NG, Yamamoto E: Lignin: occurrence, biogenesis and biodegradation. Annu Rev Plant Physiol Plant Mol Biol. 1990, 41: 455-496. 10.1146/annurev.pp.41.060190.002323.
Vance CP, Kirk TK, Sherwood RT: Lignification as a mechanism of disease resistance. Annu Rev Phytopathol. 1980, 18: 259-288. 10.1146/annurev.py.18.090180.001355.
Shinozaki K, Yamaguchi-Shinozaki K: Molecular responses to drought stress. Molecular Responses to Cold, Drought, Heat and Salt Stress in Higher Plants. Edited by: Shinozaki K, Yamaguchi-Shinozaki K. 1999, Austin, TX: RG Landes, 11-28.
Niwa Y, Goto K, Shimizu M, Kobayashi H: Chromosomal mapping of genes in the RBCS family in Arabidopsis thaliana. DNA Res. 1997, 4: 341-343. 10.1093/dnares/4.5.341.
Milioni D, Hatzopoulos P: Genomic organization of hsp90 gene family in Arabidopsis. Plant Mol Biol. 1997, 35: 955-961. 10.1023/A:1005874521528.
Velichutina I, Connerly PL, Arendt CS, Li X, Hochstrasser M: Plasticity in eucaryotic 20S proteasome ring assembly revealed by a subunit deletion in yeast. EMBO J. 2004, 23: 500-510. 10.1038/sj.emboj.7600059.
Demura T, Tashiro G, Horiguchi G, Kishimoto N, Kubo M, Matsuoka N, Minami A, Nagata-Hiwatashi M, Nakamura K, Okamura Y, Sassa N, Suzuki S, Yazaki J, Kikuchi S, Fukuda H: Visualization by comprehensive microarray analysis of gene expression programs during transdifferentiation of mesophyll cells into xylem cells. Proc Natl Acad Sci USA. 2002, 99: 15794-15799. 10.1073/pnas.232590499.
Toyoda T, Wada A: Omic space: coordinate-based integration and analysis of genomic-phenomic interactions. Bioinformatics. 2004, 20: 1759-1765. 10.1093/bioinformatics/bth165.
Heida N, Hasegawa Y, Mochizuki Y, Hirosawa K, Konagaya A, Toyoda T: TraitMap: An XML-based genetic-map database combining multigenic loci and biomolecular networks. Bioinformatics. 2004, 20: i152-i160. 10.1093/bioinformatics/bth940.
Toyoda T, Konagaya A: KnowledgeEditor: a new tool for interactive modeling and analyzing biological pathways based on microarray data. Bioinformatics. 2003, 19: 433-434. 10.1093/bioinformatics/btf874.
Ogata H, Goto S, Sato K, Fujibuchi W, Bono H, Kanehisa M: KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 1999, 27: 29-34. 10.1093/nar/27.1.29.
Goto S, Nishioka T, Kanehisa M: LIGAND: Chemical database for enzyme reactions. Bioinformatics. 1998, 14: 591-599. 10.1093/bioinformatics/14.7.591.
Michal G, (Ed): Biochemical pathways. 1999, New York: John Wiley and Sons, Ins
GSCope Viewer Ver.1.4 User Manual. [http://gscope.gsc.riken.go.jp/eng_manual/manual-1.htm]
About MATDB: AGI gene codes. [http://mips.gsf.de/proj/thal/db/about/agicodes.html]
Mueller LA, Zhang P, Rhee SY: AraCyc: a biochemical pathway database for Arabidopsis. Plant Physiol. 2003, 132: 453-460. 10.1104/pp.102.017236.
Sakurai T, Satou M, Akiyama K, Iida K, Seki M, Kuromori T, Ito T, Konagaya A, Toyoda T, Shinozaki K: RARGE: a large-scale database of RIKEN Arabidopsis resources ranging from transcriptome to phenome. Nucleic Acids Res. 2005, 33: D647-50. 10.1093/nar/gki014.
Huala E, Dickerman AW, Garcia-Hernandez M, Weems D, Reiser L, LaFond F, Hanley D, Kiphart D, Zhuang M, Huang W, Mueller LA, Bhattacharyya D, Bhaya D, Sobral BW, Beavis W, Meinke DW, Town CD, Somerville C, Rhee SY: The Arabidopsis Information Resource (TAIR): a comprehensive database and web-based information retrieval, analysis, and visualization system for a model plant. Nucleic Acids Res. 2001, 29: 102-105. 10.1093/nar/29.1.102.
Schoof H, Zaccaria P, Gundlach H, Lemcke K, Rudd S, Kolesov G, Arnold R, Mewes HW, Mayer KF: MIPS Arabidopsis thaliana Database (MAtDB): an integrated biological knowledge resource based on the first complete plant genome. Nucleic Acids Res. 2002, 30: 91-93. 10.1093/nar/30.1.91.
Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, Sakaki Y: A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci USA. 2001, 98: 4569-4574. 10.1073/pnas.061034498.
Wheeler DL, Church DM, Lash AE, Leipe DD, Madden TL, Pontius JU, Schuler GD, Schriml LM, Tatusova TA, Wagner L, Rapp BA: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2001, 29: 11-16. 10.1093/nar/29.1.11.
Cherry JM, Adler C, Ball C, Chervitz SA, Dwight SS, Hester ET, Jia Y, Juvik G, Roe T, Schroeder M, Weng S, Botstein D: SGD: Saccharomyces Genome Database. Nucleic Acids Res. 1998, 26: 73-79. 10.1093/nar/26.1.73.
Li S, Armstrong CM, Bertin N, Ge H, Milstein S, Boxem M, Vidalain PO, Han JD, Chesneau A, Hao T, Goldberg DS, Li N, Martinez M, Rual JF, Lamesch P, Xu L, Tewari M, Wong SL, Zhang LV, Berriz GF, Jacotot L, Vaglio P, Reboul J, Hirozane-Kishikawa T, Li Q, Gabel HW, Elewa A, Baumgartner B, Rose DJ, Yu H, Bosak S, Sequerra R, Fraser A, Mango SE, Saxton WM, Strome S, Van Den Heuvel S, Piano F, Vandenhaute J, Sardet C, Gerstein M, Doucette-Stamm L, Gunsalus KC, Harper JW, Cusick ME, Roth FP, Hill DE, Vidal M: A map of the interactome network of the metazoan C. elegans. Science. 2004, 303: 540-543. 10.1126/science.1091403.
Stein L, Sternberg P, Durbin R, Thierry-Mieg J, Spieth J: WormBase: network access to the genome and biology of Caenorhabditis elegans. Nucleic Acids Res. 2001, 29: 82-86. 10.1093/nar/29.1.82.
Seki M, Narusaka M, Kamiya A, Ishida J, Satou M, Sakurai T, Nakajima M, Enju A, Akiyama K, Oono Y, Muramatsu M, Hayashizaki Y, Kawai J, Carninci P, Itoh M, Ishii Y, Arakawa T, Shibata K, Shinagawa A, Shinozaki K: Functional annotation of a full-length Arabidopsis cDNA collection. Science. 2002, 296: 141-145. 10.1126/science.1071006.
Seki M, Carninci P, Nishiyama Y, Hayashizaki Y, Shinozaki K: High-efficiency cloning of Arabidopsis full-length cDNA by biotinylated CAP trapper. Plant J. 1998, 15: 707-720. 10.1046/j.1365-313x.1998.00237.x.
Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA. 1998, 95: 14863-14868. 10.1073/pnas.95.25.14863.
GSCope3 Tutorial. [http://omicspace.riken.jp/osml/gscope3_1.1.htm]
Kohonen T: Self-organizing maps. 1995, Berlin: Springer
Abe T, Kanaya S, Kinouchi M, Ichiba Y, Kozuki T, Ikemura T: Informatics for unveiling hidden genome signatures. Genome Res. 2003, 13: 693-702. 10.1101/gr.634603.
Hirai MY, Yano M, Goodenowe DB, Kanaya S, Kimura T, Awazuhara M, Arita M, Fujiwara T, Saito K: Integration of transcriptomics and metabolomics for understanding of global responses to nutritional stresses in Arabidopsis thaliana. Proc Natl Acad Sci USA. 2004, 101: 10205-10210. 10.1073/pnas.0403218101.
Calinski T, Harabasz J: A dendrite method for cluster analysis. Communications in Statistics. 1974, 3: 1-27.
Gordon AD: Monographs on statistics and applied probability 82. 1999, Boca, Raton: Chapman & Hall/C.RC
The author(s) declare that they have no competing interests.
YH created and supervised the Biological Knowledge Library (BiKLi). YH analyzed the microarray data by using BiKli and GSCope3. YH is the overall author of this paper. MS, and KS provided the microarray data of Arabidopsis thaliana. They also reviewed and revised this paper. YM, and KH participated in the conception of Omic Space Markup Language (OSML) and designed a part of BiKLi, especially the KEGG type metabolic pathway. NH developed websites for the tutorial of GSCope3 and GSCope3. NO participated in the creation of OSML and developed GSCope3. TS, MS, KA, and KI provided the microarray data of A. thaliana and modified the data. KL created a part of BiKLi, especially the Directed acyclic graphs of GO terms. SK provided the program of Batch-leaning self-organizing maps (BL-SOM) clustering. TD provided microarray data of Zinnia elegans and the correlation to Zinnia elegans and A. thaliana genes. AK managed the writing of the manuscript. TT is the chief of the OSML creation. In addition, he participated in the design and coordination of the study, supervised in writing the manuscript, and wrote a part of the abstract and background.
Electronic supplementary material
Additional File 5: . Ranking result of significant correlations between the "functional Class" of the "Protein-protein interaction and Arabidopsis thaliana gene expression" and the clusters formed by BL-SOM of the microarray probes of expression profile under drought conditions (HTML 1 MB)
Additional File 6: . Ranking result of significant correlations between the "functional Class" of the "Comparison between gene expression of Arabidopsis thaliana and Zinia elegans" and the clusters formed by BL-SOM of the microarray probes of expression profile under drought conditions (HTML 90 KB)
Additional File 7: The relationship between the number of the first principal ingredient in the BL-SOM clusters and the index of clustering. The first principal ingredient = 3 to 20 (clustering 3 to 120) of BL-SOM clustering of the microarray data under drought stress and the index of clustering (the "mean value of each cluster's average radius" and "explained variability") is plotted. It is used for analysis because the cluster with X = 11, clustering = 33, separated enough. (JPEG 191 KB)
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
About this article
Cite this article
Hasegawa, Y., Seki, M., Mochizuki, Y. et al. A flexible representation of omic knowledge for thorough analysis of microarray data. Plant Methods 2, 5 (2006). https://doi.org/10.1186/1746-4811-2-5
- Gene Ontology
- Drought Stress
- Microarray Data
- Functional Class
- Similar Expression Profile