Maize microarray annotation database
© Coetzer et al; licensee BioMed Central Ltd. 2011
Received: 28 June 2011
Accepted: 1 October 2011
Published: 1 October 2011
Skip to main content
© Coetzer et al; licensee BioMed Central Ltd. 2011
Received: 28 June 2011
Accepted: 1 October 2011
Published: 1 October 2011
Microarray technology has matured over the past fifteen years into a cost-effective solution with established data analysis protocols for global gene expression profiling. The Agilent-016047 maize 44 K microarray was custom-designed from EST sequences, but only reporter sequences with EST accession numbers are publicly available. The following information is lacking: (a) reporter - gene model match, (b) number of reporters per gene model, (c) potential for cross hybridization, (d) sense/antisense orientation of reporters, (e) position of reporter on B73 genome sequence (for eQTL studies), and (f) functional annotations of genes represented by reporters. To address this, we developed a strategy to annotate the Agilent-016047 maize microarray, and built a publicly accessible annotation database.
Genomic annotation of the 42,034 reporters on the Agilent-016047 maize microarray was based on BLASTN results of the 60-mer reporter sequences and their corresponding ESTs against the maize B73 RefGen v2 "Working Gene Set" (WGS) predicted transcripts and the genome sequence. The agreement between the EST, WGS transcript and gDNA BLASTN results were used to assign the reporters into six genomic annotation groups. These annotation groups were: (i) "annotation by sense gene model" (23,668 reporters), (ii) "annotation by antisense gene model" (4,330); (iii) "annotation by gDNA" without a WGS transcript hit (1,549); (iv) "annotation by EST", in which case the EST from which the reporter was designed, but not the reporter itself, has a WGS transcript hit (3,390); (v) "ambiguous annotation" (2,608); and (vi) "inconclusive annotation" (6,489). Functional annotations of reporters were obtained by BLASTX and Blast2GO analysis of corresponding WGS transcripts against GenBank.
The annotations are available in the Maize Microarray Annotation Database http://MaizeArrayAnnot.bi.up.ac.za/, as well as through a GBrowse annotation file that can be uploaded to the MaizeGDB genome browser as a custom track.
The database was used to re-annotate lists of differentially expressed genes reported in case studies of published work using the Agilent-016047 maize microarray. Up to 85% of reporters in each list could be annotated with confidence by a single gene model, however up to 10% of reporters had ambiguous annotations. Overall, more than 57% of reporters gave a measurable signal in tissues as diverse as anthers and leaves.
The Maize Microarray Annotation Database will assist users of the Agilent-016047 maize microarray in (i) refining gene lists for global expression analysis, and (ii) confirming the annotation of candidate genes before functional studies.
Currently, there are several maize microarray platforms available, including an Affymetrix short oligonucleotide array , a Nimblegen 50-mer array , a 70-mer array from the University of Arizona Maize Oligonucleotide Array project  and the 60-mer Agilent-016047 Maize 4 × 44 K microarray .
The Agilent microarray platform  is a mature technology that yields high quality gene expression data, which can be readily analyzed using established statistical tools . The Agilent-016047 Maize 4 × 44 K microarray was custom-designed by the Walbot laboratory, with 42,034 in situ synthesized 60-mer oligonucleotide reporters (excluding controls) . Currently, the Agilent "e-array" tool  only provides the 60-mer sequences and EST accession numbers from which the reporters were designed, without detailed or up-to-date annotations. There was therefore a need to develop a strategy for annotation, and thereby build a database of annotations for this maize microarray, as well as similar custom arrays.
The maize B73 genome sequence was released in November 2009 , and this provided the opportunity to locate the reporters on the genome sequence, and provide functional annotations. Each reporter is intended to report the expression of a single gene unambiguously. However, since the reporters were designed from ESTs from different maize lines before a reference genome sequence was available , redundancy on the array, as well as imperfect reporter matches were expected.
Version 1 and Version 2 Agilent 22 K arrays [10, 11] were precursors for the 44 K Agilent-016047 array. Version 1 was designed from the December 2003 maize EST assembly of MaizeGDB and was made up of 21,782 reporters. More than 80% of these reporters were also included in Version 2 plus ~3,000 new reporters, designed from maize sequences in GenBank. Of the 20,963 gene features on Version 2, ~13,000 were sense strand reporters and ~5,000 antisense strand reporters.
For the Agilent-016047 44 K array, an updated set of 60-mer reporters were designed using Picky 2.0 . The reporter set mainly consists of validated reporters from the two precursor maize arrays described above and validated reporters from anther expressed genes detected using a spotted 70-mer array format (containing reporters to about 35,000 maize genes) [10, 13]. Additional gene reporters were based on release 16.0 of the TIGR Maize Gene Index as well as cDNA or EST sequences from GenBank (that were at the time not yet in the TIGR Maize Gene Index assembly) . According to Ma et al., the 42,034 maize gene reporters represent ~39,000 sense transcripts including a subset of genes with multiple reporters, and ~500 antisense transcripts. In addition to the 42,034 maize gene reporters, the array also contains internal quantitative 'spike-in' controls of non-maize sequences, which were not annotated in this study.
The aims of this study were to annotate the reporter set of the Agilent-016047 microarray by: (i) locating each reporter on the maize B73 genome sequence; (ii) associating each reporter to the transcript of a single gene, if possible; and (iii) assigning functional annotations to the gene represented by each reporter. Our results revealed that we could not associate all of the reporters with a single transcript with high confidence, and therefore we built a database http://MaizeArrayAnnot.bi.up.ac.za, which provides confidence scores of the genomic positions and functional annotations of reporters on the Agilent-016047 Maize array. Our annotation strategy provides guidelines for annotation of custom-designed microarray slides where partial EST information is available, and this resource will therefore be useful to maize researchers, and other researchers using custom arrays.
The gene list for the Agilent Maize Gene Expression Microarray 4 × 44 K (design ID 016047) was downloaded from Agilent's eArray tool  containing a reporter ID, a 60-mer reporter sequence and an EST accession number for each of the 42,034 reporters on the microarray. EST sequence information for 34% of the reporters was available on GenBank, and BioPython  was used to extract sequence and other relevant information from individual GenBank files. For an additional 31% of the reporters, EST sequences were obtained from the Walbot laboratory. For the remaining 35% of the reporters, no EST sequences were available, since these are likely to be derived from proprietary sources. The cDNA sequences (in FASTA format), their transcript start and end positions on the B73 RefGen v2 genome sequence as well as InterPro and Gene Ontology (GO) annotations for genes, were downloaded from the maizesequence.org FTP site . Only the protein coding transcripts in the B73 RefGen v2 Working Gene Set (WGS) were used (88,611 cDNAs representing 63,331 genes). We chose to use the WGS and not the Filtered Gene Set (FGS) since it was more inclusive of transcripts that could have been used in the reporter design. The FGS (63,540 transcripts; 39,656 genes) is a subset of the WGS in which transcripts that are "probable pseudogene", "possible transposon", "contamination" or "low confidence" have been filtered out.
The maize B73 RefGen v2 genome sequence (sequences of all 10 chromosomes, in FASTA format) were downloaded from the maizesequence.org FTP site . Lastly, the maize core bin markers  and corresponding B73 RefGen v2 base pair positions were retrieved from MaizeGDB . All sets of data were downloaded in December 2010/January 2011.
Three sets of BLASTN hits were stored for each reporter (Figure 1). Firstly, BLASTN of the reporter was implemented against the genome sequence (B73 RefGen v2), and the top BLASTN hit (if ≥ 23 contiguous matches), or multiple BLASTN hits (if ≥ 23 contiguous matches and identity ≥ 55/60) was called the 'reporter-gDNA result'. The reporters were also searched against the genome sequence using "exonerate"  to detect reporters that spanned introns. The parameters for exonerate are shown in Additional file 2. In cases where reporters had positive exonerate matches (and ≥ 23 contiguous matches) to the genome sequence, this result was recorded as the 'reporter-gDNA result'. Secondly, BLASTN of the reporter was implemented against the WGS transcripts, and the top BLASTN hit (if ≥ 23 contiguous matches), or multiple BLASTN hits (if ≥ 23 contiguous matches and identity ≥ 55/60) was called a 'reporter-WGS transcript result'. Thirdly, after confirming that the reporter matched its corresponding EST listed in the Agilent eArray database (≥ 23 contiguous matches), the top BLASTN hit (if E-value ≤ 1e-10), or multiple BLASTN hits (if E-value ≤ 1e-20) of the EST against the WGS transcript dataset was called an 'EST-WGS transcript result' (BLASTN parameters in Additional file 1). These BLASTN cut-offs were selected based on a previous study where the same cut-offs were used to align ESTs to predicted maize cDNAs . The ESTs were also searched against the gDNA using exonerate (parameters in Additional file 2), and matches with a normalised score of at least 3 (calculated by dividing the exonerate raw score by the query EST sequence length)  were recorded as the 'EST-gDNA result'.
The next step was to determine if there was agreement between the reporter-gDNA result and the reporter-WGS transcript result, in other words whether the WGS transcript that the reporter matched was derived from the same gDNA position that the reporter matched. This was recorded as the reporter-gDNA/WGS agreement result (Figure 1). Similarly, we tested whether the EST-gDNA result and EST-WGS transcript result were in agreement, and recorded this as the EST-gDNA/WGS agreement result (Figure 1).
Annotated by sense gene model: Reporters that match a WGS transcript and genomic location of the same gene model (single reporter-gDNA/WGS agreement result);
Annotation by antisense gene model: Reporters that match a transcript and genomic location of the same gene model, but align to the antisense direction of the transcript (single reporter-gDNA/WGS agreement result);
Annotation by gDNA: Reporters that match a unique location on the maize B73 genome, but this location is not currently annotated as a gene model (single genomic result);
Annotation by EST: Reporters that do not match a WGS transcript, but that are derived from an EST that matches a WGS transcript and its genomic location (single EST-gDNA/WGS agreement result);
Ambiguous annotation: Reporters with more than one sense gene model, antisense gene model or EST result (More than one reporter-gDNA/WGS transcript agreement or EST-gDNA/WGS transcript agreement result);
Inconclusive annotation: Reporters that match more than one transcript, but not the genomic location of the corresponding gene models. Reporters that match more than one genomic location, but no corresponding transcripts. Reporters with no valid hits.
The "reporter-WGS transcript result" for each reporter (described above) was used to assign a functional annotation to each reporter. The functional annotations for each of the 88,611 cDNA sequences in the WGS of the B73 RefGen v2 genome sequence were obtained by BLASTX  searches (with default parameter settings; Additional file 1) against the NCBI non-redundant peptide database (nr). The top three hits (and corresponding statistics) were stored in a relational database. Blast2GO  was used to associate each WGS transcript (and therefore the corresponding reporters) with Gene Ontology (GO) terms, using default settings.
The Maize Microarray Annotation Database interface was written using Turbogears , a Python web application framework. A central MySQL database is used to store sequence and annotation information. SQLAlchemy , an object relational mapper for Python and toolkit for SQL, is implemented within the Maize Microarray Annotation Database when a user queries the database.
An annotation file with the genomic positions for each reporter that could be matched to the genome was generated (Additional file 3). This can be uploaded to the MaizeGDB genome browser  and viewed as an annotation track in the context of the B73 RefGen v2 genome sequence.
Reporters "with measurable signal" were identified as those with a signal to noise ratio (SNR) > 3 in at least one of fifty Agilent-016047 microarrays hybridized with cDNA from maize leaves of a segregating population (data not shown).
Number of reporters placed in the genomic annotation groups of the maize Agilent-016047 microarray
Number of reportersa
Reporters with Signal/Noise > 3 (maize leaves)b
Proportion of reporters with Signal/Noise > 3c
sense gene model
antisense gene model
The reporters annotated by gene model in the sense or antisense direction represent 46.7% of the genes in the B73 maize Filtered Gene Set (FGS). Within this group, there were 4,330 reporters that aligned to the antisense direction of a gene model (Table 1). Natural antisense transcripts (NATs) contain sequences complementary to the sense transcripts of protein-coding genes . Between 7 and 30% of genes in animal and plant genomes encode overlapping cis-NATs . Many NATs are conserved, implying regulatory functions for these transcripts in gene expression. According to Ma et al., 14.3% of the pollen transcriptome consists of detectable antisense transcripts. It should be borne in mind that, in some cases, a reporter with an antisense annotation could in fact correspond to a sense transcript if the EST from which it was designed was incorrectly oriented or the genomic annotation was in the wrong strand. These errors are expected to be corrected in future annotation versions of the maize B73 genome sequence.
The 1,549 reporters "annotated by gDNA" (Table 1) represent reporters that had significant matches to the maize genome sequence but did not match current transcripts in the WGS of the maize B73 RefGen v2. These reporters will possibly be linked to gene models in future versions of the B73 genome due to improvements in gene prediction algorithms or availability of mRNA-seq data from different tissues of maize B73 plants. Although these reporters are not currently associated with functional annotations, their placement on the B73 genome sequence is useful for eQTL studies in maize. This category of reporters showed a lower proportion (64%) with measurable expression in maize leaves compared to reporters annotated by sense gene model (88%; Table 1).
The 3,390 reporters "annotated by EST" (Table 1) were derived from ESTs that showed sequence similarity to a WGS transcript from B73 (E-value < 1e-10), however the reporter itself did not have a significant hit to the WGS transcript. The reporters on the maize Agilent-016047 microarray were designed from ESTs from various maize lines (Additional file 4). Reporters "annotated by EST" are most likely derived from maize lines other than B73, although the source is not known for all reporters since this information could be retrieved for only 34% of the reporters (Additional file 4A). The region of the transcript that corresponds to the reporter is therefore predicted to be divergent between B73 and the line from which the reporter was derived.
The 2,608 "ambiguous" reporters (Table 1) each represent more than one gene model, which are mostly members of the same gene family. Interpretation of expression data from these reporters should be done with caution, as it is possible that the signal is due to cross hybridization from more than one family member. As an example, reporter A_92_P037799 represents four members of the cytochrome P450 gene family on chromosomes 2, 3, 6 and 8, as shown in the multiple sequence alignment (Additional file 5).
There were 6,489 reporters with "inconclusive annotation" (Table 1), and thus interpretation of expression data from these reporters should be made with caution. This group contained a relatively low proportion of reporters with signal in a maize leaf microarray experiment conducted in our laboratory (see Methods), namely 61% compared to 88% of reporters "annotated by sense gene model" (Table 1). Re-sequencing of six maize lines from China identified several hundred genes that were not present in B73, but could be annotated as plant proteins , and therefore it is possible that reporters with "inconclusive annotation" may represent transcribed genes from other maize lines. A subset of the reporters with inconclusive annotation and no hits against the B73 genome sequence had EST sequences available (1,727) and these were searched against GenBank using BLASTX. Only 892 had significant hits (E-value ≤ 1e-10) and 553 matched plant proteins.
Prior to our work, the reporters on the Agilent-016047 maize array could be visualized in the context of the B73 maize genome sequence at MaizeGDB  based on the Walbot laboratory annotations. However, there are several limitations of this annotation track, namely: (i) the positions given are based on RefGen v1, whereas the sequence is RefGen v2; (ii) the positions are based on MegaBLAST hits to the gDNA, but no matches to transcripts are given; (iii) the reporters are named using a UID which is different from the Agilent e-array ID; and (iv) three confidence categories are given, however some reporters have up to 500 hits. Therefore we have produced an updated annotation track that is compatible with MaizeGDB (Additional file 3) that reports the positions of all reporters on the array except those with inconclusive annotation or annotation by EST. An example of three reporters that match one gene model is shown in Additional file 6.
The Maize Microarray Annotation Database has an interactive web interface http://MaizeArrayAnnot.bi.up.ac.za/, providing the user with three main functionalities namely "Search Agilent slide", "BLAST sequences" and "Get sequences from GenBank" (Additional file 7). Most users are likely to use the "search Agilent slide" function, since they would be interested in downloading annotations for a list of reporters that are differentially expressed in a microarray experiment. In order to search the Agilent slide, the user can provide Reporter IDs, EST Accession numbers or gene names. The outputs from a query are reporter information, EST information, gene information, genomic and functional annotation information as well as the evidence for the annotation results. The following can be downloaded: DNA sequences (reporter, EST or WGS transcript sequences in FASTA format), a table with all annotation information, and/or multiple sequence alignments. Searching by WGS gene name makes it possible to see whether there is more than one reporter for a gene. On average, there are ~1.6 reporters per gene. Users can also retrieve nucleotide sequences by submitting GenBank accession numbers for ESTs, or BLAST sequence(s) against the Agilent slide to identify which reporters represent the query sequence best.
List of studies using the Agilent-016047 maize microarray
Criteria to define a 'Signal'
Ma et al. 
anther and pollen
> 24 K
< 18 K
3 out of 4 hybridization signals > background (99% confidence)
Casati et al. 
median reporter intensity > background
Skibbe et al. 
reporter intensity > 2.6 × SD of background
Wang et al. 
Rajhi et al. 
signal/noise > 3
We analysed expression data from hybridization of maize leaf cDNA from a segregating population to fifty Agilent-016047 arrays to assess the number of reporters with measurable signal in our hands. The data showed that ~32,000 reporters had a consistent signal to noise ratio (SNR) greater than 3, whereas ~10,000 reporters were deemed non-hybridizing to leaf transcripts (Table 2). These six studies demonstrate that in all cases a large proportion of the reporters on the Agilent-016047 arrays give measurable signals in tissues as diverse as anthers, leaves and roots.
Case studies using the Maize Microarray Annotation Database
Table (see Reference)
Genomic Annotation Groups#
Ma et al. 
Ma et al. 
Casati et al. 
Skibbe et al. 
Skibbe et al. 
Wang et al. 
Rajhi et al. 
Rajhi et al. 
We suggest that annotation of reporters with the Maize Microarray Annotation Database can be useful for refining lists of "differentially expressed" reporters for subsequent global analyses (e.g. GO enrichment using tools such as MADIBA ). In addition, the database is also essential to confirm the annotation of candidate genes identified from a microarray experiment before detailed functional analyses (e.g. gene knockouts) are carried out. To this end, we have provided, as Additional files 8, 9, 10, 11, 12, 13, 14 and 15, our annotations of the data tables from the case studies listed in Table 3.
The importance of correct annotation of microarrays is illustrated by the study of Gertz et al. who performed a similar analysis on the 44 K Agilent human expression arrays and found that many reporters had inconclusive annotations. Out of 42,683 reporters, 25,505 (60%) were considered 'fully valid' according to their analyses. In another study, an Agilent mouse 44 K array was re-annotated resulting in improved annotations for more than 10,000 reporters on the array . Furthermore, gene models are constantly being updated as new experimental and annotation data accumulates. Therefore re-annotation of reporters is required as illustrated by a study in which a dozen mammalian GeneChip arrays were re-annotated . This would be of particular importance in maize where the genome sequence was recently released  and is currently only at version 2 of annotation.
A reporter-by-reporter validation of the 4 × 44 K Agilent-016047 maize microarray was performed. In total, 71% of the reporters correspond to a transcript with a defined gDNA position and represent 46.7% of the genes in the B73 FGS. All results have been included in a database http://MaizeArrayAnnot.bi.up.ac.za/, which provides confidence scores of the genomic positions and functional annotations of reporters on the Agilent-016047 Maize array. The database facilitates interpretation of maize gene expression data. Scientists embarking on expression profiling in maize are likely to find this array an attractive option, since the combination of our annotation database with established analysis methods  facilitates data interpretation. In addition, our strategy can be applied when annotating any custom-designed array from a species for which the genome sequence is available.
The Maize Microarray Annotation Database is publicly available at http://MaizeArrayAnnot.bi.up.ac.za/.
Basic Local Alignment Search Tool
Expression Quantitative Trait Locus
Expressed Sequence Tag
Filtered Gene Set
Genomic Deoxyribonucleic Acid
Natural Antisense Transcript
Working Gene Set.
We thank the Walbot laboratory for kindly providing EST sequence information and their initial annotations. We thank an anonymous reviewer for constructive comment. The financial assistance of the Technology Innovation Agency (TIA)(South Africa) and the National Research Foundation (NRF)(South Africa) towards this research is hereby acknowledged. Opinions expressed and conclusions arrived at, are those of the authors and are not necessarily to be attributed to TIA or the NRF.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.