A high-density Diversity Arrays Technology (DArT) microarray for genome-wide genotyping in Eucalyptus

Background A number of molecular marker technologies have allowed important advances in the understanding of the genetics and evolution of Eucalyptus, a genus that includes over 700 species, some of which are used worldwide in plantation forestry. Nevertheless, the average marker density achieved with current technologies remains at the level of a few hundred markers per population. Furthermore, the transferability of markers produced with most existing technology across species and pedigrees is usually very limited. High throughput, combined with wide genome coverage and high transferability are necessary to increase the resolution, speed and utility of molecular marker technology in eucalypts. We report the development of a high-density DArT genome profiling resource and demonstrate its potential for genome-wide diversity analysis and linkage mapping in several species of Eucalyptus. Findings After testing several genome complexity reduction methods we identified the PstI/TaqI method as the most effective for Eucalyptus and developed 18 genomic libraries from PstI/TaqI representations of 64 different Eucalyptus species. A total of 23,808 cloned DNA fragments were screened and 13,300 (56%) were found to be polymorphic among 284 individuals. After a redundancy analysis, 6,528 markers were selected for the operational array and these were supplemented with 1,152 additional clones taken from a library made from the E. grandis tree whose genome has been sequenced. Performance validation for diversity studies revealed 4,752 polymorphic markers among 174 individuals. Additionally, 5,013 markers showed segregation when screened using six inter-specific mapping pedigrees, with an average of 2,211 polymorphic markers per pedigree and a minimum of 859 polymorphic markers that were shared between any two pedigrees. 
Conclusions This operational DArT array will deliver 1,000-2,000 polymorphic markers for linkage mapping in most eucalypt pedigrees and thus provide high genome coverage. This array will also provide a high-throughput platform for population genetics and phylogenetics in Eucalyptus. The transferability of DArT across species and pedigrees is particularly valuable for a large genus such as Eucalyptus and will facilitate the transfer of information between different studies. Furthermore, the DArT marker array will provide a high-resolution link between phenotypes in populations and the Eucalyptus reference genome, which will soon be completed.


Background
A number of molecular marker technologies have been developed and used for species of Eucalyptus in the last 20 years [1]. Each of these technologies allowed important advances in the understanding of the multifaceted genetics, evolution and breeding of this vast genus that includes over 700 species, some of which are globally important plantation forestry species [2]. Molecular markers have been used to resolve phylogenetic issues [3], describe the genetic structure of natural populations [4,5], solve questions related to the management of genetic variation in breeding populations [6] and build linkage maps [7][8][9] that in turn have led to the identification of QTLs for important traits [10][11][12][13]. Nevertheless, the genotyping density achieved even with technologies such as AFLP [14] remains at a few hundred markers per sample, and because AFLP is gel-based it is relatively labour-intensive. Multiplexing has allowed moderate-level throughput in microsatellite studies. However, the transferability of microsatellites across species is notoriously poor and needs to be investigated and optimized before microsatellites can be used in a new species [1]. Wider genome coverage and higher throughput genotyping methods are necessary to increase resolution and speed for a variety of applications. Diversity Arrays Technology (DArT) [15] provides a promising alternative to satisfy the requirements of throughput, genome coverage and transferability. DArT is a complexity reduction, DNA hybridization-based method that simultaneously assays hundreds to thousands of markers across a genome. DArT preferentially targets low-copy genomic regions, allows automation of data acquisition and is cost competitive. Although developed some years ago, this marker technology has recently gained increasing attention [16][17][18][19][20].
We report the development of the first version of a high density operational DArT genotyping microarray with over 7,000 markers and demonstrate its potential for diversity and linkage mapping studies in species of Eucalyptus across the two most important subgenera.

Results and Discussion
This paper describes the various steps that were taken in developing the eucalypt DArT array (Figure 1). The first step was to find a successful method for reducing genome complexity. Once this was done, a prototype microarray was developed and tested. The DArT array was subsequently expanded and again tested for redundancy. The final step was to validate the operational microarray for genome-wide genotyping in Eucalyptus.

Genome complexity reduction
The first necessary step in the development of DArT markers (Figure 1) is choosing a genome complexity reduction method (see http://www.diversityarrays.com/molecularprincip.html). The DArT genome complexity reduction method is based on restriction enzyme (RE) digestion of total genomic DNA, adapter ligation and amplification of adapter-ligated fragments. DNA extraction was done with a CTAB protocol [21]. Seven methods of genome complexity reduction were tested for their performance in Eucalyptus (Additional File 1). DNA samples were prepared by digestion with the rare-cutting PstI RE as a primary cutter in combination with a frequently cutting enzyme (TaqI, BstNI, MspI, HpaII, BanII, MseI or AluI) as a secondary cutter. PstI is sensitive to CpG methylation, thereby excluding heavily methylated repetitive DNA from the representation. Adapters complementary to the "sticky ends" of the fragments generated by PstI digestion were ligated (protocol modified slightly from the original [15,16]) to allow PCR amplification of only the PstI fragments that had not been cut with the secondary enzyme. A desirable genome complexity reduction method will produce a smear of products with few or no distinct bands when representations are visualised on agarose gels following electrophoresis. Strong banding indicates the amplification of repetitive sequences and such representations are unsuitable for DArT development [22]. The genomic representations produced by digestion with PstI in combination with either TaqI (PstI/TaqI) or BstNI (PstI/BstNI) were selected for further testing (Figure 1, Additional File 1).

Test screening of clones for polymorphic DArT markers
The second step (Figure 1) entailed the construction of small genomic libraries for each of the selected complexity reduction methods and the screening of the resulting DNA clones (probes) to reveal polymorphic markers. For library construction, two sets of pooled DNA samples were utilized separately: the first from 12 E. grandis and the second from 12 E. globulus trees. Each pooled sample was digested with both enzyme combinations: PstI/TaqI and PstI/BstNI. Four testing libraries were generated, each with 384 randomly picked clones, giving a total of 1,536 DArT clones to be screened for polymorphism. The cloned DNA fragments were printed onto glass slides for the first test array in duplicate (randomly positioned within the array), as is normally done for DArT. Genomic representations of each of the 12 E. grandis and 12 E. globulus individuals were prepared to generate 'targets' that were hybridized to the arrays. For each species, the 12 genotypes were assayed with two technical replicates per genotype. Each target was labeled with either a green fluorescent dye (Cy3-dUTP) or a red fluorescent dye (Cy5-dUTP), and then mixed with a blue fluorescently-labeled polylinker from the vector used for cloning the DNA fragments in the libraries. The polylinker provided a reference value for the quantity of amplified DNA fragment present in each 'spot' of the microarray, as well as an in-built quality control for spots on the microarrays. This mixture was hybridized to a 1,536-clone microarray that was scanned for blue, green and red fluorescence, and data were extracted using DArTSoft version 7.44. DArTSoft localizes the individual spot features of the microarrays and then compares the relative intensity values (blue versus green and blue versus red) obtained for each clone across all slides/targets to detect the presence of clusters of higher and lower values corresponding to marker scores of '1' (high) and '0' (low) respectively.
The quality parameters used in this study were: Call Rate (percentage of targets that could be scored as '0' or '1') and Reproducibility value (reproducibility of scoring between replicated target assays) [16]. The results of the DArTSoft analysis for the two arrays prepared using DNA clones derived from either PstI/TaqI or PstI/BstNI digestions were compared with regard to the frequency of clones revealing polymorphic DArT markers. The criteria used for declaring a clone as revealing a polymorphic marker were Reproducibility > 97% and Call Rate > 80%. From the analysis of the two species hybridized in duplicate to the two arrays, the complexity reduction method using PstI/TaqI was found to yield a higher proportion (21.7%) of candidate polymorphic markers according to the above criteria compared to the PstI/BstNI method (14.3%).
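The two thresholds above can be illustrated with a minimal Python sketch. It is not the DArTSoft implementation: the function names, the use of `None` for an unscored target, and the choice to compute Reproducibility only over replicate pairs in which both assays were scored are assumptions made for illustration.

```python
from typing import List, Optional

Score = Optional[int]  # 1, 0, or None for an unscored target (assumed encoding)


def call_rate(scores: List[Score]) -> float:
    """Percentage of target assays that could be scored as 0 or 1."""
    scored = sum(s is not None for s in scores)
    return 100.0 * scored / len(scores)


def reproducibility(rep_a: List[Score], rep_b: List[Score]) -> float:
    """Percentage of replicate pairs with identical scores.

    Assumption: only pairs in which both replicates were scored are compared.
    """
    pairs = [(a, b) for a, b in zip(rep_a, rep_b)
             if a is not None and b is not None]
    if not pairs:
        return 0.0
    return 100.0 * sum(a == b for a, b in pairs) / len(pairs)


def is_candidate_marker(rep_a: List[Score], rep_b: List[Score],
                        min_call: float = 80.0,
                        min_repro: float = 97.0) -> bool:
    """Apply the thresholds from the text: Call Rate > 80%, Reproducibility > 97%."""
    all_scores = list(rep_a) + list(rep_b)
    observed = {s for s in all_scores if s is not None}
    polymorphic = observed == {0, 1}  # both score classes must be present
    return (polymorphic
            and call_rate(all_scores) > min_call
            and reproducibility(rep_a, rep_b) > min_repro)
```

For example, a clone scored identically in both replicates of ten targets, with both '0' and '1' present, passes; a clone with 40% unscored targets fails on Call Rate.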

Prototype Eucalyptus DArT microarray
The PstI/TaqI genome complexity reduction method was used in the development of the prototype Eucalyptus DArT microarray (Figure 1). The initial test array, with 1,536 clones, was expanded by picking an additional random set of 14,592 discovery DArT clones, this time derived from a total of 14 libraries ( were used for library construction resulting in a broader sample of DNA sequences, therefore increasing the probability of sampling genomic segments that could reveal polymorphic markers across a wider range of genetic backgrounds [16]. A total of 16,128 clones were printed twice on each slide and were hybridized with DNA from each of 284 individuals ("targets"; Table 2) representing eight different species with replication, following the methods described above. The results were analyzed with DArTSoft and assessed using the threshold criteria of Reproducibility > 97% and Call Rate > 80%. This analysis revealed 7,677 clones (47.6%) as robust polymorphic markers (Table 3). The Call Rate average was 95.3% and the Reproducibility average was 99.7% (this value was calculated on the basis of duplicate genotyping assays for all test samples).
Testing Corymbia targets on the array composed primarily of Eucalyptus probes (and vice versa) showed very clearly that the overall array signal of Corymbia targets was low and uncorrelated to signal from Eucalyptus species (and vice versa). Because of this poor transferability across genera, we abandoned the development of DArT for Corymbia.
As clones used to build an array are picked at random from the libraries, clone redundancy (i.e. DNA fragments with the same or very similar/overlapping sequence) is an issue. Redundancy of the polymorphic DArT clones was evaluated with the software package DArT ToolBox (http://www.diversityarrays.com/) by constructing a Hamming distance matrix between clones, followed by distance binning, in which all clones with zero distance were placed into the same bin. This was done using the 284 samples used as targets listed in Table 2. This estimation of clone redundancy based on similar score pattern enabled the selection of unique or low-redundancy clones prior to the availability of sequence information for the clones. The redundancy estimation based on distance binning of the 7,677 polymorphic markers resulted in 2,652 unique bins, i.e. 34.5% non-redundant marker scoring patterns (Table 4). With a limited number of effective scores for calculating the distance matrix for markers and a clear genetic structure in the materials used for initial marker discovery, there was a high likelihood of unique sequences being grouped into a single bin, especially in large bins. Therefore, a total of 4,608 clones were selected for re-arraying, keeping approximately 30% of the potentially redundant markers, with frequency of retention proportional to the bin size. In order to verify the redundancy estimation, we sequenced re-arrayed clones that belonged to nine bins that had at least 30 clones. Sequencing results revealed that on average 53% of the DArT clones in these large bins represented unique DNA sequences. Binning results were therefore, as anticipated, conservative and yielded an overestimation of redundancy (Table 5).
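The distance binning described above can be sketched in Python as follows. This is an illustration of the principle only and not the DArT ToolBox code: the function names and the input layout (a dictionary of clone identifiers mapped to fully scored 0/1 patterns) are assumptions. Because two clones have a Hamming distance of zero exactly when their score patterns are identical, grouping clones by pattern gives the same bins as building the full distance matrix.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of targets at which two clones are scored differently."""
    if len(a) != len(b):
        raise ValueError("score patterns must cover the same targets")
    return sum(x != y for x, y in zip(a, b))


def bin_by_zero_distance(scores: Dict[str, Sequence[int]]) -> List[List[str]]:
    """Place all clones with identical score patterns (distance 0) in one bin.

    Assumption: every clone is scored for every target (no missing data).
    """
    bins: Dict[tuple, List[str]] = defaultdict(list)
    for clone_id, pattern in scores.items():
        bins[tuple(pattern)].append(clone_id)
    return list(bins.values())


def nonredundant_fraction(scores: Dict[str, Sequence[int]]) -> float:
    """Number of bins divided by number of clones."""
    return len(bin_by_zero_distance(scores)) / len(scores)
```

With the figures reported above, 2,652 bins among 7,677 polymorphic clones corresponds to a non-redundant fraction of about 0.345. As the sequencing check shows, identical score patterns do not imply identical sequences, so this fraction is a lower bound on the proportion of unique clones.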

Interim and operational Eucalyptus DArT microarrays
In order to enrich the Eucalyptus DArT array for polymorphic markers, four additional genomic libraries were constructed that provided a total of 7,680 new clones that were screened for polymorphism (Table 6). Two of these libraries contained DNA from 62 eucalypt species and were built by pooling equimolar DNA quantities from one individual of each species and cutting either with PstI or PstI/TaqI. The PstI representation allowed markers that were present at low frequency in the PstI/TaqI representation to be cloned and therefore minimized redundancy in the final clone set. Most species (56) were from subgenus Symphyomyrtus (representing 14 of the 15 sections and missing only a minor one); the other species were from three other subgenera (Alveolata, Eucalyptus, and Minutifructus). Screening these new libraries for polymorphism (Figure 1) was performed using a set of 190 individuals from seven different Eucalyptus species (E. grandis, E. urophylla, E. camaldulensis, E. globulus, E. dunnii, E. pilularis and E. nitens) with targets in full replication (Table 7). DArTSoft and DArT ToolBox were used to identify robust markers and to estimate redundancy as described for the first array (with the same parameters and thresholds). DArTSoft detected 5,653 polymorphic markers among the 7,680 clones (73.6%). The average Call Rate and Reproducibility were similar to those of the first array, at 93.7% and 99.7% respectively. However, a significantly higher proportion of clones was polymorphic than in the prototype array (73.6% versus 47.6%; Table 3), most likely due to the greater genetic diversity that was captured in the genomic representations from the four new libraries. A consolidated analysis of redundancy based on binning was carried out to minimize redundancy between the 7,677 polymorphic clones selected initially for the prototype microarray and the additional 5,653 clones. From a total of 13,300 clones, 4,051 bins were found in the interim array (Tables 3 and 4).
On the basis of polymorphism analysis and the additional redundancy assessment, 1,920 new clones were selected from the 7,680 to create a second re-arrayed library. The two re-arrayed libraries (the first one with 4,608 clones and the second with 1,920 clones) were supplemented with 1,152 clones developed primarily from a genomic library of BRASUZ1, the Eucalyptus grandis tree whose genome is being sequenced (Table 6), to constitute an operational DArT genotyping array for Eucalyptus with 7,680 markers. As expected, not all the 7,680 clones were found to yield polymorphic markers since the 174 samples assayed did not represent the total genetic diversity used to construct the array. As a second validation, an assessment of DArT marker segregation and rate of polymorphism was carried out with 94 samples in full replication, including 15-16 samples from each of six mapping pedigrees. Most of these individuals were not used in library construction and represented a test of the level of polymorphism that could be expected in diverse linkage mapping experiments. There were 2,211 polymorphic markers per pedigree on average (Table 8). The number of shared polymorphic markers (polymorphic in two pedigrees) amongst the six mapping pedigrees varied from a minimum of 859 to a maximum of 1,328 (Table 8). A total of 5,013 markers (65.3%) out of the 7,680 clones showed segregation within at least one mapping population when data from the six pedigrees were consolidated (Table 9). Table 9 shows the number of DArT markers that were polymorphic in one pedigree only (1,154 markers or 23%) through to those that were polymorphic in an increasing number of pedigrees, up to all six pedigrees (150 markers or 3%).
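The consolidation across pedigrees summarized in Tables 8 and 9 amounts to simple set operations. The Python sketch below shows one way such a summary could be computed; the function name, the input layout (one set of segregating marker identifiers per pedigree) and the toy pedigree labels are hypothetical and are not taken from the study.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Set, Tuple


def sharing_summary(polymorphic: Dict[str, Set[str]]):
    """Summarize marker sharing across mapping pedigrees.

    polymorphic maps a pedigree name to the set of markers segregating in it.
    Returns:
      total     number of markers segregating in at least one pedigree
      by_count  {k: number of markers polymorphic in exactly k pedigrees}
      pairwise  {(ped_a, ped_b): number of markers polymorphic in both}
    """
    # Count, for every marker, in how many pedigrees it segregates.
    per_marker = Counter(m for markers in polymorphic.values() for m in markers)
    by_count = dict(Counter(per_marker.values()))
    pairwise: Dict[Tuple[str, str], int] = {
        (a, b): len(polymorphic[a] & polymorphic[b])
        for a, b in combinations(sorted(polymorphic), 2)
    }
    return len(per_marker), by_count, pairwise


# Toy example with three hypothetical pedigrees.
toy = {
    "ped1": {"m1", "m2", "m3"},
    "ped2": {"m2", "m3", "m4"},
    "ped3": {"m3", "m5"},
}
total, by_count, pairwise = sharing_summary(toy)
```

In the terms used above, `total` corresponds to the 5,013 markers segregating in at least one pedigree, `by_count` to the breakdown from one pedigree (1,154 markers) to all six (150 markers), and the minimum and maximum of `pairwise` to the range of 859 to 1,328 shared markers.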

Conclusions
This eucalypt DArT array is one of the best performing DArT arrays yet developed (DArT Pty Ltd, unpublished results). The high frequency of polymorphic markers is likely to be a function of the high level of sequence variation in the Eucalyptus genome [23] and, to a much lesser extent, a function of its relatively small genome size and low proportion of repetitive DNA [1]. Interestingly, the high level of sequence diversity in Eucalyptus species [23] could be a serious impediment to the development of highly multiplexed SNP platforms that usually require reasonably long stretches of sequence without secondary SNPs. It may prove challenging to find good targets for SNP assay design which would be invariable across a range of Eucalyptus species. In this context, DArT analysis is not constrained by high sequence polymorphism and is therefore very suitable for genotyping thousands of  genetic markers in highly outbred organisms such as Eucalyptus.
DArT generated a substantially larger number of robust polymorphic markers for Eucalyptus species than previous technologies. Although co-dominant microsatellites are significantly more informative at the single-locus level, they are low-throughput and expensive per data point. Comparing DArT with RAPD or AFLP analysis would be more appropriate as they are all dominant markers. The complicating issue, however, is the ascertainment bias that takes place when selecting RAPD primers, AFLP primer/enzyme combinations or DArT polymorphic probes. This bias is exacerbated by the specific target population that is used when selecting polymorphisms and by the rigor of the experimenter in declaring these polymorphisms. It is important to note that the DArT array developed in this study provides at least two orders of magnitude more polymorphic markers in a single assay than RAPD or AFLP analysis. In Eucalyptus, while a selected RAPD primer can provide up to 10 robust polymorphic bands in a single gel run and a selected AFLP combination can provide on average 30 to 40 polymorphic markers, a single DArT assay provides 1,000 to 4,000 polymorphic markers from the 7,680 probes present on the current array. In addition, the standard probe set selected for routine DArT genotyping allows comparisons of markers across a range of species and populations, while both AFLP and RAPD markers are much less amenable to integration across laboratories and even less so across different species.
The high level of DArT marker multiplexing was validated in a large collection of eucalypt species and individuals. The results indicated that the DArT genotyping array will deliver thousands of polymorphic markers for population diversity studies and provide a very efficient platform with which to generate high-density linkage maps with a substantial proportion of markers shared across pedigrees. This array will be especially useful for applications that benefit from access to a large number of markers. The cost per data point (per sample per marker) will of course depend on the application and the facility generating the data. Using the fully costed service provided by the technology development partner, DArT Pty Ltd, the cost per data point for polymorphic markers is expected to vary between one and five cents US (assuming an assay cost per sample of 50 USD, not counting shipping and DNA extraction costs). In linkage mapping studies, an application where one of the lowest degrees of polymorphism is expected because diversity comes essentially from only two parents, we expect that a minimum of 1,000 polymorphic markers could be mapped at a cost of approximately five cents US per polymorphic marker. The per-sample cost is much lower than that of current SNP genotyping platforms assaying an equivalent number of markers (e.g. Illumina GoldenGate). The in-house use of DArT arrays would involve purchasing the equipment necessary to spot high-density arrays, hybridization chambers and a multi-color scanner, and therefore would require a very high throughput operation to make such investment worthwhile. Another significant advantage of the DArT markers is their transferability across species, which is particularly valuable when dealing with a genus like Eucalyptus with over 700 species, of which many are foundation species in their forest ecosystems, and several are commercially useful in either temperate or sub-tropical regions of the world.
This transferability will allow the detailed comparison of linkage maps and QTL positions across studies. However, this transferability appears to have limits as we obtained poor transferability across eucalypt genera (Corymbia to Eucalyptus). We will address the phylogenetic consequences of this finding and the performance of the DArT array across the full range of Eucalyptus species in a related study (Steane et al. submitted).
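The cost range quoted above (one to five cents US per data point) follows directly from dividing the assumed assay cost of 50 USD per sample by the number of polymorphic markers obtained. The short Python sketch below reproduces that arithmetic; the function name is hypothetical, and the marker counts are the ones reported in this paper.

```python
def cost_per_marker_cents(assay_cost_usd: float, n_polymorphic: int) -> float:
    """Cost per polymorphic data point, in US cents, for one sample."""
    if n_polymorphic <= 0:
        raise ValueError("number of polymorphic markers must be positive")
    return 100.0 * assay_cost_usd / n_polymorphic


ASSAY_COST_USD = 50.0  # per-sample cost assumed in the text

# Linkage mapping lower bound: 1,000 polymorphic markers.
mapping_low = cost_per_marker_cents(ASSAY_COST_USD, 1000)
# Average per mapping pedigree: 2,211 polymorphic markers.
mapping_avg = cost_per_marker_cents(ASSAY_COST_USD, 2211)
# Upper end of the single-assay range: 4,000 polymorphic markers.
diversity_high = cost_per_marker_cents(ASSAY_COST_USD, 4000)
```

Under these assumptions the cost is 5 cents per marker at 1,000 markers, about 2.3 cents at the pedigree average of 2,211 markers, and 1.25 cents at 4,000 markers, which brackets the range given above.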
A limitation of DArT markers compared to multi-allelic microsatellites is their dominant inheritance, which precludes studying aspects of within-individual variation, although methodologies are being developed that can mitigate this [24]. Dominant markers are also less informative for constructing linkage maps, unless a large number of them are available and population sizes are large, in which case they can be as useful as co-dominant markers. Finally, clustering of DArT markers across the genome could potentially be an issue due to the reduced representation method by which DArT probes are developed. However, this is not exclusive to the DArT technology and an assessment of this will only be possible by linkage mapping DArT markers in multiple pedigrees and/or physically mapping them on the upcoming Eucalyptus reference genome.
To better characterize the genomic content of this array, all 7,680 DNA clones on the operational DArT array are being sequenced. The availability of DNA sequences for the DArT markers will facilitate the integration of high-density maps and QTL locations with the Eucalyptus genome assembly. The operational DArT array constitutes a powerful tool with which to undertake high resolution genetic analyses required for applications such as fine QTL mapping, genome-wide selection and complex phylogenetic and evolutionary investigations.

Methods
For the development of the Eucalyptus DArT microarray, DNA samples from many different species and provenances were used both in the prototype and technology development steps (Tables 1, 2, 6 and 7). DNA was extracted from either fresh leaf tissue or bark cambium in three different laboratories (Australia, South Africa, Brazil), all using a CTAB protocol [21]. DNA quality was checked on agarose gels by running DNA digested with the restriction enzyme HindIII alongside undigested DNA to confirm that (1) undigested DNA formed a tight band of high molecular weight without RNA contamination and (2) fully digested DNA formed a smear of mid- to low molecular weight. DNA concentration was adjusted to 50-100 ng/μL, targeting a concentration of 75 ng/μL.

Methods of genome complexity reduction to generate genomic representations
Digestion and adapter ligation were performed simultaneously on 75 ng of genomic DNA in a 10 μL aqueous solution containing 2 Units of each restriction enzyme, 80 Units of T4 DNA Ligase and 0.05 μM adapter (5'-CAC GAT GGA TCC AGT GCA-3' annealed with 5'-CTG GAT CCA TCG TGC A-3'). Reactions were incubated at 37°C for 2 hours, followed by 2 hours at 60°C as required by the enzyme combinations. 1 μL of digestion/ligation reaction product was used as a template for PCR amplification in a 50 μL reaction using DArT PstI primer (5'-GAT GGA TCC AGT GCA G-3') with the following cycling parameters: 94°C for 1 min, followed by 30 cycles of 94°C for 20 sec, 58°C for 40 sec, 72°C for 1 min, and finished with an extension at 72°C for 7 min. Initial assessment of the tested methods was performed by resolving 5 μL of amplification product in a 1.2% agarose gel stained with ethidium bromide.
Table 9: Informativeness of DArT markers from the operational array for genetic mapping based on sampling six different pedigrees (see Table 8).

Construction of small clone DArT libraries
The genomic representations of each species/complexity reduction method combination were pooled and cloned using the TOPO TA Cloning Kit (Invitrogen) according to the manufacturer's instructions. Individual bacterial colonies were picked into 384-well plates containing LB medium with 4.4% glycerol, 100 μg/mL ampicillin and a mixture of salts to facilitate PCR from the LB cultures (unpublished observation) and grown at 37°C for 18 hours. A PCR amplification was performed using 0.5 μL of bacterial culture template, 0.2 μM "M13 Forward" and "M13 Reverse" primers (Invitrogen), and the following PCR program: 95°C for 4 min, 57°C for 35 sec, 72°C for 1 min, followed by 35 cycles of 94°C for 35 sec, 52°C for 35 sec and 72°C for 1 min, and a final step of 72°C for 7 min. The PCR products were dried at 37°C and washed with 70% ethanol before being dissolved in "DArTspotter" spotting buffer, designed for use with poly-L-lysine coated microarray slides (Wenzl et al. in preparation, available from DArT Pty Ltd). Arrays were spotted using a MicrogridII arrayer (Biorobotics) on poly-L-lysine coated glass microarray slides (Erie Scientific). Slides were aged on the bench for 24 hours before being immersed in Milli-Q water at 95°C for 2 min to denature the DNA spotted onto the slides, then in Milli-Q water with 0.1 mM DTT and 0.1 mM EDTA at 20°C, and finally dried by centrifugation at 500 × g for 7 min and vacuum desiccation for 30 min.

Fluorescent labeling of genomic representations
Genomic representations of the 12 samples each of E. grandis and E. globulus were prepared as described above for library construction, to generate 'targets' for hybridizing to the arrays. The products of amplification were precipitated individually with isopropanol, washed with 70% ethanol and air dried at room temperature for 12 hours. For each species, the 12 genotypes were assayed with two replicates per genotype. Targets were labeled in a 10 μL reaction volume with 2.5 nM of Cy3-dUTP or Cy5-dUTP (Amersham Bioscience), 2.5 units of Klenow exo- fragment of E. coli Polymerase I (New England Biolabs) and 25 μM random decamers in 1 × NEB Buffer 2 (New England Biolabs). The labelling reactions were incubated at 37°C for 3 hours.

Test hybridization to microarrays
The labeled targets were mixed with a hybridisation buffer containing a 50

Microarray imaging and data extraction
Microarrays were scanned using a TECAN LS300 confocal laser microarray scanner at a resolution of 20 μm per pixel with sequential acquisition of 3 images for each microarray slide, using the following laser/emission-filter combinations: 488 nm laser/520 nm filter (for imaging the fluorescent signal from the FAM-labeled polylinker region of the pCR 2.1 TOPO vector); 543 nm laser/590 nm filter (for imaging the fluorescent signal from the hybridized target labeled with Cy-3); 633 nm laser/670 nm filter (for imaging the fluorescent signal from the hybridized target labeled with Cy-5). The use of a third fluorescent dye is not absolutely required and DArT assays can be performed on any two-color scanner as reported in early DArT papers. However, the third dye provides significantly higher sample throughput together with lower assay cost because two samples can be processed on a single array instead of just one as is the case when using a two-color scanner. The signal from the FAM-labeled vector polylinker provided a reference value for quantity of amplified DNA fragment present in each 'spot' of the microarray. The resulting images were analyzed using DArTSoft version 7.44, a program created by Diversity Arrays Technology Pty Ltd for microarray image data extraction, polymorphism detection, and marker scoring (Cayla et al. in preparation). DArTsoft localized the individual spot features of the microarrays from the 16 bit TIFF images generated by the laser scanner and spots with insufficient or absent reference signals were rejected from further analysis. A relative hybridisation intensity value was then calculated for all accepted spots as log [Cy-3 signal/FAM signal] for the targets labelled with Cy-3, and log [Cy-5 signal/FAM signal] for targets labelled with Cy-5. 
DArTSoft then compared the relative intensity values obtained for each clone across all slides/targets to detect the presence of clusters of higher and lower values corresponding to marker scores of '1' and '0' respectively. Targets with relative intensity values that could not be assigned to either of the clusters were recorded as unscored. For each clone, the software generated a range of quality parameters to assist in selection of polymorphic clones. The quality parameters used in this study were: Call Rate (percentage of targets that could be scored as '0' or '1') and a Reproducibility value (reproducibility of scoring between replicated target assays). Two replicates per clone were spotted on each array. The operational array has 15,360 spots in total, comprising two randomly positioned spots for each one of the 7,680 clones. The DArT array is available to the public through Diversity Arrays Technology Pty Ltd (http://www.diversityarrays.com/).