Three minimum tile paths from bacterial artificial chromosome libraries of the soybean (Glycine max cv. 'Forrest'): tools for structural and functional genomics

Background: The creation of minimally redundant tile paths (hereafter MTP) from contiguous sets of overlapping clones (hereafter contigs) in physical maps is a critical step for structural and functional genomics. Build 4 of the physical map of soybean (Glycine max L. Merr. cv. 'Forrest') showed the 1 Gbp haploid genome was composed of 0.7 Gbp diploid, 0.1 Gbp tetraploid and 0.2 Gbp octoploid regions. Therefore, the size of the unique genome was about 0.8 Gbp. The aim here was to create MTP sub-libraries from the soybean cv. Forrest physical map builds 2 to 4.
Results: The first MTP, named MTP2, was 14,208 clones (of mean insert size 140 kbp) picked from the 5,597 contigs of build 2. MTP2 was constructed from three BAC libraries (BamHI (B), HindIII (H) and EcoRI (E) inserts). MTP2 encompassed the contigs of build 3 that derived from build 2 by a series of contig merges. MTP2 encompassed 2 Gbp compared to the soybean haploid genome of 1 Gbp and does not distinguish regions by ploidy. The second and third MTPs, called MTP4BH and MTP4E, were each based on build 4. Each was semi-automatically selected from 2,854 contigs. MTP4BH was 4,608 B and H insert clones of mean size 173 kbp in the large (27.6 kbp) T-DNA vector pCLD04541. MTP4BH was suitable for plant transformation and functional genomics. MTP4E was 4,608 BAC clones with large inserts (mean 175 kbp) in the small (7.5 kbp) pECBAC1 vector. MTP4E was suitable for DNA sequencing. MTP4BH and MTP4E clones each encompassed about 0.8 Gbp, the 0.7 Gbp diploid regions and 0.05 Gbp each from the tetraploid and octoploid regions. MTP2 and MTP4BH were used for BAC-end sequencing, EST integration, micro-satellite integration into the physical map and high information content fingerprinting. MTP4E will be used for genome sequence by pooled genomic clone index.
Conclusion: Each MTP and associated BES will be useful to deconvolute and ultimately finish the whole genome shotgun sequence of soybean.


Background
The construction of a fingerprint-based physical map in soybean (Glycine max L. Merr.) has relied on three large insert genomic libraries [1,2]. Two large insert DNA libraries were constructed in pCLD04541, a 27.6 kbp oriT-based, low copy per cell, T-DNA binary plasmid vector that carries nptII and confers tetracycline resistance [Genbank # AF184978; 3]. Partial digestion of high molecular weight genomic DNA was with BamHI (B) or HindIII (H) [1]. One large insert DNA library was constructed in the pECBAC1 vector [4], a 7.5 kbp oriS-based, single copy per cell plasmid vector that confers chloramphenicol resistance. Partial digestion of high molecular weight genomic DNA was with EcoRI (E) [2]. The mean insert sizes across libraries were 125 ± 5 kbp (B), 135 ± 5 kbp (H) and 157 ± 10 kbp (E). All three libraries contained a substantial proportion (10-20%) of BAC clones with inserts of 160-240 kbp. Clones larger than 240 kbp existed [3], but most were shown to be chimaeras and contaminants that had to be removed for physical map development [5].
The physical map of soybean was developed from sequencing PAGE separation of HindIII, HaeIII restriction digest fragments [6]. About 35-50 bands per clone could be used for contig builds by FPC [7]. The use of three BAC libraries generated with three different restriction enzymes in two different plasmid vectors has avoided biases in genome representation [2]. The fingerprint method has several advantages over agarose gel and capillary sequencing gel fingerprinting [8].
Build 1, the first publicly available physical map tool for soybean, was based on 30,000 BAC fingerprints. The data were available for a short time [9]. Build 2 appeared to resolve clone identification issues from Build 1, and consisted of 5,597 contigs (Table 1). The contigs were built at varying cut-off stringencies (e-18 to e-28). Build 3 contigs derived from build 2 contigs by manual editing that included merging potentially overlapping contigs and splitting contigs with more than 12 BACs per unique band [2,6,9,10]. In addition to 2,901 contigs (Table 1), build 3 incorporated DNA markers and was the first functional build available to public soybean research. Build 3 contigs were provided for viewing through a genome browser interface at SoyGD [10] in the context of soybean genome information including DNA markers, BAC end sequence, QTL and EST hybridizations [11,12].
There were three reasons to create an improved build of the soybean genome (build 4): detection of contig merge and split errors during manual re-examination of the soybean libraries for MTP selection [5]; the detection of potential neighboring well contaminants that cause false merges [5]; and the concern that the build 3 map encompassed 1.4 Gbp, about 0.3 Gbp more than the soybean genome [2,6,9].
The result of Build 4 was 2,854 contigs (Table 1). Build 4 contigs were provided for viewing through a separate window to build 3 in the SoyGD genome browser interface [10]. Build 4 is shown in the context of soybean genome information including DNA markers, BAC end sequence, QTL and EST hybridizations [11,12]. Development of robust contig sets and increased demands for libraries and filter sets from the research community prompted the development of minimally redundant tile paths (MTP).
MTPs derived from physical maps have been used widely among genomes where physical maps exist [7,13]. MTPs may be used to develop whole genome sequences [14-17] and to examine synteny among genomes [18]. MTPs were used to identify chromosomal rearrangements [19] and to determine the timing of DNA replication among chromosomes [20]. MTPs are useful for genetic map development [21], gene map development [22,23] and positional cloning [12,24]. In each area of use the ability to use MTP BAC clones for cell transformation can be important.
Binary vectors have been used to develop BAC libraries for fungi, plants and animals [17,25,26]. The libraries facilitate Agrobacterium-mediated transformation with large DNA fragments to fungal [27], plant [25,28] and animal [29] cells. The stability of different clones varies with insert size, origin of replication, cloning site and Agrobacterium strain [2,6,30]. Therefore to realize the potential of clones for functional genomics, a heterogeneous series of clones covering an interval should be used. Overlapping sets of MTP derived from physical maps built from BAC fingerprints can provide a heterogeneous series of clones covering an interval [2,6].
The creation of soybean minimum tiling path libraries (MTPs) was undertaken in order to reduce the number of clones required for functional and structural genomics. Uses have included EST integration, high information content fingerprinting (HICF), high throughput marker integration, BAC-end sequence synteny analysis and soybean genome sequence.

Results and discussion
Three MTP sub-libraries were created. All three MTP can be viewed in Gbrowse at SoyGD (Figure 2). Filters and plates containing the MTPs are available on request at significantly lower cost than whole libraries.

MTP2
MTP2 consisted of thirty-seven 384-well plates (13 HindIII, 16 EcoRI and 8 BamHI) created from 5,597 contigs (supplementary Table 2). The 14,208 clones of MTP2 appeared to have a mean insert size of 140 ± 5 kbp as judged by 35 useful bands per fingerprint, and by agarose gel electrophoresis after digestion with NotI, HindIII, EcoRI or BamHI (not shown). The 140 kbp insert size is intermediate between the insert sizes of the three contributing libraries and may be a conservative estimate. Therefore, MTP2 encompassed at least 2 Gbp compared to the soybean haploid genome of 1 Gbp. Each region of the genome should be represented twice. However, since builds 2 and 3 did not distinguish regions by ploidy, some highly conserved tetraploid and octoploid regions may be represented 4-8 fold.
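The plate and coverage figures quoted above can be checked with a few lines of arithmetic. The following is only a sketch: the variable and function names are illustrative and not part of the original work, and the 140 kbp mean is taken at face value even though the text notes it may be conservative.

```python
# Figures quoted in the text; all names here are illustrative only.
WELLS_PER_PLATE = 384
PLATES_BY_LIBRARY = {"HindIII": 13, "EcoRI": 16, "BamHI": 8}
MEAN_INSERT_KBP = 140        # reported mean insert size of MTP2 clones
HAPLOID_GENOME_GBP = 1.0     # reported soybean haploid genome size


def total_clones(plates_by_library, wells_per_plate=WELLS_PER_PLATE):
    """Number of clones in a set of completely filled plates."""
    return sum(plates_by_library.values()) * wells_per_plate


def span_gbp(n_clones, mean_insert_kbp):
    """Total DNA carried by the clones, in Gbp (1 Gbp = 1,000,000 kbp)."""
    return n_clones * mean_insert_kbp / 1_000_000


clones = total_clones(PLATES_BY_LIBRARY)      # 37 plates x 384 wells
span = span_gbp(clones, MEAN_INSERT_KBP)      # roughly 2 Gbp
fold = span / HAPLOID_GENOME_GBP              # roughly 2-fold redundancy
print(clones, round(span, 2), round(fold, 1))
```

With the reported values, the 37 plates hold 14,208 clones that together carry close to 2 Gbp, which is the basis for the roughly two-fold representation stated above.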
Example of procedure used to select clones for the build 4 minimum tiling path
1. Two columns of contig data were sorted in Excel, one with the lowest "Left" position, one with the highest "Right" position.
2. If the Left and Right positions overlapped, no further work was required. For contig 411, B07M12 covers positions 10-56 and clone H62K08 covers positions 63-125. The gap from 56 to 63 is covered by H39P24.
3. In the case of contig 412, H42K07 spans both the left and right positions.
4. In order to reduce redundant sequencing, each selected clone was compared with end-sequenced clones. B07M12 is a previously sequenced clone, so it is replaced by B12G18. Although this results in a loss of 3 band positions, this was deemed an acceptable loss (3 bands was the maximum acceptable loss for these steps).
MTP2 has been used since 2003 for BAC-end sequencing, EST hybridization, DNA micro-satellite marker integration into the physical map and positional cloning [11,12,21,23,31].

MTP4BH
The second and third MTPs, called MTP4BH and MTP4E, were each based on build 4. Each was semi-automatically selected from 2,854 contigs. MTP4BH was 12 plates that contained 4,608 HindIII or BamHI insert clones. The inserts of selected clones were of mean size 173 kbp as judged by 43 useful bands per fingerprint, and by agarose gel electrophoresis after digestion with NotI, HindIII or BamHI (Figure 3). The 173 kbp insert size is much larger (by 38-48 kbp) than the mean insert sizes of the 2 contributing libraries, suggesting large insert BACs were selected and small insert BACs avoided. Therefore, MTP4BH encompassed about 0.8 Gbp compared to the soybean haploid genome of 1 Gbp. MTP4BH encompassed about 0.7 Gbp from the 4,032 clones selected from diploid regions found in the contigs 1-2204. MTP4BH contained about 0.1 Gbp from the 576 clones selected to represent the tetraploid and octoploid regions found in the 646 contigs in the 8000 and 9000 series. Each unique region of the genome should be represented once since build 4 did distinguish regions by ploidy.
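The split of MTP4BH between diploid and polyploid regions can be tallied in the same way. This sketch assumes, for illustration only, that the 173 kbp mean insert size applies equally to clones from both classes of contig; the source reports a single mean for the whole set. The dictionary keys and function name are hypothetical.

```python
# Clone counts and mean insert size as reported for MTP4BH.
MEAN_INSERT_KBP = 173
WELLS_PER_PLATE = 384
CLONES_BY_REGION = {
    "diploid": 4032,                   # contigs 1-2204
    "tetraploid_and_octoploid": 576,   # contigs in the 8000 and 9000 series
}


def span_gbp(n_clones, mean_insert_kbp=MEAN_INSERT_KBP):
    """DNA carried by n_clones, in Gbp, assuming one shared mean insert size."""
    return n_clones * mean_insert_kbp / 1_000_000


spans = {region: span_gbp(n) for region, n in CLONES_BY_REGION.items()}
total_clones = sum(CLONES_BY_REGION.values())
plates, leftover = divmod(total_clones, WELLS_PER_PLATE)

for region, gbp in spans.items():
    print(f"{region}: {gbp:.3f} Gbp")
print(f"total: {sum(spans.values()):.3f} Gbp in {plates} plates")
```

Under that assumption the two classes contribute about 0.7 Gbp and 0.1 Gbp respectively, and the 4,608 clones fill exactly 12 plates, in line with the figures in the text.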
MTP4BH inserts are present in the large (27.6 kbp) T-DNA vector pCLD04541. MTP4BH was suitable for plant transformation and functional genomics. BACs are readily transferred to Agrobacterium tumefaciens by triparental matings or electroporation (not shown). Plant cells, roots and whole regenerants resistant to kanamycin and containing BAC DNA as judged by PCR and Southern hybridization have been selected.
Two plates of MTP4BH clones were derived from MTP2 and therefore were used for BAC-end sequencing and EST integration. The 12 plates of the MTP4BH were used for HICF and have been sent for BAC end sequencing. Plates 11 and 12 contained MTP2 BAC end sequenced clones [31]. A thirteenth plate contains 386 clones picked redundantly from 6 octoploid contigs to contrast with the low redundancy of the MTP4BH plates in use. The plate has been used to test other methods of HICF fingerprinting (with Dr. MAC Luo, unpublished) and some clone pairs were used for DNA sequencing (with Dr. Gane Wu, unpublished) to determine what methods might separate highly conserved homeologous regions.

MTP4E
MTP4E was 4,608 BAC clones with large inserts (mean 175 kbp) in the small (7.5 kbp) pECBAC1 vector. MTP4E was suitable for DNA sequencing. MTP4BH and MTP4E clones each encompassed about 0.8 Gbp, the 0.7 Gbp diploid regions and 0.05 Gbp each from the tetraploid and octoploid regions.
MTP4E was 12 plates that contained 4,608 EcoRI insert clones. The inserts of selected clones were of mean size 175 kbp as judged by 43.5 useful bands per fingerprint, and by agarose gel electrophoresis after digestion with NotI or EcoRI. The 175 kbp insert size is 18 kbp larger than the mean insert size of the contributing EcoRI library (157 kbp), suggesting large clones were selected. Therefore, MTP4E encompassed slightly more than 0.8 Gbp compared to the soybean haploid genome of 1 Gbp. MTP4E encompassed about 0.7 Gbp from the 4,032 clones selected from diploid regions found in the contigs 1-2204. MTP4E contained about 0.1 Gbp from the 576 clones selected to represent the unique portion of the 0.35 Gbp tetraploid and octoploid regions found in the 646 contigs in the 8000 and 9000 series. Each unique region of the genome should be represented once, since build 4 did distinguish regions by ploidy. A thirteenth plate contained 386 clones that were picked redundantly from 6 octoploid contigs to contrast with the low redundancy of the MTP4E plates in use. MTP4E will be used for genome sequence by pooled genomic clone index.

Sources of error in build 2-based MTP construction
The high number of contaminated clones present during the build 2 automated contig procedure, and build 3 contig merges and splits, created a high number (1040) of "Q" scores (Table 1). Each of these Q scores represents a clone in a contig that has bands that do not correctly match other clones in the contig. A macro-based procedure for contaminated and chimaeric clone identification was created [31]. From the list generated, each contig was inspected in order to find an alternate route across the contig without using these contaminated clones. The reanalysis of almost every contig created an MTP with an estimated 3-4 thousand extra clones. The inclusion of almost every chimaeric clone in every contig was balanced by the inclusion of a non-chimaeric BAC from the same region.

Benefits of MTP2
As the first available MTP for any legume, MTP2 clones were widely used. The HindIII and BamHI-based plates were used for BAC end sequencing [31] and for high throughput EST array hybridization [23,32]. The primary benefits of this MTP were early availability and comprehensive nature through the 2 fold redundancy and use of all three libraries.

Sources of error in MTP4
Merging errors may have occurred when contig to contig merges were performed during build 4. Errors accumulate when aligning the contigs as dictated by the FPC program.
Acceptance of 3-4 band loss during MTP semi-automated selection allowed MTP4 to be as small, minimal and nonredundant as the fingerprint database could allow. MTP4 was 12 plates compared to the 38 plates of MTP2. The 3-4 bands lost per clone and ~300 clones removed represent ~900 unique bands, or ~3.6 Mbp (<0.4%) of the genome that is not present in MTP4. However, small numbers of unique bands found in only one clone are often derived from chimaeric inserts composed of a small insert and a large insert from different genomic regions fused during ligation. Small insert contaminants are found in BAC libraries [1]. Therefore, the actual loss in genome coverage is expected to be significantly less than 3.6 Mbp.
Separation of MTP4 into two parts, MTP4BH and MTP4E, reduced the representation of the MTP4 sets. The cloning efficiency of different soybean genomic regions varies. The restriction enzyme used [2] and the cloning vector selected [4,8] are major determinants of cloning efficiency. When both MTP4BH and MTP4E were used, the MTP4 genome representation exceeded MTP2 genome representation, but in 24 plates compared to 38 plates.
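The band-loss estimate given in this section (~900 bands, ~3.6 Mbp, <0.4% of the genome) can be reproduced as a rough calculation. The per-band size is not stated explicitly in the text; the sketch below assumes it can be approximated from the MTP4BH figures (173 kbp mean insert, 43 useful bands per fingerprint), and takes the lower bound of 3 bands lost per clone. All names are illustrative.

```python
# Reported values; the kbp-per-band conversion is an assumption of this sketch.
MEAN_INSERT_KBP = 173        # MTP4BH mean insert size
BANDS_PER_CLONE = 43         # useful bands per MTP4BH fingerprint
CLONES_REMOVED = 300         # approximate number of clones removed
BANDS_LOST_PER_CLONE = 3     # lower bound of the accepted 3-4 band loss
GENOME_MBP = 1000            # 1 Gbp haploid genome

kbp_per_band = MEAN_INSERT_KBP / BANDS_PER_CLONE       # about 4 kbp per band
bands_lost = CLONES_REMOVED * BANDS_LOST_PER_CLONE     # about 900 bands
loss_mbp = bands_lost * kbp_per_band / 1000            # about 3.6 Mbp
loss_percent = 100 * loss_mbp / GENOME_MBP             # below 0.4 %

print(f"{kbp_per_band:.2f} kbp/band, {bands_lost} bands, "
      f"{loss_mbp:.1f} Mbp, {loss_percent:.2f}% of genome")
```

The same ratio of roughly 4 kbp per band also follows from the MTP2 figures (140 kbp, 35 bands) and the MTP4E figures (175 kbp, 43.5 bands), so the choice of library has little effect on the estimate.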

Benefits of build 4-based MTP
Separation of MTP4 into MTP4BH and MTP4E provided several benefits. Most contigs (98%) are represented in both sets, producing a robust MTP4. Antibiotic differences between the two vectors are more easily managed. Separation within MTP4BH included the previously sequenced HindIII and BamHI clones that were placed in plates 11 and 12. Inclusion of the anchored clones within the unsequenced clones would have resulted in over 1500 redundant end sequences. All plates except 11 and 12 were end sequenced (Genbank # DX406713 to DX414412; 7,700 sequences). If the 4,608 clones were sequenced by STC, PGI or related methods [33-35], reads should be from non-redundant BACs that provide largely original sequence data.
The two additional plates of clones from several repeated contigs were included regardless of sequence status or inclusion in MTP2. The clone sets provide a contrast to the low redundancy of the MTPs. Singleton clones are represented by the inclusion of 117 H and B BACs with band numbers in the range of 40-42. If non-chimaeric and unique, the clones represent ~4.6 million base pairs (Mbp) of DNA, or ~0.5% of available genomic DNA.

Conclusion
The utility of minimum tiling paths has been demonstrated first by the use of subsets of the build 2-based MTP for BAC-end sequencing [31], EST integration [23,32], new micro-satellite integration into the physical map [21] and high information content fingerprinting (unpublished). In progress with the MTPs are further BAC end sequencing (for MTP4E), integration of fingerprints with an emerging physical map of soybean cv. Williams, development of new large DNA marker sets and genome sequence anchoring. The quality of the physical map will be tested by the release of the whole genome shotgun sequence by DOE in 2007-2008. Currently, BAC end derived microsatellite markers are providing excellent tools for placing contigs on the genetic and physical map and testing contigs already placed by satellite to BAC anchoring (Figure 4).
An interesting by-product of two, semi-overlapping MTPs is that a method for determining the efficiency and accuracy of a procedure can be provided, without the need to pre-select a set of clones. An example was provided by comparisons of MTP4 clones in which, by definition, most clones do not overlap, in a HICF fingerprint map. The inclusion of the same or an overlapping clone will provide answers in that procedure relating to accuracy, and repeatability. The use of an MTP based on a contig build also creates an efficient, well-spaced set of STS and markers [21]. The STS provide an opportunity to sample regions of the genome more equally than with markers located by genetic recombination alone.
Tools for MTP selection from paleopolyploid genomes will be developed further. The goal would be to completely automate the procedure for MTP selection using available FPC program output. A second goal was the creation of local automation protocols that would use available robotics to assemble new MTPs from future builds. The usefulness of creating a new MTP for each major build is demonstrated. New MTP development becomes exceedingly complex if previous MTP clones are used. The cost of duplication of effort can be high and should be avoided.
MTP2 and MTP4 are largely unordered. In future, clones will be re-racked in map order so that neighboring clones in plate rows are neighbors on the soybean chromosomes. The chromosome arrays are useful for DNA sequencing by pooled genomic array [33], the analysis of synteny [30,35], the identification of chromosomal rearrangements and the study of the timing of replication of chromosomes [20]. Among soybean germplasm, and other inbreeding legumes, chromosomal rearrangements are common.
Plant transformation with MTP4BH clones is an important future goal. In collaboration with Dr. Zhanjuan Zhang (U. of Missouri) the challenges that exist will be addressed. First, BIBAC vectors use the nptII gene for selection of kanamycin resistance. The nptII gene has allowed selection of soybean transformants and regenerants [36] but at lower efficiency than hygromycin or Roundup selectable markers. Second, the insert stability can depend on size, gene content, repeat content, ploidy and Agrobacterium genotype. In future functional complementation with dominant genes in Forrest will be attempted, including pubescence color [37] and significant soybean disease resistance and susceptibility alleles on isolated BIBAC clones [12,24].

Build 2 minimum tiling path
Selection of clones was based on visual inspection of contigs. During this inspection, clear evidence of well contamination [5] was discovered (Figure 1; step 4; multiple BamHI plate 17 clones). Each contig was processed to produce a list of clones. The clone lists were used to locate any BAC fingerprints that may have represented more than one clone or fragment of DNA. Each contig that had a potentially contaminated or chimaeric clone was re-examined, with a list being generated for all clones that were possibly contaminated. MTP generation is biased toward larger clones so that many potentially contaminated clones were discovered. Possible contaminants were retained in the dataset but a secondary route of clones covering the same region was added to the list.
The edited list of MTP2 clones was sorted alphabetically, in order to allow sequential picking from stock library plates. Groups of 96 clones were printed on an 8 × 12 columnar page. Pages were used as guides to pick from the 384 well plates into 96-well plates. As each plate was picked, it was incubated at 37°C, 230 rpm for 16-24 hours. Once the cell growth was clearly established, a 96 pin replicating tool was used to replicate the cells into the final 384 well master plate containing freeze media and LB broth. The 384 well plate was grown overnight at 37°C, with no shaking. After incubation, each plate was replicated, to create a working copy and stored at -80°C.
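The sorting and paging step can be sketched as follows. This is not the procedure actually used (the lists were prepared by hand and in spreadsheets); the function name `pick_sheets` is hypothetical, and filling each page column by column is only one reading of "columnar page".

```python
def pick_sheets(clone_names, rows=8, cols=12):
    """Sort clone names and split them into pick-sheet pages of rows x cols.

    Returns a list of pages; each page is a list of rows, and unused
    positions on the last page are left as None.
    """
    ordered = sorted(clone_names)          # alphabetical, as in the text
    page_size = rows * cols
    pages = []
    for start in range(0, len(ordered), page_size):
        chunk = ordered[start:start + page_size]
        grid = [[None] * cols for _ in range(rows)]
        for i, name in enumerate(chunk):
            # Assumed layout: fill down each column before moving right.
            grid[i % rows][i // rows] = name
        pages.append(grid)
    return pages


# Clone names taken from the contig 411 and 412 examples in this paper.
example = ["H62K08", "B07M12", "H39P24", "B12G18", "H42K07"]
first_page = pick_sheets(example)[0]
print([row[0] for row in first_page])      # first column of the first page
```

Because the plate number is zero-padded within each clone name, a plain alphabetical sort groups clones by library and then by stock plate, which is what makes sequential picking possible. Calling `pick_sheets(names, rows=16, cols=12)` would give the 192-clone pages described for the build 4 sets.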

Build 4 minimum tiling path
Selection of clones was based on numeric position generated by the FPC file program and represented in the FPC file as "Bands Left" and "Bands Right" [7]. The separate MTP4BH and MTP4E plates represented the best path within each contig using only clones from the selected libraries. Separation of MTP4 into two parts was driven by the different antibiotic resistances among the libraries and different uses of the plasmid vectors.
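Selection on left and right band positions amounts to covering each contig with as few clones as possible. The sketch below uses a standard greedy interval cover, which is a stand-in and not the semi-automated spreadsheet procedure used here. The function name, the `exclude` parameter, and the positions given for H39P24 and B12G18 are hypothetical; only the positions of B07M12 and H62K08 come from the contig 411 example. Clones that merely touch are treated as contiguous, which is a simplification.

```python
def select_tile_path(clones, exclude=()):
    """Greedy minimal cover of one contig.

    clones:  iterable of (name, left, right) band positions.
    exclude: clone names to avoid, e.g. previously end-sequenced clones.
    Returns the chosen clone names in left-to-right order.
    Raises ValueError if the allowed clones leave a gap.
    """
    allowed = sorted((c for c in clones if c[0] not in exclude),
                     key=lambda c: (c[1], -c[2]))
    if not allowed:
        return []
    target_right = max(right for _, _, right in allowed)
    covered_to = allowed[0][1]          # leftmost allowed position
    path, i = [], 0
    while True:
        best = None
        # Of the clones starting inside the covered region,
        # keep the one that reaches furthest to the right.
        while i < len(allowed) and allowed[i][1] <= covered_to:
            if best is None or allowed[i][2] > best[2]:
                best = allowed[i]
            i += 1
        if best is None or (path and best[2] <= covered_to):
            raise ValueError(f"gap after band position {covered_to}")
        path.append(best[0])
        covered_to = best[2]
        if covered_to >= target_right:
            return path


contig_411 = [
    ("B07M12", 10, 56),     # positions from the text
    ("H62K08", 63, 125),    # positions from the text
    ("H39P24", 50, 70),     # hypothetical positions bridging the 56-63 gap
    ("B12G18", 13, 56),     # hypothetical positions, 3 bands short on the left
]
print(select_tile_path(contig_411))
print(select_tile_path(contig_411, exclude={"B07M12"}))
```

In the second call the previously sequenced clone B07M12 is excluded, so the path starts from B12G18 and three band positions at the left end are left uncovered, mirroring the accepted loss described for this contig.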
The MTP4BH and MTP4E lists of clones were sorted alphabetically, in order to allow sequential picking from stock library plates. Groups of 192 clones were printed on a 16 × 12 columnar page and used as a guide to pick into a 384-well plate (2 pages per plate). As each plate was inoculated, it was incubated overnight at 37°C, with no shaking. After incubation, each plate was replicated to create a working copy and stored at -80°C.

Pulsed field gel electrophoresis
Isolation of BAC DNAs and restriction digestion with NotI, BamHI, HindIII or EcoRI were as described in Meksem et al. [1]. Band sizes were estimated by comparison to lambda genome concatemers.

Abbreviations
BAC, bacterial artificial chromosome (or large insert) clone.
BIBAC, T-DNA binary bacterial artificial chromosome (or large insert) clone.
Contig, contiguous overlapping set of BAC clones.
HICF, high information content fingerprinting.
MTP, minimum tiling path.
Figure 4. Diagram of the process to use repeat motifs in BAC end sequence to simultaneously anchor contigs and improve the genetic map. Panel A: Two contigs from a conserved duplicated region each contain the same mapped genetic markers (black symbols), 98% similar sequences (grey symbols) or distinct sequences (white). Panel B: New markers made from BES can distinguish the regions when amplifying genomic DNA [21] more efficiently than pooled BAC DNAs of reducing complexity among the anchors described previously [11].