A high-throughput amplicon-based method for estimating outcrossing rates

Background The outcrossing rate is a key determinant of the population-genetic structure of species and their long-term evolutionary trajectories. However, determining the outcrossing rate using current methods based on PCR-genotyping individual offspring of focal plants for multiple polymorphic markers is laborious and time-consuming. Results We have developed an amplicon-based, high-throughput enabled method for estimating the outcrossing rate and have applied this to an example of scented versus non-scented Capsella (Shepherd’s Purse) genotypes. Our results show that the method is able to robustly capture differences in outcrossing rates. They also highlight potential biases in the estimates resulting from differential haplotype sharing of the focal plants with the pollen-donor population at individual amplicons. Conclusions This novel method for estimating outcrossing rates will allow determining this key population-genetic parameter with high-throughput across many genotypes in a population, enabling studies into the genetic determinants of successful pollinator attraction and outcrossing. Electronic supplementary material The online version of this article (10.1186/s13007-019-0433-9) contains supplementary material, which is available to authorized users.


Background
The rate at which individuals in a population outcross has a major impact on the genetic structure of the population and its responses to natural selection [1,2]. While outcrossing maximizes the heterozygosity in a population, selfing or inbreeding between relatives increases homozygosity. This, in turn, has a number of consequences, such as the phenotypic expression of recessive deleterious mutations, also known as inbreeding depression [3], and a reduced rate of effective recombination, as crossing over between homozygous chromosomes does not lead to the formation of genetically recombinant gametes [4]. Over time, such a reduced effective rate of recombination leads to an increased length of haplotype blocks in linkage disequilibrium and of linked selection [4]. In addition, inbreeding reduces the effective population size, and as a result, the relative importance of genetic drift increases compared to that of selection [5].
The reduced efficacy of purifying selection also increases the risk of fixation of deleterious mutations and influences species extinction rates [6]. Thus, the outcrossing rate is a key determinant of several population-genetic parameters with a major influence on long-term evolutionary trajectories of populations [2,7].
In contrast to animals, where outbreeding enforced by dioecy is seen in the majority of species [8], most flowering plants are hermaphrodites [9]. While many plant lineages have evolved genetic self-incompatibility and other mechanisms to enforce or promote outbreeding, mixed mating is very common in plants [10]. In mixed-mating species, a fraction of the progeny of a plant is derived from selfing, while the rest is the result of outbreeding. Therefore, estimating the outcrossing rate of plants in a population is an important aspect of studies in plant reproductive systems.
Classically, the outcrossing rate is estimated by genotyping a large number of progeny individuals from a focal individual for several microsatellite or SNP markers and determining the fraction of genotypes that cannot have been produced by selfing at each marker [11,12]. From

Open Access
Plant Methods *Correspondence: michael.lenhard@uni-potsdam.de 1 Institute for Biochemistry and Biology, University of Potsdam, Karl-Liebknecht-Str. 24-25, House 26, 14476 Potsdam-Golm, Germany Full list of author information is available at the end of the article these data, rates of outcrossing and other parameters of the breeding system can then be estimated [13,14]. While this approach can provide a rich and nuanced picture of the breeding system in a population, it is laborious and thus not readily amenable to be used in a high-throughput manner. Examples for questions that require such a high-throughput approach would be the following. How does the breeding system of a species depend on different environmental conditions? Are rates of outcrossing stable within a population over different years? And how does variation in floral characteristics influence outbreeding rates? Answering this kind of question requires estimating outcrossing rates for a large number of focal individuals, which would be prohibitive when done by genotyping many progeny individuals per focal plant.
Concrete examples for the last of the three mentioned question are studies to determine the relevance of different traits presumed to help in pollinator attraction, such as large and showy petals, emission of floral scent, and nectar amount and composition [15]. These traits often undergo large changes, when the breeding system changes from predominant outbreeding to selfing [16]. This transition is generally accompanied by the evolution of the so-called selfing syndrome, comprising a reduction in flower size, especially that of petals, in scent and nectar production and in the ratio of pollen to ovules per flower. One example where the genetic basis of the evolution of the selfing syndrome is being studied is the genus Capsella [17,18]. This genus contains three diploid species, two of which (C. rubella and C. orientalis) represent independently derived selfers that have diverged from an outbreeding ancestor represented by presentday C. grandiflora between 100,000 and 200,000 years ago and between one and two million years ago, respectively [19][20][21]. Several loci have by now been identified that have contributed to the reduction in petal size and in floral scent in C. rubella compared to C. grandiflora [22][23][24]. While this is starting to shed light on the molecular basis and evolutionary history of selfing-syndrome traits, understanding the ecological consequences of changes in presumed pollinator-attraction traits remains a major challenge. That said, several key biological materials have become available as part of the process of gene identification -such as quasi-isogenic lines (qILs) that only segregate for a very small chromosomal segment containing a causal gene, but are essentially isogenic otherwise-that will enable rigorously testing the effect of a given trait change on the interaction with pollinators and herbivores, including on the outcrossing rate. However, as outlined above, such studies would greatly benefit from a high-throughput method for estimating outcrossing rates from many individual plants differing in a gene and thus a trait of interest.
Against this background of work on Capsella, we set out to establish and evaluate a high-throughput compatible method for estimating and comparing outcrossing rates. This method is based on Illumina sequencing of PCR fragments amplified from pooled progeny individuals of a plant and estimating the outcrossing rate from the frequency of non-maternal haplotypes.

Plant material and growth conditions
Two Capsella qILs only differing in about 12 kb around the locus for loss of benzaldehyde emission were compared here [23]. A self-compatible Capsella grandiflora line and qILs segregating for petal size were used as sources to provide pollen for outbreeding [17,24]. At the beginning of April 2017, seeds were sown on a soilcompost mix and watered with GA-supplemented water (gibberellic acid stock: 16.5 mg in 1 ml ethanol, 1:5000 dilution for watering) until germination. After sowing, seeds were stratified for 4 days in 4 °C and then transported into an open greenhouse. Seedlings were pricked out into individual pots when they showed 4-6 leaves at 3 weeks after germination and were planted out in the plots just before starting to flower in mid-May and bird or insect protection nets were installed. The set-up is shown in Fig. 1. Plants flowered during June and July and were bagged in July to be harvested in August. When the oldest fruits started to ripen, individual plants were bagged and allowed to ripen for two more weeks before collection and preparation for DNA extractions. Due to weather conditions and infection with herbivores low numbers of seeds were harvested from each plant. Seeds from all plants with the same genotype within one plot were pooled and cleaned. To be able to define parental haplotypes, we also collected leaf material from the two qILs and subjected it to the same analysis as the seed samples.

DNA extraction from pooled seeds
This method was optimized for extracting DNA from Capsella seeds. Depending on seed number/size it may need to be adapted. Other extraction methods such as CTAB [25] did not give any results.

Design of PCR primers
Next-generation sequencing libraries relying on PCRbased approaches are known to be prone to numerous biases linked to amplicon length and GC-content, sequence heterogeneity at primer-annealing sites as well as copy number variation [26][27][28][29]. To limit amplification biases, we focused our analysis on low polymorphic genes present in a single copy within genomes.
To this end, we retrieved sequences from single-copy nuclear genes from C. rubella reference genome (https ://phyto zome.jgi.doe.gov/pz/porta l.html) based on [30]. Conserved sites within the Capsella genus were identified by comparing the sequence of 50 C. rubella and 193 C. grandiflora genomes. Primers were designed using Primer 3 plus in order to amplify amplicons of approximately 300 bp and to anneal to their templates at 55 °C. We added 33 and 34 nt sequences complementary to the forward and reverse index primers to each of the gene-specific forward and reverse primers, respectively. Sequencing quality will also depend on the heterogeneity of sequences as it is used during the first amplification cycles for cluster identification and phasing/pre-phasing calibration [31]. Low-sequence heterogeneity may impair the distinction between different clusters and considerably limit the output of the sequencing run. A common solution to such issues is to co-sequence amplicon libraries with a heterogeneous control library (e.g. usually prepared from the bacteriophage PhiX genome and mixed at variable proportions, between 15 and 50%) with the drawback that a large number of reads will be lost as they will not correspond to the sequence of the target locus. Here, we introduce sequence heterogeneity by designing an additional primer pair for each target loci containing an additional nucleotide between the gene-specific sequence and those complementary to the indexing primers (Tables 1, 2).

PCR amplification and library generation
The method for PCR amplification and library preparation is based on a recent protocol for amplicon-based microbiome characterization [32]. Further details and troubleshooting information can be found in [32].
This protocol is optimized for low sample DNA concentrations due to the availability and quality of starting material. For best results, minimize PCR cycles and select the samples with the highest concentrations in step 1c. Sample dilution might result in bottlenecking for low abundance alleles as described [32].

Primary PCR
a. Create a sample dilution plate.
Vortex DNA samples and spin down in a centrifuge. Prepare a 384-well plate by pipetting 18 µl water in quadrant 2 (A02), 3 (B01) and 4 (B02). Dispense 10 µl of the undiluted sample into quadrant 1 (A01). Generate a tenfold dilution series by transferring 2 µl of each sample to quadrant 2. Pipet up and down ten times for mixing. Repeat for quadrants 3 and 4. In the end, you will have your undiluted sample in quadrant 1, 1:10 dilution in quadrant 2, 1:100 dilution in quadrant 3 and 1:1000 dilution in quadrant 4. 1_Carubv10013869m_F  TCG TCG GCA GCG TCA GAT GTG TAT AAG AGA CAG CCA CTC ATC CAT TCG GAAAT   1_Carubv10013869m_R  GTC TCG TGG GCT CGG AGA TGT GTA TAA GAG ACA GTT GGG GAC AAG GTG CTA ATC   2_Carubv10023806m_F  TCG TCG GCA GCG TCA GAT GTG TAT AAG AGA CAG TAC CGA CCA CAT AGG CATCA   2_Carubv10023806m_R  GTC TCG TGG GCT CGG AGA TGT GTA TAA GAG ACA GAA TGG CCG ATT CTG CTT TTA   3_Carubv10018138m_F  TCG TCG GCA GCG TCA GAT GTG TAT AAG AGA CAG CAA GCC AAA GTT TGA TGCTT   3_Carubv10018138m_R  GTC TCG TGG GCT CGG AGA TGT GTA TAA GAG ACA GAC TCG TCT GCA GTC ATG GTG   4_Carubv10001640m_F  TCG TCG GCA GCG TCA GAT GTG TAT 2T_Carubv10023806m_F  TCG TCG GCA GCG TCA GAT GTG TAT AAG AGA CAG TTA CCG ACC ACA TAG GCA TCA   2T_Carubv10023806m_R  GTC TCG TGG GCT CGG AGA TGT GTA TAA GAG ACA GTA ATG GCC GAT TCT GCT TTT A   3A_Carubv10018138m_F  TCG TCG GCA GCG TCA GAT GTG TAT AAG AGA CAG ACA AGC CAA AGT TTG ATG CTT   3A_Carubv10018138m_R  GTC TCG TGG GCT CGG AGA TGT GTA TAA GAG ACA GAA CTC GTC TGC AGT CAT GGT G   4T_Carubv10001640m_F  TCG TCG GCA GCG TCA GAT GTG TAT AAG AGA CAG TGG AAG CGG ATG GTT ACA AAA   4T_Carubv10001640m_R  GTC TCG TGG GCT CGG AGA TGT GTA TAA GAG ACA GTA GGC CAA GCT CAC TCA CAT T   5A_Carubv10001924m_F  TCG TCG GCA GCG TCA GAT GTG TAT AAG AGA CAG ATG GGT TCA GAT TGA GCG TAA   5A_Carubv10001924m_R  GTC TCG TGG GCT CGG AGA TGT GTA TAA GAG ACA GAA ACT TGA TCC TCT TTG GTA CTGG   6T_Carubv10023818m_F  TCG TCG GCA GCG TCA GAT GTG TAT AAG AGA CAG TTT CTT TTT CTG AGA TTC CAT TGCT   6T_Carubv10023818m_R  GTC TCG TGG  Prior to PCR, test primers individually and pooled for amplification of target genes. Figure 2a shows amplification products for 11 out of 12 primer pairs, in Fig. 2b [32].

Primer name Sequence
Compare the amplification curves for the different dilutions of the same DNA sample. Choose a sample concentration which is in the mid-to-late    For more detailed information about cherry-picking using a robot please refer to the protocol provided by Gohl and colleagues [32]. a Undiluted sample (blue) is in mid-exponential phase and was used for further steps. 1:10 dilution is in early-exponential phase, the two lowest dilutions show no amplification. b Undiluted sample (blue) is in early-to-mid-exponential phase and was used for further steps. Other dilutions did not amplify. Submit the library to your sequencing facility or follow instructions provided by [32].

Amplicon sequencing
Sequencing can be performed on a NextSeq sequencing platform using a NextSeq Reagent Kit MidOutput (300 cycles). The library is compatible with the standard sequencing primers, final amplicon structure is shown in Fig. 4.

Sequence analysis and estimation of outcrossing rates
Read pairs were associated with amplicons by parsing for the presence of forward and reverse primer at the beginning of reads 1 and 2 respectively using cutadapt version 2.1 [33]. Primer sequences were removed, including preceding T or A nucleotides that were added for heterogeneity (see Table 1). Reads resulting from primers without those T or A nucleotides were cut by one additional nucleotide at the end to get identical read lengths for both primer versions. Only reads corresponding to the expected length were kept: sequencing length-length of gene-specific sequence part of the primer-one nucleotide for T or A. Obtained sequences for read 1 and 2 were combined into one fragment to be treated as haplotype further on. Haplotype occurrences were counted. Haplotypes with a frequency below 1% were excluded from further analyses as they are likely to be a consequence of random sequencing errors; we expect this not to shift Fig. 4 Structure of the final amplicon. Gene-specific sequences are amplified during primary PCR (red), read1 and read 2 sequences (orange) are added to the gene-specific primers. Indices (green) and i5/i7 sequences (blue) are added during indexing PCR haplotype proportions. Haplotypes with a frequency above 1%, but only present in one single sample were also excluded as they are likely to be a consequence of an early PCR amplification error. Roughly half of the fragments were filtered out that way (see Additional file 2: Table S2). The sums of remaining fragments per sample and amplicon were then used as a baseline to calculate (1) the proportions of parental (P1 or P2) haplotypes per sample and amplicon and (2) the proportions of P1 and P2 haplotypes for the polymorphic amplicon 6_Carubv10023818 m. Data analyses were done using R (https ://www.r-proje ct.org). Illustrations were done using the R/lattice package (http://lmdvr .r-forge .r-proje ct.org).

Results
Plants of two Capsella qILs differing in about 12 kb around the CNL1 locus that underlies the loss of benzaldehyde emission in C. rubella [23] were grown at the field site of the Botanical Garden of the University of Potsdam. The arrangement of the plants is shown in Fig. 1. Each block contained two sets of six plants each of both genotypes arranged on opposite sides of a central area with inbred, self-compatible C. grandiflora plants and nearisogenic lines differing for the SAP locus that affects petal size [24]. The plants in this central area served as pollen donors with different background genotypes. Three such blocks were protected from birds and rodents by a bird net, while three such blocks were covered by an insectproof net. We designed 12 primer pairs in exons of highly conserved genes that anneal to invariant nucleotide stretches across a large number of C. grandiflora and C. rubella genotypes. Primers were chosen in exons flanking an intron to maximize the sensitivity for detecting nonmaternal genotypes, as intron sequences are generally more variable than exonic sequences. As shown in Fig. 2, 11 out of the 12 primer pairs successfully amplified and two pools of six and five primer pairs were set up to minimize the number of required PCR reactions.
Genomic DNA was extracted from approximately 300 pooled seeds from the 12 scented and the 12 non-scented qIL plants per block. A previously described qPCR-based approach [32] was used to determine the optimal template concentration for the primary PCRs with the two pools of six and five primer pairs described above. Example results for this test are shown in Fig. 3. In most cases, the undiluted sample was used for further steps.
Barcoding indices were introduced via a second, indexing PCR, resulting in final amplification products with the structure shown in Fig. 4. Products from the two primary PCRs (with the six-and five-primer pair pools, respectively) were combined in equimolar ratios after this indexing PCR. An example of a final library pool as determined by Tape Station electrophoresis is shown in Fig. 5.
Libraries were sequenced in 2 × 150 bp paired-end mode, and the two reads were linked to create 300-bp fragments for analysis. Based on these, we defined major haplotypes as those present with at least 1% frequency in more than one sample. This excludes low-frequency sequencing errors, but also PCR-errors from early cycles, as these should be unique to individual samples. After such filtering, around 50% of fragments remained for analysis, except for sample 26, where only 35% of  Table S2). We assigned these remaining fragments to either parental or non-parental haplotypes by comparison to the results from the leaf samples of the qILs processed in parallel. Across all amplicons, the frequency of non-parental reads was consistently higher in the samples from plants grown under the bird-protection nets only, i.e. accessible to insects, than from those grown under insect-proof nets (Fig. 6); in fact, in the latter no non-parental haplotypes could be detected for eight of the amplicons in five out of the six samples. In the samples from the bird nets, the frequency of non-parental haplotypes reached between 10 and 20% for eight of the amplicons, with very consistent estimates across the individual samples. For three of the amplicons, the frequency of non-parental haplotypes was below 10% in the bird-net samples, again with consistent estimates across the individual samples. We ascribe this difference between the two groups of amplicons to haplotype sharing with the other plants in the blocks that served as pollen donors for the outbred seeds, with at least some of the genotypes sharing the parental haplotypes with our lines at the three amplicons in question, thus rendering many outcrossing events undetectable. Given the simple genetic structure of the pollen-donor populations (two NILs and one inbred C. grandiflora-like line), the extent of haplotype sharing is most likely the same for the eight amplicons with the higher estimates. Thus, the average values across these eight amplicons (Fig. 7) represent the basis of our best estimate of the outcrossing rate in our samples; in particular, given the diploid nature of the plants, the frequency of non-parental haplotypes has to be multiplied by two to obtain an estimate for the fraction of outcrossed seeds, i.e. the outcrossing rate. These values are shown in Table 3. While  Table 1. Paired results for each of the three replicated blocks under the bird nets ('bird') and insect nets ('insect') are shown, with replicates numbered 1 to 3. 'G' and 'R' indicate samples homozygous for the C. grandiflora allele or the C. rubella allele in the CNL1 region, respectively. For amplicon 6, only those haplotypes were counted as non-parental that were distinct from those found in either of the parental lines. The results of statistical comparisons between the two genotypes under one type of net and between all samples under bird versus under insect nets are given in Additional file 3: Table S3 these values were higher for the samples from plants with the C. grandiflora CNL1 haplotype than with the C. rubella haplotype for two of the replicates under the bird net, this was reversed in the third replicate. Thus, overall there was no consistent difference in the estimated outcrossing rate between benzaldehyde-emitting and nonemitting plants in this one trial.
In summary, the strong and consistent difference in apparent outcrossing frequencies between samples under the two types of nets strongly supports the validity of our analysis method and indicates a surprisingly high rate of insect-mediated outcrossing in these self-compatible Capsella genotypes.
We also found that the two parental lines differed at one of the amplicons (number 6, locus Carubv10023818 m), indicating that their genomic background was not fully isogenic. In principle, this difference could allow estimating outcrossing between the two parental lines. When considering only the two parental haplotypes, the plants with the C. rubella haplotype in the CNL1 region appeared to have received more pollen with the alternative haplotype at amplicon 6 (i.e. from the plants with the C. grandiflora haplotype at CNL1) than vice versa under the bird nets (Fig. 8). While this difference could suggest asymmetric pollen flow between the two lines, it could also be due to differential haplotype sharing with the other plants in the plots, in particular, if the plants carrying the C. grandiflora CNL1 haplotype shared their amplicon-6 haplotype with more of the other pollendonor plants. To circumvent this issue of differential haplotype-sharing, only haplotypes found in neither of the two parental lines were counted for the analysis at amplicon 6 shown in Fig. 6.

Discussion
In this study, we have described a method for estimating outcrossing rates that lends itself to the analysis of many samples with high throughput and have validated the method using a common-garden experiment comparing the outcrossing rate between open-pollinated and insect-excluded plants. Our results clearly show that the method successfully detects insect-mediated outcrossing events and provides consistent estimates of outcrossing rates across replicated samples. While the analysis approach presented assumes that the maternal genotypes at the tested amplicons are known, the method can easily be adapted to the case when these are not, by preparing a parallel set of amplicon-sequencing libraries from genomic DNA of the mother plants to be analyzed.
The described method offers a major advantage regarding the time and effort required to estimate the outcrossing rate for many samples. For example, obtaining the estimates for the twelve samples in our study using the classical method would have required running more than 10,000 individual PCR reactions and analyzing the products by electrophoresis (assuming 100 progeny seeds were genotyped per sample). At the same time, the  3,4,5,8,9,11,12 are plotted for the twelve samples. Statistical analysis by paired t test between the two genotypes under bird nets or between the two genotypes under insect nets did not detect any significant difference (p > 0.05 in both cases). By contrast, the difference between all samples under bird nets versus all samples under insect nets was highly significant at p < 0.001 based on a Welch two-sample t-test For example, when demanding on average a 10,000-fold coverage for each of ten amplicons per sample, hundreds of samples could be analysed in parallel using a single NextSeq mid-output run. In principle, this would allow very fine-scale descriptions of how the outcrossing rate differs across a population to determine the effect of environmental influences or trait variation. Compared with the single progeny-based approach these advantages concerning throughput come at a cost regarding the up-front investment in primer design and the precision of the estimates of outcrossing rates. As for primer design, one obvious technical source of error is differential amplification efficiency of different haplotypes due to mismatches in the primer-binding sites. For individual-based measurements, only very large differences in amplification efficiencies will cause an error, causing, for example, certain heterozygotes to be called as homozygotes for the more efficiently amplifying allele. By contrast, even slight amplification biases for different haplotypes can cause substantial error in the estimated outcrossing rates using the method described here. To circumvent this issue, some up-front sequencing of the haplotypes in the population in question may be necessary to identify highly conserved primer-binding sites. Following the strategy taken here, a cost-effective method for doing so would be to sequence PCR amplification products from highly conserved genes containing one or more introns from many pooled individuals in the population. Analysis of these sequences should allow identifying invariant primer-binding sites flanking suitably polymorphic regions or adapting the primer design by incorporating polymorphic bases into the primers, if no fully invariant regions can be found. The strategy for primer design is outlined in Additional file 4: Figure S1.
As for the precision of the estimates, the present method will necessarily underestimate the true outcrossing rates, and it will do so more strongly than the individual-based method in most cases. Concerning each single amplicon, the true outcrossing rate will be underestimated by the combined frequency of the one or two maternal alleles in the pollen population, as any outcrossing event involving a pollen-carrying one of the maternal alleles will be undetectable when considering a single amplicon. The individual-based method is better able to deal with this complication than the poolbased approach. This is because for the individual-based method a single marker with a non-maternal allele or haplotype is enough to classify an individual as resulting from outbreeding. By contrast, this information is necessarily lost in our pool-based approach; in a hypothetical example, if there were ten such outbred individuals, each with the diagnostic non-maternal haplotype in a different amplicon, these would all be detectable in the individual-based approach, but would only be counted as a single outbreeding event in the pool-based approach. Such a scenario of maternal haplotype-sharing at many of the amplicons will result for example from bi-parental inbreeding [13,14].
The above bias means that the estimate closest to the true outcrossing rate will be obtained from the amplicon for which the combined frequency of the Fig. 8 Frequency of the two alternative parental haplotypes in reads for amplicon 6 (Carubv10023818 m). Frequencies of the two alternative parental haplotypes at locus Carubv10023818 m (termed 'P1 (G)' and 'P2 (R)') is plotted for the samples carrying the C. grandiflora allele (CNL1_G) or the C. rubella allele (CNL1_R) in the CNL1 region, respectively. Samples are separated according to the protection net they were under. 'none' indicates samples from selfed parental plants grown in the absence of animal pollinators. Note that only the two parental haplotypes were considered for this analysis. Statistical analysis by paired t-test between the two genotypes under bird nets or between the two genotypes under insect nets did not detect any significant difference. Similarly, comparison of all samples under bird nets with all samples under insect nets by a Welch two-sample t-test did not find a significant difference