Development of a universal and simplified ddRAD library preparation approach for SNP discovery and genotyping in angiosperm plants
© The Author(s) 2016
Received: 17 May 2016
Accepted: 26 July 2016
Published: 4 August 2016
The double digest restriction-site associated DNA sequencing technology (ddRAD-seq) is a reduced representation sequencing technology by sampling genome-wide enzyme loci developed on the basis of next-generation sequencing. ddRAD-seq has been widely applied to SNP marker development and genotyping on animals, especially on marine animals as the original ddRAD protocol is mainly built and trained based on animal data. However, wide application of ddRAD-seq technology in plant species has not been achieved so far. Here, we aim to develop an optimized ddRAD library preparation protocol be accessible to most angiosperm plant species without much startup pre-experiment and costs.
We first tested several combinations of enzymes by in silico analysis of 23 plant species covering 17 families of angiosperm and 1 family of bryophyta and found AvaII + MspI enzyme pair produced consistently higher number of fragments in a broad range of plant species. Then we removed two purifying and one quantifying steps of the original protocol, replaced expensive consumables and apparatuses by conventional experimental apparatuses. Besides, we shortened P1 adapter from 37 to 25 bp and designed a new barcode-adapter system containing 20 pairs of barcodes of varying length. This is an optimized ddRAD strategy for angiosperm plants that is economical, time-saving and requires little technical expertise or investment in laboratory equipment. We refer to this simplified protocol as MiddRAD and we demonstrated the utility and flexibility of our approach by resolving phylogenetic relationships of two genera of woody bamboos (Dendrocalamus and Phyllostachys). Overall our results provide empirical evidence for using this method on different model and non-model plants to produce consistent data.
As MiddRAD adopts an enzyme pair that works for a broad range of angiosperm plants, simplifies library constructing procedure and requires less DNA input, it will greatly facilitate designing a ddRAD project. Our optimization of this method may make ddRAD be widely used in fields of plant population genetics, phylogenetics, phylogeography and molecular breeding.
KeywordsRAD-seq ddRAD MiddRAD Genotype-by-sequencing Next-generation sequencing
Restriction-site associated DNA sequencing technology (RAD-seq) is a reduced representation sequencing technology by sampling genome-wide single enzyme loci developed on the basis of next-generation sequencing [1, 2]. The technology breaks genome into a certain size of DNA fragments by employing a restriction endonuclease (usually a low-frequency cutter) combined with the ultrasonic shearing method, then the fragmented DNA is enriched for constructing a sequencing library so that sequences beside the cleavage site can be acquired for high-throughput sequencing . Because RAD-tags are DNA fragments beside a specific restriction site from the whole genome, so they can generally reflect the sequence characteristics of the entire genome. It is now possible to obtain hundreds to thousands of single nucleotide polymorphism (SNP) markers within a species or between closely related species through RAD-seq. Until now RAD-seq has been successfully applied to SNP marker development, high-density genetic map construction, QTL mapping, population genetics and phylogenetic research on eggplants, chickpeas, sesames, soybeans, cucurbit bottle gourds, bamboos, beetles, and other organisms [4–13]. But on one aspect experimental procedure of this technology is much complex and it requires a Covaris ultrasonicator and some other specialized instruments, so personnels under professional training are usually required to master the technique; on the other hand, random physical shearing methods implemented in the library construction process will result in losing lots of DNA, thereby leading to out control of the final tag number [3, 14]. So several laboratories have improved and simplified the traditional RAD-seq method, from which a variety of low cost, high throughput reduced representation sequencing methods are available. At present, reduced representation sequencing methods developed from the RAD-seq mainly includes GBS series techniques and RAD series techniques . GBS and RAD-seq techniques share several basic steps while differ only in the order or details of enzyme digestion, adapter ligation, barcoding and size selection. Each alternative RAD method has both advantages and drawbacks. RAD series control number of the tags by both choosing the enzyme and size selection while GBS series techniques or close derivatives control the number of tags only by selecting different enzymes (though some GBS users may also add a size selection step in their modified GBS protocol , the original intention of GBS is to reduce library preparation workflow without size selection). GBS series techniques include single and double enzyme GBS [16, 17], both of which employ simple library constructing processes, but they can only enrich small fragments less than ~350 bp . It’s easy to sequence through the short fragment with pair-end sequencing mode as the sequencing length is gradually becoming longer, which will result in a waste of data and the potential to discover more SNPs. Furthermore, fragments of various lengths will increase the potential for amplification bias [19, 20] and cause a decline in the data quantity and data quality. RAD series mainly includes 2b-RAD , ddRAD  and ezRAD . 2b-RAD adopts a kind of type II restriction endonuclease to digest the genome, producing only ~33 bp fragments, which lack of biases due to fragment size selection but may restrict the potential for discovering more SNPs. ezRAD is the only protocol that relies on illumina authoritative kits to construct the library with customer support but the cost is still not as low as the author claimed . ddRAD can tune fragments number by employing two different enzymes and size selection, and the process of constructing a library is quite simple while genomic DNA it requires is of the highest quality in all the RAD methods .
All RAD protocols have been proved to be powerful tools for SNP discovery and genotyping of model and non-model species. However, startup of them all usually involves pre-experiment of (1) testing candidate endonuclease that could produce a suitable RAD or GBS library , and (2) purchasing some relatively expensive consumables and apparatus (e.g. Agilent 2100 Bioanalyzer). This requires a significant initial investment for labs focused on traditional genotyping methods (e.g. SSR genotyping). Besides, many labs (e.g. phylogenetic bio-labs) are probably focusing on different model or non-model plant species, once a pair of enzymes selected and adapters purchased for one target species, they have to consider if these consumables could be applied to another species efficiently to be studied even some commonly used enzyme pairs could produce hundreds to thousands of markers across wide-range species. Enzyme pairs simulation of the original ddRAD protocol is mainly based on animal genomes and it is hard for us to know if performance of the enzyme pairs is good as well in plant species as Santiago et al. found that a given restriction enzyme may have strikingly variable recognition-sequence frequencies among broad eukaryotic taxonomic groups, and only phylogenetic related species could produce similar recognition-sequence frequencies . In another study, Burford et al. found some enzyme pairs work more consistently than others across a wide range of taxonomic groups after optimizing ddRAD protocol and testing several restriction enzyme pairs for five genera of insects and fish . Here, we sought to test the universality of several commonly used enzyme pairs across most angiosperm plants, simplify the ddRAD protocol and reduce the overall costs. Our protocol is generally according to the protocol described by Peterson et al. , but with some modifications as we first tested several combinations of enzymes by in silico analysis of 23 plant species covering 17 families of angiosperm (16 orders, two classes) and one family of bryophyta (one orders, one class) and found AvaII + MspI enzyme pair produced consistently higher number of fragments in a broad range of plant species. Furthermore, we removed two purifying and one quantifying steps, shortened the adapters and replaced expensive instruments by conventional experimental apparatuses which make it possible to do ddRAD sequencing with no additional investment beyond the cost of library preparation and sequencing itself.
To assess the performance of this approach, we got empirical results from the model species Oryza sativa L. japonica and Zea mays L. We also explored repeatability by testing the effectiveness of the method in non-model species Phyllostachys edulis and Alloteropsis semialata (R. Br.) Hitchc. Finally, we managed to reconstruct phylogenetic relationships of two woody bamboos genera, Dendrocalamus and Phyllostachys with data generated by the protocol. This generalized approach, using the fixed enzyme pair and standard library preparation protocol, will allow researchers to apply ddRAD-seq technology to a wide array of plants and research questions. We expect that this optimized protocol could be efficiently implemented in any small or middle-sized laboratory with few people and limited funds.
Plant material and DNA samples
In this project, we used Oryza sativa L. spp. japonica and Z. mays L. to estimate the robustness of our Protocol B. Besides, a total of six species of Poaceae including four temperate woody bamboo species (Chimonocalamus pallens, Phyllostachys edulis, Phyllostachys rubicunda T. H. Wen and Phyllostachys vivax McClure), one tropical woody bamboo species (Dendrocalamus latiflorus) and one grass species Alloteropsis semialata (R. Br.) Hitchc. were used in our protocols as well. Leaves of temperate woody bamboos were mostly collected from plants grown in Kunming Botanical Garden (N25°07′04.9″, E102°44′15.2″) and leaves of tropical woody bamboos, O. sativa, Z. mays and A. semialata were collected from plants grown in our greenhouses. All necessary permits were obtained before collecting the material. Fresh leaves of all species were obtained and then dried rapidly in silica gel. The DNA was extracted with a modified CTAB method .
Choosing restriction enzymes and adapter design
At first, we selected six kinds of enzyme pairs that could recognize restriction sites of different lengths including eight bases + six bases (SbfI + EcoRI), eight bases + four bases (SbfI + MluCI), six bases + four bases (EcoRI + MspI, PstI + MspI), 4.5 bases + four bases (AvaII + MspI), four bases + four bases (NlaIII + MluCI), of which EcoRI + MspI was adopted by the original ddRAD protocol and PstI + MspI was used by the two-enzyme of GBS protocol [17, 22]. Restriction enzymes included in this study are listed in Additional file 2: Table S2. Then we in silico digested genome sequences of 23 plant species covering 17 families of angiosperm (16 orders, two classes) and one family of bryophyte (one orders, one class) of different genome size with RestrictionDigest . For each enzyme pair, we recorded the total number of fragments and the number of fragments between 400–700 bp that could produce in each species. The species adopted for analysis are listed in Additional file 2: Table S1. Genome scaffolds of these species were downloaded from Plantgdb . Then the distribution of DNA fragments was screened by agarose gel electrophoresis after digestion of genomic DNA of some species.
Chemosynthetic oligonucleotides of P1 and P2 adapters will account for almost half of the cost due to the need for high-performance liquid chromatography (HPLC) purification and 5′-end phosphorylation. In our protocols, original P1 adapters are shortened from 37 to 25 bp (barcode length is assumed to be 5 bp) to reduce the cost of the synthesizing DNA oligos. Besides, a different barcode-adapter system containing 20 pairs of barcodes varying in length was devised, which can be used with integer times (20 * n), rather than the original 48 kinds of barcodes with equal length (see Additional file 1). This will not only increase the flexibility of barcodes for projects with diverse samples but also improve the quality of bases near the restriction site.
Protocols of MiddRAD for next-generation sequencing
Then we adopted protocol B to construct libraries for three bamboo species (contains two D. latiflorus individuals, one P. rubicunda individual and one P. vivax individual) to explore the applicability of MiddRADseq-derived genotypes/markers in resolving phylogenetic problems. The library constructing process is according to Protocol B and fragments selected were set to 600–700 bp. Reagents and enzymes used were mainly purchased from New England Biolabs Inc. (R0153S, R0106S), Vazyme Biotech Co., Ltd. (C301-01) and SunShineBio Co., Ltd. (SN124). Libraries were then sequenced in Cloud Health Genomics Ltd. Sequencing platform was Illumina HiSeq X Ten (San Diego, CA, USA) with sequence length PE150 bp.
To evaluate the shortened adapters and redesigned barcodes, we constructed four MiddRAD sub-libraries according to protocol B for 40 offsprings of a D. latiflorus F1 population and sequenced the final library with a single illumina HiSeq X ten lane (PE150 bp). The coefficient of variation (CV = standard deviation/mean) of data generated by each barcode and each sub-library were analyzed to evaluate the newly designed barcodes, indexes and shortened P1 adapters.
Sequence quality analysis, SNP calling and genotyping
Raw reads were demultiplexed by process_radtags program in Stacks software version 1.24 [30, 31]. Average sequence quality per read and GC–content were checked using FastQC version 0.11.3 . Adapter reads were searched by Cutadapt 1.9.1 . Reads containing correct restriction sites in read1 and read2 were obtained by searching restriction sites sequences in the raw reads respectively. Clean data were produced by removing the adapter reads and reads with ambiguous or low quality (below a Phred score of Q10) bases. To determine the mapping ratio of sampled reads to the genome, clean reads of O. sativa, Z. mays and A. semialata were mapped onto the rice, maize and sorghum genome scaffolds, CDS-DNA and repeats region respectively, while clean reads of temperate bamboo individuals onto the P. edulis reference genome, CDS-DNA region and repeats region and reads of tropical bamboo individuals onto the D. latiflorus survey genome (Zhenhua Guo et al. unpublished data) with bowtie . Rice, maize and sorghum genome scaffolds, CDS-DNA and repeats region were downloaded from Plantgdb  while P. edulis reference genome, CDS region, and repeats region were downloaded from BambooGDB . To obtain the number of tags, clean reads (we only used read1 for analysis) of all individuals were first trimmed 140 bp (when read length is PE150) and clustered with ustacks/pstacks program, then the reducing efficiency was determined by calculating the percentage of total tag length in total nuclear genome length.
To estimate the performance of MiddRAD protocol, tags of rice and maize produced by empirical sequencing results were compared with those predicted from in silico analysis to show how actual data meet the in silico expectations.
In order to identify SNP markers and genotypes for inferring phylogeny of three woody bamboo species, the Stacks software pipeline was implemented for the processing of Illumina sequence read data and screening SNPs that are fixed-within a species while vary among different species. Sequence trimming was first performed using process_radtags program to remove adapter reads and reads with bases below a Phred score of Q10 within a 15 bases sliding window. Clean sequences were truncated to a final length of 140 base pairs (excluding the barcode but containing enzyme recognition site) prior to clustering. For each sample, the ustacks program was used to merge short-read sequences into tags/loci using removal algorithm and deleveraging algorithm (−m10, −M3). Then a catalog was built from all samples by the cstacks program (−n5). Tags from each sample were matched against the catalog to determine alleles with sstacks program and the populations program was used to output SNPs in Phylip format. The minimum number of taxa required for an informative unrooted phylogenetic tree is three. The major parameters m (minimum number of identical reads required to form a stack), M (maximum number of nucleotides mismatches allowed between stacks before fusing stacks into a locus) and n (number of mismatches allowed between loci in the catalog) were tuned to get the matrix with a variable number of SNPs. Furthermore, to validate the genotyping accuracy, we presented linkage map results of one D. latiflorus F1 mapping population (Guoqian Yang et al., unpublished data) according to MiddRAD protocol and 55 genotypes (eight markers, seven individuals and one genotype was missing during the SNP calling pipeline) were randomly selected and verified by independent Sanger sequencing.
Phylogenetic tree construction of three woody bamboo species
We inferred ML phylogenies for each data matrix using RAxML version 8.0.0 . ML searches were conducted in RAxML with the GTRGAMMAI model for sequence data, and a rapid bootstrapping analysis with 100 bootstrap replicates was conducted. Phylogenetic accuracy was determined by comparing inferred trees with published reference phylogenies. The reference phylogeny for woody bamboos from [38, 39], was estimated using parsimony analyses (MP) and Bayesian inference (BI).
Choosing universal restriction enzyme pairs across angiosperm plants
Total number of fragments produced by in silico digestion of 23 species
SbfI + EcoRI
SbfI + MlucI
NlaIII + MlucI
AvaII + MspI
EcoRI + MspI
PstI + MspI
Number of fragments between 400–700 bp produced by in silico digestion of 23 species
SbfI + EcoRI
SbfI + MlucI
NlaIII + MlucI
AvaII + MspI
EcoRI + MspI
PstI + MspI
A comprehensive evaluation of the library quality and data quality
The performance of our protocols was evaluated from both the experimental results and data analysis results. From the experimental perspective, library concentration should meet the criteria for sequencing and fragments selected should be in the expected range. From data analysis perspective, the library should produce sufficient high-quality data for downstream analysis.
We first quantified concentration of the libraries and screened fragments distribution to evaluate the quality of protocol A and protocol B. Concentration of library A (constructed according to protocol A) was between 5–9 ng/ul, while concentration of library B (constructed according to protocol B) was between 20–30 ng/ul, both of which could meet the requirements for Illumina sequencing. Fragments distribution of library A and library B screened by the agarose gel electrophoresis is well within the expected range (Additional file 2: Figure S3a). Fragments distribution results for library B had been further confirmed by the Agilent 2100 Bioanalyzer (Additional file 2: Figure S3b) while library A got no peaks from Agilent 2100 because its concentration is lower than 10 ng/ul. Therefore, we believe that fragments distribution can be determined by using agarose gel electrophoresis instead of the highly sensitive but expensive Agilent 2100 Bioanalyzer. Both protocols could produce libraries that can be sequenced on the Illumina sequencing platform.
Summary of alignment statistics of sequencing data
Per Bases Quality Score is a major index of the sequence quality, the higher the Quality Score, the lower probability of sequencing error occurs. Q20 and Q30 represent the sequencing error probability of 1 and 0.1 %. Illumina sequencing was found to favor the more GC-balanced regions, leading to few or no reads from the many GC-poor regions and GC bias can be introduced at several processes of Illumina sequencing, e.g. PCR amplification of the library, cluster amplification, and the sequencing step . So if GC-content is around 50 %, we can conclude no bias exists in library preparation and sequencing process. Adapter reads ratio is the percentage of reads with adapters in raw reads and is an indicator of data quality. Adapter reads should be removed in the subsequent analysis. Through percentage of reads containing correct restriction sites, we can determine whether the enzyme digestion reaction works in the right way. Comprehensive analysis of data quality distribution, GC-content, adapter reads ratio and correct restriction sites ratio showed that both MiddRAD protocols could produce high-quality data (We did not compare our data with original ddRAD data as the original ddRAD protocol did not supply their raw data, but the quality of our data is self-explaining). However, protocol A produced too many (nearly half of raw reads) reads with adapters which indicated many short fragments may exist in the selected gel, so we did not continue testing protocol A in model plants and took protocol B as the final protocol of MiddRAD.
Comparison of empirical and simulated data and inference tags origin from the genome
MiddRAD-seq data summary in rice and maize
Genome size (Mb)
% of repeats in genome
GC content (%)
AvaII + MspI
AvaII + MspI
Expected RAD tag size range (bp)
Expected no. of RAD tags
Tags density per 100 kb
% of tags in CDS
Observed no. of tags
Tag average depth
Tags per 100 kb
Simplification ratio (%)
% of tags in CDS
% of tags in repeats
Evaluation of protocol B on more plant species and genotypes validation
To test the universality of our protocol and the restriction enzymes on more plant species, we used Protocol B to construct libraries for P. edulis and A. semialata which represent two subfamilies of Gramineae (Bambusoideae and Panicoideae). We first inspected library quality. The ultimate library concentration was between 20–30 ng/ul, and fragments distribution was well within the expected range when screened by the agarose gel electrophoresis. Both libraries met the qualification for sequencing.
A comprehensive data analysis of P. edulis and A. semialata
Raw reads no.
Percentage of adapter reads (%)
Percentage of reads with correct restriction sites (%)
GC content (%)
Clean reads no.
Average tag depth
Next, we mapped clean reads of P. edulis onto the P. edulis genome scaffolds, CDS-DNA, and repeats region and clean reads of A. semialata onto the sorghum genome scaffolds, CDS-DNA, and repeats region respectively. For P. edulis, overall scaffolds mapping rate was 79.93–83.80 %, reads hits to the CDS-DNA accounted 2.38–2.87 % (Table 3). Yet reads localization on the repeats region accounted for 0.11–0.17 %. For A. semialata, the overall scaffolds alignment rate was 3.37–3.71 %, reads hits to the CDS-DNA accounted 1.08–1.11 %. Yet reads localization on the repeats region accounted for 0.00–0.22 %. Overall scaffolds alignment rate is relatively low for A. semialata is because of the sequence differences between A. semialata and Sorghum bicolor.
At last, clean data of both individuals were clustered into tags. The number of tags obtained from P. edulis was 128,803 with an average depth 30.18X which correlated well with the expectation. While A. semialata got 98,869 tags, with an average depth of 96.18X which was not within expectation as the sorghum genome was used as the reference. Since P. edulis has an average genome size of ~2000 Mb and A. semialata has an average genome size of ~600 Mb (personal communications with Ms. Yang Yang), the tags we got accounted for 1.80 % of the P. edulis nuclear genome and 4.61 % of the A. semialata nuclear genome.
To verify how genotyping accuracy is maintained in the MiddRAD protocol, we have recently constructed a linkage map of D. latiflorus (Guoqian Yang et al., unpublished data) according to MiddRAD protocol and got a high-quality map of 2365 markers with an average map distance 1.56 cM. The 36 linkage groups generated were corresponding to the 36 haploid chromosomes of D. latiflorus and 52 of the 55 selected genotypes (94.55 %) were agreed with independent Sanger sequencing results (Additional file 2: Table S3). We believe that genotypes from MiddRAD-seq derived data should be of high genotyping accuracy as the fundamental of constructing a high-quality linkage map with tight map distances are the correct genotypes of most markers/loci [22, 45].
Evaluation of shortened adapters and new barcodes
Phylogenetic tree construction of three bamboo species
In this study, we tested the universality of several commonly used enzyme pairs across the angiosperm plants, simplified the ddRAD protocol and reduced the overall costs. MiddRAD library construction protocol has optimized the following areas compared with original ddRAD protocol. (1) MiddRAD protocol tests the universality of several commonly used enzyme pairs and three pairs of restriction endonucleases are maybe universal in digesting plant genomic DNA. We recommend AvaII + MspI > EcoRI +MspI > PstI +MspI when designing plant ddRAD projects; (2) In MiddRAD protocol several expensive consumables and apparatuses are replaced by conventional experimental apparatuses, for example, the magnetic beads purification method is replaced by a simple column purification to get rid of the dependence on the magnet, DNA fragments are selected by cutting low melting point agarose gel rather than the automatically select device Pippin-Prep, and low melting point agarose gel electrophoresis is used to screen fragments distribution instead of an expensive Agilent 2100 Bioanalyzer; (3) MiddRAD removes 3 steps in ddRAD protocol, namely purifying the enzyme-digested products, quantifying the DNA concentration before ligation and purifying ligation products after pooling samples, all of which simplify the process of constructing a library; (4) In MiddRAD protocol original P1 adapters are shortened from 37 to 25 bp (barcode is assumed to be of 5 bp length), which will reduce the cost of the synthesizing adapter oligos partially; (5) In MiddRAD a new barcode-adapter system containing 20 pairs of barcodes varying in length were devised, which can be used with integer times (20 * n), rather than the original 48 kinds of barcode with equal length. This will not only reduce the cost of synthesizing DNA oligos but also increase the flexibility for projects with different samples and help improve the quality of bases near the restriction site. The comparison of MiddRAD with most commonly used RAD and GBS sequencing methodologies and associated costs are listed in Additional file 2: Table S4.
Since the thorough library construction process minimizes the purification times, random DNA loss is greatly reduced. The highly simplified process allows library preparation be accomplished with as low as 50 ng genomic DNA. Meanwhile, we found that reducing the two purification steps did not reduce the quality of the data by sequence quality analysis. Through data analysis of 40 individuals from a single lane, omitting the step quantifying DNA concentration of each individual before pooling samples does have influence on the amount of data among each individual (CV = 0.2146) but in our experience even if quantifying DNA concentration of each individual, pooling equal quantity DNA of each individual is still impossible as different volumes of liquid may adhere to the tips when using the pipette. As adequate data (>6 M reads) could be generated for each individual, we suggest deleting this quantifying step as the GBS protocol does (CV = 0.23) . The redesigned adapters and variable length barcodes have high recognition efficiency on various individuals and could produce high-quality data which is similar to the results Burford et al. got . Some people may worry that the combination of restriction endonucleases may easily cut the repeats region with high GC-content as is shown in maize about 10 % of reads fall into repeats region. Nevertheless, we still have enough reads left for analysis. Researchers may strictly follow the protocol without re-selecting novel combination of enzymes (but have to adjust the size-selection range) if they do not want to invest too much on pilot experiments. As synthetic adapters can be used in diverse plant species and transferred across labs, our protocol will greatly reduce the overall costs.
A possible drawback of our method is that degraded DNA will not produce adequate data because once one of the enzyme sites was impaired the whole tag will be lost. Nonetheless, as long as the DNA provided shows a clear major band when detected by the agarose gel electrophoresis, it could usually generate sufficient amount of data for analysis. In addition, the final library may be a pool of tens to a hundred of samples and we only designed 20 kinds of barcodes, so it is inevitable to cut gel several times when performing the procedure. In order to maintain the consistency of the selected fragments, electrophoresis conditions must be strictly controlled, and practice cutting the gel is needed before the formal experiment begins. Besides, electrophoresis time should be long enough (1–2 h) to prevent that size selection maybe ‘leaky’.
To demonstrate the applicability of MiddRADseq-derived markers in no-model species, we used MiddRAD data to resolve phylogenetic relationships of two woody bamboos genera, Dendrocalamus and Phyllostachys. Dendrocalamus is a tropical woody bamboo genus while Phyllostachys belongs to temperate woody bamboos. Our tree is congruent well with the current taxonomy. In comparison to previous studies in this clade, which used chloroplast regions  or nuclear DNA regions , the ddRAD data set is prominent for its simpleness in getting an amount of data (over 200 loci in the smallest data set). Though RAD-seq has been demonstrated to be feasible in clades as old as 40–60 million years with simulated RAD tags of Drosophila  and bona fide sequence of American oak , RAD sequences are usually considered useful for phylogenetic reconstruction in younger clades in which sufficient numbers of orthologous restriction sites are retained across species . However, RAD-seq is now receiving increased attention at deeper evolutionary time scales, such as genus- or family-level phylogenetics even the problem of efficiently obtaining sequence data across many individuals exists . Our study demonstrates the utility of MiddRAD data for reconstructing phylogenetic relationships in a group that spans 43–47 million-year-old divergences . What we should bear in mind is that the performance of RAD or MiddRAD depends in part on the level of divergence between species. Determining orthologous RAD tags between samples should also be taken carefully in the future phylogenetic analysis with RAD sequencing . The data set and analyses we provide here are a novel step forward in the use of ddRAD data to address questions in woody bamboo phylogenetic reconstruction. We show that it is possible to assemble genome-wide RAD-tags into phylogenetic matrices without the use of a reference genome.
In this study, we first tested the universality of several commonly used enzyme pairs across 23 plant species and found AvaII + MspI enzyme pair produced a consistently higher number of fragments in a broad range of angiosperm plant species. Then we simplified the ddRAD protocol and designed a new barcode-adapter system that could reduce the overall costs. At last, we demonstrated the use of MiddRAD-seq data in resolving phylogenetic relationships of two woody bamboos genera. This protocol could help botanist quickly get ideal experimental data at a relatively low cost and without being specially trained. We expect that the protocol could be implemented efficiently nearly in any ordinary molecular laboratory without relying on large sequencing centers or next-generation sequencing companies.
restriction-site associated DNA sequencing
double digest restriction associated DNA
modified double digest restriction associated DNA
IIB restriction endonucleases restriction-site associated DNA
genotyping by sequencing
forward read of pair-end fastq data
reverse read of pair-end fastq data
- Repeats region:
repetitive sequences in a genome
single nucleotide polymorphism
- Rare-cutter enzyme:
a restriction enzyme with a recognition sequence which occurs only rarely in a genome
- Common-cutter enzyme:
a restriction enzyme with a recognition sequence which occurs frequently in a genome
double-stranded product by annealing two oligos
short DNA sequence for identifying different individuals
ZHG and DZL organized the project. GQY performed the experiments, analyzed the data, and wrote the paper; YMC, CG, GQY and YG harvested leaves of samples and extracted DNA. ZHG, DZL, JPW, LL, LZ and XYW reviewed and edited the manuscript. All authors read and approved the final manuscript.
We are grateful to Ms. Yang Yang (Kunming Institute of Botany, Chinese Academy of Sciences) for kindly providing the Oryza sativa and Alloteropsis semialata material. We would like to thank Ms. Zhao-Li Ding (Kunming Institute of Zoology, Chinese Academy of Sciences) and Ms. Li Zhong (Yunnan University) for advice and supports on Illumina sequencing. We would like to thank Prof. Guo-Fan Zhang (Institute of Oceanology, Chinese Academy of Sciences) for helpful suggestions in this project. We would also like to thank Mr. Hui-Fu Zhuang, Dr. Peng-Fei Ma and Dr. Yu-Xiao Zhang at Kunming Institute of Botany for computational supports.
The authors declare that they have no competing interests.
Availability of data and materials
The datasets supporting the conclusions of this article are available in the NCBI SRA repository, SRP077085 and hyperlink to datasets in http://www.ncbi.nlm.nih.gov/home/submit.shtml.
This project was supported by the National Natural Science Foundation of China (31470322 and 31430011).
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Baird NA, Etter PD, Atwood TS, Currey MC, Shiver AL, Lewis ZA, Selker EU, Cresko WA, Johnson EA. Rapid SNP discovery and genetic mapping using sequenced RAD markers. PLoS One. 2008;3(10):e3376.View ArticlePubMedPubMed CentralGoogle Scholar
- Davey JW, Hohenlohe PA, Etter PD, Boone JQ, Catchen JM, Blaxter ML. Genome-wide genetic marker discovery and genotyping using next-generation sequencing. Nat Rev Genet. 2011;12(7):499–510.View ArticlePubMedGoogle Scholar
- Etter PD, Bassham S, Hohenlohe PA, Johnson EA, Cresko WA. SNP discovery and genotyping for evolutionary genetics using RAD sequencing. In: Etter PD, Bassham S, Hohenlohe PA, Johnson EA, Cresko WA, editors. Molecular methods for evolutionary genetics. Berlin: Springer; 2011. p. 157–78.Google Scholar
- Barchi L, Lanteri S, Portis E, Acquadro A, Valè G, Toppino L, Rotino GL. Identification of SNP and SSR markers in eggplant using RAD tag sequencing. BMC Genomics. 2011;12(1):304.View ArticlePubMedPubMed CentralGoogle Scholar
- Deokar AA, Ramsay L, Sharpe AG, Diapari M, Sindhu A, Bett K, Warkentin TD, Tar’an B. Genome wide SNP identification in chickpea for use in development of a high density genetic map and improvement of chickpea reference genome assembly. BMC Genomics. 2014;15:708.View ArticlePubMedPubMed CentralGoogle Scholar
- Wu K, Liu HY, Yang MM, Tao Y, Ma HH, Wu WX, Zuo Y, Zhao YZ. High-density genetic map construction and QTLs analysis of grain yield-related traits in Sesame (Sesamum indicum L.) based on RAD-Seq techonology. BMC Plant Biol. 2014;14:274.View ArticlePubMedPubMed CentralGoogle Scholar
- Xu P, Xu SZ, Wu XH, Tao Y, Wang BG, Wang S, Qin DH, Lu ZF, Li GJ. Population genomic analyses from low-coverage RAD-Seq data: a case study on the non-model cucurbit bottle gourd. Plant J. 2014;77(3):430–42.View ArticlePubMedGoogle Scholar
- Wang XQ, Zhao L, Eaton DAR, Li DZ, Guo ZH. Identification of SNP markers for inferring phylogeny in temperate bamboos (Poaceae: Bambusoideae) using RAD sequencing. Mol Ecol Resour. 2013;13(5):938–45.View ArticlePubMedGoogle Scholar
- Cruaud A, Gautier M, Galan M, Foucaud J, Sauné L, Genson G, Dubois E, Nidelet S, Deuve T, Rasplus J-Y. Empirical assessment of RAD sequencing for interspecific phylogeny. Mol Biol Evol. 2014;31(5):1272–4.View ArticlePubMedGoogle Scholar
- DaCosta JM, Sorenson MD. ddRAD-seq phylogenetics based on nucleotide, indel, and presence-absence polymorphisms: analyses of two avian genera with contrasting histories. Mol Phylogenet Evol. 2016;94:122–35.View ArticlePubMedGoogle Scholar
- Qi ZC, Yu Y, Liu X, Pais A, Ranney T, Whetten R, Xiang QY. Phylogenomics of polyploid Fothergilla (Hamamelidaceae) by RAD-tag based GBS-insights into species origin and effects of software pipelines. J Syst Evol. 2015;53(5):432–47.View ArticleGoogle Scholar
- Zhang N, Zhang L, Tao Y, Guo L, Sun J, Li X, Zhao N, Peng J, Li X, Zeng L. Construction of a high density SNP linkage map of kelp (Saccharina japonica) by sequencing Taq I site associated DNA and mapping of a sex determining locus. BMC Genomics. 2015;16(1):189.View ArticlePubMedPubMed CentralGoogle Scholar
- Zhou L, Wang SB, Jian J, Geng QC, Wen J, Song Q, Wu Z, Li GJ, Liu YQ, Dunwell JM. Identification of domestication-related loci associated with flowering time and seed size in soybean with the RAD-seq genotyping method. Sci Rep. 2015;5:9350.View ArticlePubMedPubMed CentralGoogle Scholar
- Andrews KR, Good JM, Miller MR, Luikart G, Hohenlohe PA. Harnessing the power of RADseq for ecological and evolutionary genomics. Nat Rev Genet. 2016;17(2):81–92.View ArticlePubMedPubMed CentralGoogle Scholar
- Liu H, Bayer M, Druka A, Russell JR, Hackett CA, Poland J, Ramsay L, Hedley PE, Waugh R. An evaluation of genotyping by sequencing (GBS) to map the Breviaristatum-e (ari-e) locus in cultivated barley. BMC Genomics. 2014;15:104.View ArticlePubMedPubMed CentralGoogle Scholar
- Elshire RJ, Glaubitz JC, Sun Q, Poland JA, Kawamoto K, Buckler ES, Mitchell SE. A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species. PLoS One. 2011;6(5):e19379.View ArticlePubMedPubMed CentralGoogle Scholar
- Poland JA, Brown PJ, Sorrells ME, Jannink JL. Development of high-density genetic maps for barley and wheat using a novel two-enzyme genotyping-by-sequencing approach. PLoS One. 2012;7(2):e32253.View ArticlePubMedPubMed CentralGoogle Scholar
- Poland JA, Rife TW. Genotyping-by-sequencing for plant breeding and genetics. Plant Genome. 2012;5(3):92–102.View ArticleGoogle Scholar
- DaCosta JM, Sorenson MD. Amplification biases and consistent recovery of loci in a double-digest RAD-seq protocol. PLoS One. 2014;9(9):e106713.View ArticlePubMedPubMed CentralGoogle Scholar
- Quail MA, Kozarewa I, Smith F, Scally A, Stephens PJ, Durbin R, Swerdlow H, Turner DJ. A large genome center’s improvements to the Illumina sequencing system. Nat Methods. 2008;5(12):1005–10.View ArticlePubMedPubMed CentralGoogle Scholar
- Wang S, Meyer E, McKay JK, Matz MV. 2b-RAD: a simple and flexible method for genome-wide genotyping. Nat Methods. 2012;9(8):808–10.View ArticlePubMedGoogle Scholar
- Peterson BK, Weber JN, Kay EH, Fisher HS, Hoekstra HE. Double digest RADseq: an inexpensive method for de novo SNP discovery and genotyping in model and non-model species. PLoS One. 2012;7(5):e37135.View ArticlePubMedPubMed CentralGoogle Scholar
- Toonen RJ, Puritz JB, Forsman ZH, Whitney JL, Fernandez-Silva I, Andrews KR, Bird CE. ezRAD: a simplified method for genomic genotyping in non-model organisms. PeerJ. 2013;1:e203.View ArticlePubMedPubMed CentralGoogle Scholar
- Puritz JB, Matz MV, Toonen RJ, Weber JN, Bolnick DI, Bird CE. Demystifying the RAD fad. Mol Ecol. 2014;23(24):5937–42.View ArticlePubMedGoogle Scholar
- Glaubitz JC, Casstevens TM, Lu F, Harriman J, Elshire RJ, Sun Q, Buckler ES. TASSEL-GBS: a high capacity genotyping by sequencing analysis pipeline. PLoS One. 2014;9(2):e90346.View ArticlePubMedPubMed CentralGoogle Scholar
- Herrera S, Reyes-Herrera PH, Shank TM. Predicting RAD-seq Marker numbers across the eukaryotic tree of life. Genome Biol Evol. 2015;7(12):3207–25.View ArticlePubMedPubMed CentralGoogle Scholar
- Burford Reiskind MO, Coyle K, Daniels HV, Labadie P, Reiskind MH, Roberts NB, Roberts RB, Vargo EL, Schaff J. Development of a universal double-digest RAD sequencing approach for a group of non-model, ecologically and economically important insect and fish taxa. Mol Ecol Resour. 2016. doi:10.1111/1755-0998.12527.Google Scholar
- Doyle JJ. A rapid DNA isolation procedure for small quantities of fresh leaf tissue. Phytochem Bull. 1987;19:11–5.Google Scholar
- Wang J, Li L, Qi H, Du X, Zhang G. RestrictionDigest: a powerful Perl module for simulating genomic restriction digests. Electron J Biotechnol. 2016. doi:10.1016/j.ejbt.2016.02.003.Google Scholar
- Catchen J, Hohenlohe PA, Bassham S, Amores A, Cresko WA. Stacks: an analysis tool set for population genomics. Mol Ecol. 2013;22(11):3124–40.View ArticlePubMedPubMed CentralGoogle Scholar
- Catchen JM, Amores A, Hohenlohe P, Cresko W, Postlethwait JH. Stacks: building and genotyping loci de novo from short-read sequences. G3 (Bethesda). 2011;1(3):171–82.View ArticleGoogle Scholar
- Andrews S. Fast QC: a quality control tool for high throughput sequence data. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/. Accessed 20 Dec 2015.
- Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J. 2011;17(1):10–2. doi:10.14806/ej.17.1.200.View ArticleGoogle Scholar
- Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10(3):R25.View ArticlePubMedPubMed CentralGoogle Scholar
- Plantgdb. http://www.plantgdb.org/. Accessed 28 Sept 2015.
- BambooGDB. http://www.bamboogdb.org/page/download.jsp. Accessed 28 Sept 2015.
- Stamatakis A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics. 2014;30(9):1312–3.View ArticlePubMedPubMed CentralGoogle Scholar
- Triplett JK, Clark LG, Fisher AE, Wen J. Independent allopolyploidization events preceded speciation in the temperate and tropical woody bamboos. New Phytol. 2014;204(1):66–73.View ArticlePubMedGoogle Scholar
- Sungkaew S, Stapleton CM, Salamin N, Hodkinson TR. Non-monophyly of the woody bamboos (Bambuseae; Poaceae): a multi-gene region phylogenetic analysis of Bambusoideae s.s. J Plant Res. 2009;122(1):95–108.View ArticlePubMedGoogle Scholar
- Plant DNA C-values database. http://www.data.kew.org/cvalues/. Accessed 25 Dec 2015.
- Chen YC, Liu T, Yu CH, Chiang TY, Hwang CC. Effects of GC bias in next-generation-sequencing data on de novo genome assembly. PLoS One. 2013;8(4):e62856.View ArticlePubMedPubMed CentralGoogle Scholar
- Sun XW, Liu DY, Zhang XF, Li WB, Liu H, Hong WG, Jiang CB, Guan N, Ma CX, Zeng HP, et al. SLAF-seq: an efficient method of large-scale de novo SNP discovery and genotyping using high-throughput sequencing. PLoS One. 2013;8(3):e58700.View ArticlePubMedPubMed CentralGoogle Scholar
- International Rice Genome Sequencing Project. The map-based sequence of the rice genome. Nature. 2005;436(7052):793–800.View ArticleGoogle Scholar
- Schnable PS, Ware D, Fulton RS, Stein JC, Wei F, Pasternak S, Liang C, Zhang J, Fulton L, Graves TA, et al. The B73 maize genome: complexity, diversity, and dynamics. Science. 2009;326(5956):1112–5.View ArticlePubMedGoogle Scholar
- Cartwright DA, Troggio M, Velasco R, Gutin A. Genetic mapping in the presence of genotyping errors. Genetics. 2007;176(4):2521–7.View ArticlePubMedPubMed CentralGoogle Scholar
- Rubin BER, Ree RH, Moreau CS. Inferring phylogenies from RAD sequence data. PLoS One. 2012;7(4):e33394.View ArticlePubMedPubMed CentralGoogle Scholar
- Hipp AL, Eaton DA, Cavender-Bares J, Fitzek E, Nipper R, Manos PS. A framework phylogeny of the American oak clade based on sequenced RAD data. PLoS One. 2014;9(4):e93975.View ArticlePubMedPubMed CentralGoogle Scholar
- Eaton DA. PyRAD: assembly of de novo RADseq loci for phylogenetic analyses. Bioinformatics. 2014. doi:10.1093/bioinformatics/btu121.PubMedGoogle Scholar
- Zhang XZ, Zeng CX, Ma PF, et al. Multi-locus plastid phylogenetic biogeography supports the Asian hypothesis of the temperate woody bamboos (Poaceae: Bambusoideae). Mol Phylogenet Evol. 2016;96:118-29.View ArticlePubMedGoogle Scholar
- Takahashi T, Nagata N, Sota T. Application of RAD-based phylogenetics to complex relationships among variously related taxa in a species flock. Mol Phylogenet Evol. 2014;80:137–44.View ArticlePubMedGoogle Scholar