Targeted identification of genomic regions using TAGdb
© Marshall et al; licensee BioMed Central Ltd. 2010
Received: 11 May 2010
Accepted: 20 August 2010
Published: 20 August 2010
The introduction of second generation sequencing technology has enabled the cost-effective sequencing of genomes and the identification of large numbers of genes and gene promoters. However, assembling DNA sequences into a representation of the complete genome remains costly, especially for the larger and more complex plant genomes.
We have developed an online database, TAGdb, that enables researchers to identify paired read sequences that share identity with a submitted query sequence. These tags can be used to design oligonucleotide primers for the PCR amplification of the region in the target genome.
The ability to produce large numbers of paired read genome tags using second generation sequencing provides a cost-effective method for identifying genes and promoters in large, complex or orphan species without the need for whole genome assembly.
The availability of a reference genome sequence is a goal of many researchers in crop genomics. The first plant genomes to be sequenced were Arabidopsis and rice [1, 2], using standard Sanger sequencing of tiled genomic fragments maintained in bacterial artificial chromosome (BAC) vectors. Plant genome projects are gathering pace with the application of new technology, and the genomes of several plant species have now been determined, with researchers quickly adopting second generation sequencing to gain insight into their favourite genome. Many crop genomes are large, complex and often polyploid, making genome sequencing a major challenge [4, 5]. Without a complete genome sequence, researchers are often limited to the available sets of expressed sequence tags (ESTs) or genome survey sequences.
The advent of second generation sequencing enables the production of large quantities of genome sequence data at relatively low cost. Second generation sequence data take the form of vast numbers of relatively short sequence reads, often produced as pairs with a known orientation and an approximate distance between the two reads. While assembling these data into a representation of the genome requires highly redundant sequencing and a large number of overlapping sequence reads, only relatively low coverage is required for the identification of genes and gene promoters. However, storing, interrogating and visualising the quantity of sequence tag data required for such analysis remains a challenge. We have developed TAGdb, a web-based tool for the identification of Illumina GAII paired read sequences that match a query sequence. When the tool is combined with PCR amplification and sequencing, it is possible to determine the sequence of specific local genomic regions. This tool is applicable for gene and promoter discovery in a wide range of species and greatly facilitates comparative genomics and molecular marker discovery in orphan crops or those with large and complex genomes.
Construction and content
TAGdb is a web-based query tool for aligning query sequences to an existing database of paired short read data. The system has been developed using Perl and MySQL and runs on a public web server (http://flora.acpfg.com.au/tagdb/). The interface allows researchers to upload or input a FASTA formatted nucleotide sequence up to 5000 base pairs long for comparison with one or more paired read sequence libraries. The input sequence is aligned with short reads of significant identity using MEGABLAST, and the results are visualised using a custom web interface. Each submitted job has a unique identifier, and an email is sent to the user once the job has completed. The processing time per search varies with the length of the input sequence and the number of matching reads, but searches are generally completed and results returned within 20 seconds to 5 minutes. The database currently hosts data for Brassica, wheat, barley, Pongamia pinnata and Nicotiana alata, and details of these data sets are available on the help pages. Additional Illumina paired short read sequence data may be hosted on request.
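The input constraints described above (a FASTA formatted nucleotide sequence of at most 5000 base pairs) can be checked locally before a query is submitted. The following is a minimal Python sketch of such a check and is not TAGdb's own code, which is written in Perl. The function names, the restriction to a single record per query and the acceptance of IUPAC ambiguity codes are assumptions made for illustration.

```python
import re

# Upper limit on query length stated for TAGdb, in base pairs.
MAX_QUERY_LENGTH = 5000
# Assumption: IUPAC nucleotide codes (including ambiguity codes) are allowed.
VALID_BASES = re.compile(r"^[ACGTUNRYKMSWBDHV]+$")


def parse_single_fasta(text):
    """Return (header, sequence) for FASTA text holding exactly one record."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("query must start with a FASTA header line ('>')")
    if any(line.startswith(">") for line in lines[1:]):
        # Assumption: one sequence per query.
        raise ValueError("expected exactly one sequence in the query")
    header = lines[0][1:].strip()
    sequence = "".join(lines[1:]).upper()
    return header, sequence


def validate_query(text, max_length=MAX_QUERY_LENGTH):
    """Check that a FASTA query is a nucleotide sequence within the limit."""
    header, sequence = parse_single_fasta(text)
    if not sequence:
        raise ValueError("query contains no sequence")
    if not VALID_BASES.match(sequence):
        raise ValueError("query contains non-nucleotide characters")
    if len(sequence) > max_length:
        raise ValueError(
            f"query is {len(sequence)} bp; the limit is {max_length} bp"
        )
    return header, sequence
```

A query longer than the limit would need to be split into segments of 5000 bp or fewer and submitted as separate jobs.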
TAGdb output for the AtWD40 genomic region
The number of individual and paired reads from different species matching the AtWD40 genomic region.
Sequencing the B. rapa WD40 (BrWD40) genomic region
A large number of tags matching the AtWD40 region were identified from B. rapa, with fewer tags identified for B. oleracea and B. nigra, reflecting the relative abundance of reads for these species in the dataset. No tags were identified for Pongamia, wheat or barley. Given the size of the wheat genome and the relatively low sequence coverage of this genome in our dataset, it is not possible to conclude whether an ortholog of AtWD40 is absent in wheat. In contrast, the higher sequence coverage for wheat chromosome arm 7DS, barley and Pongamia, together with the number of orthologous tags identified when these datasets are searched with rice and soybean genes respectively (data not shown), suggests that there is no orthologous gene with sequence identity to AtWD40 in Pongamia, barley, or on wheat chromosome arm 7DS.
Assembly of the B. rapa tags with the AtWD40 genomic sequence demonstrated that the majority of reads aligned with the coding region (Figure 1). This reflects the greater sequence conservation within coding regions that results from evolutionary constraints. PCR using primers specific to tags located 5' of, 3' of and within the coding region of BrWD40 amplified single products (Figure 3). Sequencing these products demonstrated that they corresponded to the B. rapa region orthologous to AtWD40 (Figure 4).
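When primers are taken from tags that flank a region, the reverse primer must be written as the reverse complement of the downstream tag so that both primers read 5' to 3' towards each other. The Python sketch below illustrates only this strand logic. The function names and the fixed primer length are hypothetical, both tags are assumed to be given on the same strand as the query, and the sketch ignores melting temperature, GC content and specificity, which any real primer design would need to consider.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def primer_pair(upstream_tag, downstream_tag, length=20):
    """Pick a naive primer pair from two tags on the query strand.

    The forward primer is the first `length` bases of the upstream tag.
    The reverse primer is the reverse complement of the last `length`
    bases of the downstream tag. `length` is an illustrative default.
    """
    if len(upstream_tag) < length or len(downstream_tag) < length:
        raise ValueError("tags must be at least as long as the primer length")
    forward = upstream_tag[:length]
    reverse = reverse_complement(downstream_tag[-length:])
    return forward, reverse
```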
We have developed an online system for the identification and visualisation of second generation paired sequence tags that match query sequences. While relatively simple in concept, the system provides a powerful, user-friendly and intuitive means to interrogate the vast quantity of data produced by the latest sequencing technologies, enabling the identification and cloning of novel genes and the surrounding genomic regions.
We have demonstrated the application of TAGdb for gene and promoter discovery in genomes where complete genome sequences are unavailable. We highlight the ability to amplify and sequence less conserved genomic regions, such as promoter sequences, using paired sequence tags where only one tag may align significantly to a query sequence. This tool can be applied for any species where paired read sequence data is available. While the current datasets are limited to a few species, the generation of short paired read sequence data is becoming increasingly common and this approach is likely to become a standard method for the discovery of genes, promoters and genetic variation in a wide range of species. While the current tool is specifically designed for Illumina paired reads, similar data produced by other sequencing platforms may also be hosted.
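The case in which only one tag of a pair aligns to the query can be illustrated with a short Python sketch. Given the identifiers of reads that matched a query, it separates pairs in which both mates matched from pairs in which a single mate matched, and lists the unmatched mates that could extend into less conserved flanking sequence. The `/1` and `/2` identifier suffixes and the function names are assumptions for illustration and do not describe how TAGdb stores pairing information.

```python
from collections import defaultdict


def split_read_id(read_id):
    """Split 'name/1' or 'name/2' into (pair_name, mate_number)."""
    pair_name, sep, mate = read_id.rpartition("/")
    if not sep or mate not in ("1", "2"):
        raise ValueError(f"unexpected read identifier: {read_id!r}")
    return pair_name, mate


def classify_pairs(matching_read_ids):
    """Group matching reads by pair.

    Returns (both_matched, mates_to_fetch):
      both_matched   - sorted pair names where both mates matched the query
      mates_to_fetch - sorted IDs of unmatched mates whose partner matched
    """
    mates_seen = defaultdict(set)
    for read_id in matching_read_ids:
        pair_name, mate = split_read_id(read_id)
        mates_seen[pair_name].add(mate)

    both_matched = sorted(
        name for name, mates in mates_seen.items() if len(mates) == 2
    )
    mates_to_fetch = []
    for name, mates in mates_seen.items():
        if len(mates) == 1:
            # The mate that did not match is the one not yet seen.
            missing = "2" if "1" in mates else "1"
            mates_to_fetch.append(f"{name}/{missing}")
    return both_matched, sorted(mates_to_fetch)
```

For example, `classify_pairs(["r1/1", "r1/2", "r2/1"])` reports `r1` as a pair with both mates matching and `r2/2` as a mate to retrieve.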
Availability and requirements
TAGdb is freely available at http://flora.acpfg.com.au/tagdb/.
Identification of Brassica WD40 orthologs
PCR primer sequences and tags used for the amplification and sequencing of the BrWD40 genomic region. Columns: primer sequence (5'-3'); primer position relative to AtWD40 start codon (bp); B. rapa product size (bp).
The authors would like to acknowledge funding support from the Grains Research and Development Corporation (Project DAN00117) and the Australian Research Council (Projects LP0989200, LP0882095 and LP0883462). Support from the Australian Genome Research Facility (AGRF), the Queensland Cyber Infrastructure Foundation (QCIF), the Australian Partnership for Advanced Computing (APAC) and Queensland Facility for Advanced Bioinformatics (QFAB) is gratefully acknowledged.
1. Arabidopsis Genome I: Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature. 2000, 408: 796-815. 10.1038/35048692.
2. Goff SA, Ricke D, Lan TH, Presting G, Wang RL, Dunn M, Glazebrook J, Sessions A, Oeller P, Varma H: A draft sequence of the rice genome (Oryza sativa L. ssp japonica). Science. 2002, 296: 92-100. 10.1126/science.1068275.
3. Imelfort M, Batley J, Grimmond S, Edwards D: Genome sequencing approaches and successes. Plant Genomics. Edited by: Somers D, Langridge P. 2009, Gustafson J: Humana Press (USA), 345-358.
4. Edwards D, Batley J: Plant genome sequencing: applications for crop improvement. Plant Biotechnology Journal. 2009, 7: 1-8. 10.1111/j.1467-7652.2008.00392.x.
5. Imelfort M, Edwards D: Next generation sequencing of plant genomes. Briefings in Bioinformatics. 2009, 10: 609-618. 10.1093/bib/bbp039.
6. Batley J, Edwards D: Genome sequence data: management, storage, and visualization. Biotechniques. 2009, 46: 333-336. 10.2144/000113134.
7. Zhang Z, Schwartz S, Wagner L, Miller W: A greedy algorithm for aligning DNA sequences. Journal of Computational Biology. 2000, 7: 203-214. 10.1089/10665270050081478.
8. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic Local Alignment Search Tool. Journal of Molecular Biology. 1990, 215: 403-410.
9. Geneious v4.6. [http://www.geneious.com/]
10. Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucl Acids Res. 2004, 32: 1792-1797. 10.1093/nar/gkh340.
11. Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research. 1994, 22: 4673-4680. 10.1093/nar/22.22.4673.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.