Analysing complex Triticeae genomes – concepts and strategies
© Spannagl et al.; licensee BioMed Central Ltd. 2013
Received: 17 May 2013
Accepted: 10 July 2013
Published: 6 September 2013
Skip to main content
© Spannagl et al.; licensee BioMed Central Ltd. 2013
Received: 17 May 2013
Accepted: 10 July 2013
Published: 6 September 2013
The genomic sequences of many important Triticeae crop species are hard to assemble and analyse due to their large genome sizes, (in part) polyploid genomes and high repeat content. Recently, the draft genomes of barley and bread wheat were reported thanks to cost-efficient and fast NGS technologies. The genome of barley is estimated to be 5 Gb in size whereas the genome of bread wheat accounts for 17 Gb and harbours an allo-hexaploid genome. Direct assembly of the sequence reads and access to the gene content is hampered by the repeat content. As a consequence, novel strategies and data analysis concepts had to be developed to provide much-needed whole genome sequence surveys and access to the gene repertoires. Here we describe some analytical strategies that now enable structuring of massive NGS data generated and pave the way towards structured and ordered sequence data and gene order. Specifically we report on the GenomeZipper, a synteny driven approach to order and structure NGS survey sequences of grass genomes that lack a physical map. In addition, to access and analyse the gene repertoire of allo-hexaploid bread wheat from the raw sequence reads, a reference-guided approach was developed utilizing representative genes from rice, Brachypodium distachyon, sorghum and barley. Stringent sub-assembly on the reference genes prevented collapsing of homeologous wheat genes and allowed to estimate gene retention rate and determine gene family sizes. Genomic sequences from the wheat sub-genome progenitors enabled to discriminate a large number of sub-assemblies between the wheat A, B or D sub-genome using machine learning algorithms. Many of the concepts outlined here can readily be applied to other complex plant and non-plant genomes.
The Triticeae tribe comprises some of the most economically important crops including bread wheat, barley and rye. Bread wheat ranked third in world crop production with 681 million tons in 2011 , making it an indispensable source for our everyday diet. Domestication history of Triticeae dates back several thousand years. They consequently have a complex genetic history .
The genomes of many Triticeae species including wheat and barley appear to be extremely challenging to assemble and analyse due to their genome size, high repeat content, complex transposable element structure and, in part, polyploid genome [3, 4].
With an estimated genome size of ~5 Gb the barley genome is significantly larger than the human genome, however exceeded by the bread wheat genome with ~17 Gb. Bread wheat contains an allo-hexaploid genome with three sub-genomes, namely the A, B and D sub-genome. It has been speculated that the bread wheat genome originated from hybridization between cultivated tetraploid emmer wheat (AABB) and diploid goat grass (DD) about 8000 years ago .
Complementing the genome size, many Triticeae genomes show a very high degree of repetitive elements (~80% in bread wheat ). These repetitive stretches can span several 100kbs and have a complex architecture and composition. Consequently assembly of long scaffolds or even whole chromosome sequences from NGS survey sequences using current technology is still an open problem [Review ]. Although synteny is pronounced in grasses and in general the gene order appears to be well conserved , repeat and TE activities as well as structural rearrangements contribute to the formation of pseudogenes, gene fragments and changes in local gene order .
With the availability of economic and rapid NGS technologies whole-genome sequence surveys of many grass genomes including a number of Triticeae species were generated recently [9, 10]. While the direct assembly of reads into pseudo-chromosomes or scaffolds is hampered by the genomes’ repetitiveness and size, the gene inventory, gene order and chromosomal positioning of genetic elements such as genes and markers is of high interest not only for breeders but also helps to shed light on the evolutionary history of the respective plants and the Triticeae in general.
Consequently, numerous novel strategies and concepts were developed over the last few years to order [11, 12], analyse  and compare complex Triticeae genomes even in the light of the challenges and limitations described. Here, we highlight a few of these concepts that were applied to analyse the recently published genome sequences of barley  and bread wheat  and describe the methodology used in more detail.
Many of the concepts and strategies described and discussed here are not restricted to the Triticeae but can be applied to other complex grass and plant genomes that have not been sequenced and analysed so far due to their genome size and/or polyploid nature.
In a first step, orthologous gene clusters were computed for the reference genomes of three grass genomes (Brachypodium distachyon, Sorghum bicolor and Oryza sativa) originating from different grass sub-families plus publicly- available barley full-length cDNAs. The OrthoMCL software version 1.4  was used to calculate pairwise sequence similarities between all input protein sequences using BLASTP . Markov clustering of the resulting similarity matrix defines the orthologous cluster structure.
A total of 86,944 coding sequences from these four grasses were clustered into 20,496 gene families. 9,843 clusters contained sequences from all four genomes.
In a second step, one representative gene model was selected from each of the 20,496 orthologous gene clusters, using the following strategy: 1. BLASTX  of all contigs from the bread wheat LCG assembly against all grass genes used in the OrthoMCL analysis; 2. Select the gene in each cluster that concentrates the most distinct wheat contigs; 3. If genes pool the same number of wheat contigs, select the one with the longest protein sequence as the representative gene model.
Genes with associated wheat contigs were identified for all but 445 gene clusters resulting in a total of 20,051 representative gene models.
The 454 sequence reads are clustered to the orthologous gene representatives using conserved sequence similarity. Thereby, pre-processing of the raw 454 sequencing data is a helpful approach to reduce computational complexity. Especially repetitive sequences, which represent up to 80% of the grass genomes [18, 19], considerably extend search space, and thus memory as well as time requirements. Moreover, increased gene copy numbers caused by repetitive mechanisms and transposable element activity complicate and hampers downstream gene family analysis. For a fivefold whole genome shotgun dataset from hexaploid wheat using this strategy, 77% of raw sequence data was removed and 24 Gb out of 83 Gb of sequence kept for the orthologous group assemblies.
Afterwards reads were aligned to the orthologous gene representatives using BLASTX  and reported alignments filtered for minimum alignment length (minimum of 30 amino acids (AA)) and minimum alignment identity. Thereby, different alignment identity thresholds were applied accounting for the different evolutionary distances between bread wheat and the reference species used for the orthologous group analysis. Overall, 4 million (6%) of the wheat repeat-filtered sequencing reads passed the applied alignment criteria, and approximately two-third (68%) of these matched a single OGR. For the remaining 454 reads, that map multiple representatives, only the first-best match (FBH), were considered. Almost all (19,483) of the selected orthologous representatives were detected by at least one 454 sequence read indicating the good representation of the majority of wheat genes in the whole genome shotgun sequence dataset.
Sequence information and quality scores were extracted for the 4 million aligned 454 reads from the original sequencing library files. For each orthologous group, individual assemblies were computed using the Newbler de novo assembly software, which generates larger contigs based on overlaps between reads . Thereby, the assembly parameter specifying the minimum alignment identity (mi) to accept overlaps between reads has to be selected with caution. On the one hand, a too relaxed mi parameter may cause a collapse of homeologous sequence copies, but, on the other hand, too stringent mi parameter would overestimate gene copies due to sequencing errors. Two methods applied for selection of the best mi parameter are discussed in the following section. The final set of “sub-assembly” sequences was created by combining singletons (454 reads that do not overlap with any other reads) and the assembled contigs (> = 100 bp).
As already shown in Figure 2 and Figure 3, the correct selection of the Newbler parameter –mi, which specifies the minimum identity in order to combine two reads during the assembly, is a major factor affecting both the assembly outcome and the downstream gene family analysis based on predicted gene copy numbers. To estimate the correct Newbler parameterisation two simulation experiments were applied: (i) simulation of an whole-genome sequencing experiment for a diploid reference genome of similar genome size and repeat and gene composition and (ii) an in silico generation of an hexaploid gene set with similar sequence differences as in the homeologous genes of the target species.
Established reference assemblies or complete genome sequences of a reference species with similar genome size and structure can be used for the in silico simulation reflecting the experimental setting as achieved for the target genome sequence. Thereby, genomes with different degree of polyploidy could be used for this evaluation by adjusting the expectation of the reference gene family size. For the bread wheat genome analysis, the repeat masked genome sequence of maize (ZMb73, version 5b.60; http://www.maizegdb.org) was used and first, the “real” gene family size (gene count) identified by OrthoMCL clustering of the 39,656 annotated maize proteins and the reference protein data sets. However, because highly similar gene sequences, which could not be discriminated with the applied identity cut-offs during the sequence assembly, would artificially lead to significant underestimation of predicted gene counts, CD-Hit  nucleotide similarity clustering was performed at different similarity levels (97%, 99% and 100%, respectively) and all but one of the redundant sequences was retained. We selected more than 15,000 OrthoMCL cluster including one or more maize genes as well as exactly one OGR. Thus, counting the number of grouped maize proteins provided reference values for subsequent analysis and comparisons.
Next, the input data set for the orthologous assembly was designed and a collection of raw 454 sequencing reads created from the maize reference genome assembly applying with 5× genome coverage along with a uniform error rate of 0.5% . Then, the above described protocol for the orthologous assembly was carried out with varying minimum overlap identity between 97% mi and 100% mi, respectively, and the gene copy number was predicted for the orthologous assembly.
Complementary to the above described evaluation method, a second evaluation was implemented, which in addition to the gene count simulates the polyploid effect size in the whole genome sequencing experiment. For this purpose, the transcript sequences of the 28,236 rice gene models (RAP2 annotation), including coding sequences as well as non-coding 3′ and 5′ UTRs and introns, were aligned against the orthologous gene representatives. Thereby the gene copy number in the rice data set was determined for each OG representative. Then, the aligned rice transcripts were triplicated and random single nucleotide variants (SNVs) were introduced to simulate the sequence similarity of the homeologous genes in the polyploid target genome. For example, for bread wheat nucleotides were randomly exchanged with p = 0.01 (1 nucleotide change per 100 bp) in coding transcript sequences and p = 0.04 (4 nucleotide exchanges per 100 bp) in non-coding transcript sequences, respectively. Then, 454-like shotgun reads were simulated (5× genome coverage), re-mapped against their corresponding OG representatives, sub-assembled with varying minimum overlap identity (97% mi, 99% mi and 100% mi) and, finally, the gene copy number predicted.
As expected, depending on the required stringency for read overlaps, the copy number prediction largely differs. In both evaluation methods 454 reads of different gene copy were collapsed by using 97% mi, whereas, requiring perfect alignment overlaps (100% mi) clearly results in an overestimation of the gene family sizes. As expected, we observed nearly perfect agreement of the expected 1:1 relationship between the expected and observed gene family size in the maize simulation, as well as the 1:3 relationship in the polyploidy simulation by using 99% alignment identity. This parameterisation allows compensating for sequencing errors, by simultaneously maintaining distinct gene copies that share high sequence similarity in coding regions. However, highly similar gene copies from large gene families may still collapse into single assemblies resulting in slight overall gene number underestimates.
Besides sheer size and high repeat content hexaploidy makes the genome of bread wheat extremely challenging to analyse. Being able to differentiate between homologous genes of the three wheat sub-genomes (A, B and D) is of high importance not only for marker design and breeding but also to address open questions in the evolution and domestication of bread wheat.
Recent NGS sequencing approaches of the bread wheat genome generated a 5-fold 454 survey sequence, without being able to separate A, B and D sub-genome directly . In principle, this can be facilitated applying a chromosome sorting technology , however, sorted sequences were only available for wheat linkage group 1 at this time .
As a consequence, an alternative approach was established to classify wheat sub-assembly sequences for the A, B and D sub-genomes. Wheat sub-assemblies were generated by a stringent assembly of reads mapped to representative (for orthologous groups defined by OrthoMCL ) genes from the reference organisms Brachypodium distachyon, Hordeum vulgare, Oryza sativaand Sorghum bicolor as well as the genome sequences of the D genome donor species Ae. tauschii, and the A genome relative Triticum monococcum (NCBI archive SRP004490.3), and cDNA sequence assemblies from Ae. speltoides (Trick&Bancroft, unpublished data) a member of the Sitopsis section to which the putative B genome donor belongs. Expecting that A- related sub-assemblies are more related to T. monococcum sequences, D- related sub-assemblies to Ae. tauschii, and B-related sub-assemblies to Ae. speltoides, sequence similarities of the sub-assemblies to each of these datasets would define and discriminate their origin.
In a first step, sequence similarities of each sub-assembly sequence to the wheat progenitor sequences were computed using BLAST . Only sub-assemblies with BLAST hits to all three wheat progenitor sequences were considered for classification. Although a classification into A, B or D sub-genome derived transcripts seemed possible for many sub-assemblies by applying simple similarity cut-offs or rules, fixed similarity cut-offs appeared not suitable to separate the majority of sub-assemblies with confidence .
Consequently, several machine learning approaches were applied to the similarity matrix and evaluated for their performance. A major prerequisite represents the identification of a suitable training and test data set. We made use of wheat group 1 chromosome sequences which were separated into their sub-genomes (A, B and D) using flow-sorted chromosomes . We extracted all wheat sub-assemblies associated with wheat group 1 chromosome sequences and classified them into A, B or D depending on their best hit. Every sub-genome classification was then complemented with the sub-assemblies’ similarities to T. monococcum, Ae. speltoides and Ae. tauschii sequences to create a training set compatible to all non-chr1 related sequences.
Support Vector Machine classification results on wheat genic sub-assemblies
Wheat sub-assembly sequences with intact open reading frame
Used for classification
Classified as A
Classified as D
Classified as B
With the recently published draft genome sequences of Aegilops tauschii and Triticum urartu two additional WGS datasets from wheat progenitors have since become available. As T.urartu is thought to represent the wheat A sub-genome progenitor even better than T. monococcum, using these datasets in an analysis update could potentially further improve the separation of sub-assemblies generated from the A sub-genome and therefore re-fine the overall result.
An alternative approach was taken in the separation of homeologs in the tetraploid wheat (AABB; “pasta wheat”) transcriptome, published by Krasileva et al. . Here, a post-assembly pipeline including polymorphism identification, phasing of SNPs, read sorting and re-assembly of phased reads was used to separate homeologs.
Barley is diploid and diverged from wheat approximately 12 million years ago (Mya) . Barley is the first cereal genome where an anchored physical map has been reported on the basis of the entire genome. The physical map was constructed by combining different clone libraries using FPC  to assemble clones into contigs.
The genome size of 5 Gb made barley the largest genome for which a physical map has been constructed. 570,000 BAC clones were assembled into 9,265 fingerprinted contigs (contigs). Several clone libraries were combined to prevent that genomic regions would remain unrepresented. End-merging of contigs allowed to bridge overlapping contigs together, possible when marker evidences on both ends would indicate a merge. A wealth of different marker resources mostly linked to transcripts  but also by using genotyping by sequencing (GBS) technology helped to assign a majority of contigs to a genetic position. To increase the robustness of the resulting map, experimental markers were included. Apart from the construction of the physical map three different barley cultivars were sequenced under high coverage resulting in whole genome shotgun (WGS) assembly with approximately 2 gigabases (Gb) each. These WGS contigs, apart from forming the basis to derive gene models also helped to extend sequences on the physical map by extending clone end sequences. The WGS contig decorated physical map was then taken to infer a genetic position for physical contigs. Altogether 4 Gb out of the 5 Gb cumulative contig map length was anchored.
The anchoring and ordering of FPC and sequence contigs along the barley genome demonstrates that GBS technology combined with a transcript derived map is powerful to develop a rich and deep physical map even for the complex and large barley genome. Along with the annotation of genes it describes the functional and physical assembly of a cereal genome.
The GenomeZipper (GZ) [11, 12] is a synteny driven approach to order and structure NGS survey sequences of grass genomes that lack a physical map. The approach can be applied to a variety of different data sets, i.e. 454 reads, contigs, or scaffolds grouped in individual chromosomes or chromosome arms. The approach exploits the widely conserved synteny among grasses  and uses corresponding syntenic intervals, as defined by heterologous, corresponding marker intervals among the species under investigation and suitable reference genomes to deduce a tentative ordering of genes in the corresponding regions. This approach can be undertaken for smaller regions of interest, whole chromosome arms and chromosomes on a genome wide level. The outcome is a virtual gene order map which integrates gene-based marker maps as well as conserved syntenic information from at least one sequenced model grass genome and NGS data. Initially applied on the barley genome [11, 12], the genome zipper approach has now also been used with other grass species, such as wheat chromosome 4A , rye chromosome B  and Lolium .
In the first step, filtering repeat regions out of the query sequence reduces computational effort in gene space estimation. In barley and wheat the repetitive amount was identified by aligning the 454 reads against the MIPS REdat Poaceae repeat library using Vmatch (http://vmatch.de).
In the next step, the conserved homologs between query sequence and one or multiple reference genomes are determined. For barley, three model grass genomes - Brachypodium distachyon, rice  and Sorghum bicolor - were used to identify homologous regions. The sequence comparisons were done using BLASTX and only first best hits with at least 75%/70% sequence identity and a minimal alignment length of 30 amino acids were considered. The syntenic conserved regions are defined by the density of homologous matches between query and reference genome using a sliding window approach.
During the last step, the NGS data is structured and ordered using a high resolution genetic map and orthologous genes obtained in the previous step. Thereby the intervals as defined by genetically ordered markers are used as a scaffold to project the likely order of corresponding genes in these intervals. Gene order in these intervals is deduced from the order found in the respective reference genomes, whereby evolutionary closest reference genomes get highest rank. Once ordered, additional evidences, such as full length cDNAs and/or ESTs can be attached to the ordered gene scaffold.
The linear ordered gene maps provide a valuable resource for a variety of applications: (i) for marker development and to assist positional cloning , (ii) for comparative analyses of the conserved gene space , and (iii) to resolve the structure of a genome/chromosome and to establish the colinearity between grass genomes[34, 35].
Next generation sequencing technologies now start to enable to decipher large plant genomes such as those from many Triticeae (wheat, barley, rye) which until recently were difficult to access due to severe technological and economic restrictions. The assembly and analysis of these complex genomes remains a challenge and requires novel concepts and strategies. Here, we outlined and described a number of these concepts, that were developed and used to analyse and order genes from the recently published genome sequences of barley and hexaploid wheat. For barley, a physical and genetic map integration approach allows to positionally anchor ~21,000 genes. A complimentary approach, the GenomeZipper concept, makes use of the conserved gene order between grass reference genomes and many monocot crop genomes to anchor and order genes by an in silico approach in the complex wheat and barley genomes.
The hexaploid nature with three highly homologous sub-genomes makes the genome of bread wheat extremely challenging to assemble and analyse. To access the gene inventory, a set of orthologous representative genes was constructed from related and finished reference grass genomes. Wheat NGS reads from a 454 5× whole genome sequence survey were mapped onto these orthologous representatives and separately assembled in a stringent way to avoid collapsing of homologous (sub-genome derived) genes. Genic sub-assembly sequences were subsequently classified into A, B or D- sub-genome derived with a machine-learning assisted approach making use of differing sequence similarities to the A, B and D sub-genomes progenitor species.
The approaches and concepts outlined here may be readily applied to other complex genomes, even beyond plants, where direct sequence assembly and analysis is hampered by size and/or polyploidy but related, less complex reference genomes are available.
The authors want to thank all collaboration partners involved in the individual studies reviewed here. In particular, all members from the International Barley Sequencing Consortium (IBSC), the UK wheat consortium (Neil Hall, Anthony Hall and Mike Bevan et al.) as well as IPK Gatersleben (Nils Stein et al.).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.