AprGPD: the apricot genomic and phenotypic database

Background Apricot is cultivated worldwide because of its high nutritional value and strong adaptability. Its flesh is delicious and has a unique, pleasant aroma, and its kernel is consumed as a nut. The apricot genome has been sequenced, and transcriptome, resequencing, and phenotype data are being generated at an increasing rate. As new information emerges, these data need to be integrated and disseminated. Results To better manage the continuous addition of new data and make them more convenient to use, we constructed the apricot genomic and phenotypic database (AprGPD, http://apricotgpd.com). At present, AprGPD contains three reference genomes, 1692 germplasms, 306 genome resequencing datasets, and 90 RNA sequencing datasets. A set of user-friendly query, analysis, and visualization tools has been implemented in AprGPD. We have also performed a detailed analysis of 59 transcription factor families for the three apricot genomes. Conclusion AprGPD comprises six modules: species, germplasm, genome, variation, product, and tools. The data integrated in AprGPD will be helpful for the molecular breeding of apricot. Supplementary Information The online version contains supplementary material available at 10.1186/s13007-021-00797-4.

Modern sequencing technology has rapidly increased the amount of genome and transcriptome data for apricot, and phenotypic data on its rich germplasm resources have also accumulated. A comprehensive database is therefore needed to integrate these data for the convenience of researchers and breeders of apricot and other Rosaceae species. Here, we have collected three high-quality, chromosome-scale genome assemblies in AprGPD: P. sibirica (F106, 219 Mb with a contig N50 of 6.70 Mb, eight chromosomes), P. armeniaca (Sungold, 217 Mb with a contig N50 of 7.13 Mb, eight chromosomes), and kernel consumption apricot (Longwangmao, 225 Mb with a contig N50 of 6.91 Mb, eight chromosomes). The reference genome version is 0.9 for all three. AprGPD includes 90 RNA sequencing (RNA-seq) datasets, 306 genome resequencing datasets, and 1692 germplasms (58 phenotypic data types). All data included in AprGPD were collected by the authors. Based on these data, we have conducted gene annotation, transcriptome analysis, a detailed analysis of transcription factor (TF) families, annotation of variation sites, construction of metabolic pathways, and classification of quantitative traits, and we have prepared figures for the visual representation of the phenotypic data. AprGPD also offers various query, analysis, and visualization tools. Six modules are available in AprGPD: species, germplasm, genome, variation, product, and tools. This database is a useful resource for apricot research.

Construction and content
The data of AprGPD are stored in a MySQL database on an Ubuntu server. User-friendly interfaces are developed using JavaScript, HTML5, and CSS3. Query searches are implemented using JavaScript and PHP. AprGPD is divided into six modules based on the different data and applications. The data and functions of each module are described as follows:

Species module
The species module presents the distribution and phenotypic traits (flower, leaf, fruit, seed, and tree type) of the nine species of the Armeniaca section in China (Additional file 1: Table S1) and provides a visual representation of each species (Fig. 1).

Germplasm module
The germplasm module contains 58 phenotypic data types for 1692 germplasms. Clicking on a germplasm name displays pictures of its flowers, fruits, and seeds together with its distribution information (Fig. 2b). Traits of interest can also be observed quickly (Fig. 2a). The pie chart provides statistical information on qualitative traits, and the frequency chart provides an overall summary of quantitative traits (Fig. 2c). We divided the quantitative traits into three levels by using μ ± 0.5246δ for normally distributed data (μ and δ indicate the mean and standard deviation, respectively) and y = G ± (1/2 + n)x (n = 0, 1, 2, 3, 4; G is the median of the data, and x is the grade difference) for non-normally distributed data, as described previously [7,8].
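The three-level grading for normally distributed traits can be illustrated with a minimal sketch. It assumes that values below μ − 0.5246δ are graded "low", values above μ + 0.5246δ are graded "high", and the rest are "medium"; the treatment of boundary values, the use of the sample standard deviation, the function name `grade_normal_trait`, and the example values are all assumptions made for illustration, not details taken from AprGPD. Under a normal distribution, these cut-offs place roughly 30%, 40%, and 30% of samples in the three levels.

```python
from statistics import mean, stdev


def grade_normal_trait(values, k=0.5246):
    """Grade each value as 'low', 'medium', or 'high' using mean ± k * SD.

    Returns the lower cut-off, the upper cut-off, and the list of grades.
    """
    mu = mean(values)
    sd = stdev(values)  # sample standard deviation (an assumption)
    lower, upper = mu - k * sd, mu + k * sd
    grades = []
    for v in values:
        if v < lower:
            grades.append("low")
        elif v > upper:
            grades.append("high")
        else:
            grades.append("medium")
    return lower, upper, grades


# Hypothetical kernel dry weights in grams, for illustration only.
kernel_dry_weight = [0.25, 0.30, 0.33, 0.35, 0.38, 0.41, 0.45]
lower, upper, grades = grade_normal_trait(kernel_dry_weight)
for weight, grade in zip(kernel_dry_weight, grades):
    print(f"{weight:.2f} g -> {grade}")
```

The median-based rule for non-normally distributed traits is not covered here because the grade difference x depends on the trait.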

Genome module
The genome module is divided into five submodules: gene annotation, metabolic pathway, gene family, gene network, and transcription factors. The data and basic search functions of the submodules are described below.

Gene annotation
A total of 98,615 protein-coding genes were predicted from the three genome assemblies: 32,959 genes from P. sibirica (F106), 32,669 genes from P. armeniaca (Sungold), and 32,987 genes from P. armeniaca × P. sibirica (Longwangmao). Gene structure, Gene Ontology (GO) and KEGG annotations (Fig. 3a), gene expression profiles (Fig. 3b), and variations (Fig. 3c) are displayed interactively in this submodule. The other submodules (metabolic pathways, transcription factors, gene network, and gene family) are integrated with gene annotation.

Metabolic pathways
KEGG pathway maps are graphical representations of reaction networks, and each map is compiled from experimental evidence in the literature [9]. We have obtained the KEGG orthologs for the Sungold, F106, and Longwangmao genomes and generated their metabolic pathways. Moreover, we have constructed five additional pathways (Additional file 1: Figure S1) related to flowering [10]. When the mouse cursor is moved over a green node, all the gene IDs corresponding to the enzyme are displayed, and the expression of these genes during fruit development, during kernel development, and in different tissues can be viewed conveniently. Each pathway is displayed on a separate web page, and detailed information on all the genes in the pathway is displayed in tabular form at the bottom of the page (Fig. 4).

Gene network
We have collected 90 RNA-seq samples for F106 (30), Sungold (22), and Longwangmao (38), covering fruit, kernel, leaf, flower, and flower bud at different stages. The expression level of each gene is calculated and normalized to fragments per kilobase of transcript per million mapped fragments (FPKM). Based on the FPKM values, the co-expression network is constructed (Fig. 6).
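As an illustration of the normalization step, the sketch below computes FPKM from its standard definition and then links genes whose expression profiles are strongly correlated. The text does not state which method or weight measure AprGPD uses to build its co-expression network, so the Pearson correlation, the cut-off value, the function names, and the toy expression values are assumptions chosen only to make the idea concrete.

```python
from itertools import combinations
from math import sqrt


def fpkm(fragments, gene_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (gene_length_bp * total_mapped_fragments)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def coexpression_edges(expression, cutoff):
    """Return (gene1, gene2, r) for gene pairs with |r| >= cutoff."""
    edges = []
    for g1, g2 in combinations(sorted(expression), 2):
        r = pearson(expression[g1], expression[g2])
        if abs(r) >= cutoff:
            edges.append((g1, g2, r))
    return edges


# 500 fragments on a 2 kb gene in a library of 20 million mapped fragments.
example_fpkm = fpkm(500, 2000, 20_000_000)

# Hypothetical FPKM profiles across four samples.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.1, 5.9, 8.0],
    "geneC": [4.0, 1.0, 3.0, 2.0],
}
edges = coexpression_edges(expression, cutoff=0.9)
```

A profile with no variation across samples would make the correlation undefined, so real data would need such genes filtered out first.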

Transcription factors
Transcription factors are important regulators of plant growth, development, and responses to external stress. A total of 59 transcription factor (TF) families have been analyzed in detail at the whole-genome level in the three apricot genomes. TFs are identified using HMMER [12] (version 3.0) and iTAK [13]. The presence of the TF domain is further confirmed using SMART [14] and CDD [15]. Multiple sequence alignments are performed using ClustalX [16] (version 2.1), and phylogenetic trees are constructed using MEGA (version X) [17]. MEME Suite [18] is used to identify motifs in the TF protein sequences, with a motif width of 6-200 and a maximum of 20 motifs. Syntenic blocks are inferred using MCScanX [19]. Gene structure, chromosome location, and collinearity are visualized using TBtools (version 1.089) [20]. The phylogenetic trees are refined for display using iTOL [21]. Additionally, links to PlantTFDB [22] are provided to give users further information.

Variation module
The single-nucleotide polymorphisms (SNPs) and insertions-deletions (indels) of 306 accessions have been collected from our previous study. Variants with a minor allele frequency above 0.05 have been retained using PLINK [23] (version 1.9), and all retained variants are annotated using ANNOVAR [24]. Annotation information is obtained for 8,838,420 SNPs and 1,650,013 indels. The variation information of each sample, including variant positions and allele types, is sorted, and a comparative search tool is provided (Fig. 8a) that can be queried by gene ID or location; the variation information is displayed in tabular form (Fig. 8b). Variations can also be browsed interactively through JBrowse [25] by clicking on the corresponding icon (Fig. 8c). Statistical information on variation is provided through pie charts and tables, which can be accessed by clicking the pie chart icon (Fig. 7c). In addition, information about structural variations (SVs, ≥50 bp) was obtained through a comparison among the three genomes. In total, 2306 insertions (843, 721, and 742 for F106 vs. Sungold, F106 vs. Longwangmao, and Sungold vs. Longwangmao, respectively) and 1296 deletions (427, 409, and 460 for F106 vs. Sungold, F106 vs. Longwangmao, and Sungold vs. Longwangmao, respectively) were obtained, which affect 2888 genes.
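The minor allele frequency filter can be illustrated with a short sketch. In AprGPD the filtering is done with PLINK; the code below only shows the underlying calculation for diploid, biallelic variants. The genotype encoding (alternate allele count of 0, 1, or 2 per sample, `None` for a missing call), the inclusive treatment of the 0.05 boundary, the function names, and the example variants are assumptions made for illustration.

```python
def minor_allele_frequency(genotypes):
    """Minor allele frequency of one biallelic variant.

    genotypes holds the alternate allele count (0, 1, or 2) for each
    diploid sample; None marks a missing call and is ignored.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1 - alt_freq)


def keep_common_variants(variants, threshold=0.05):
    """Keep variants whose minor allele frequency reaches the threshold."""
    return {
        variant_id: genotypes
        for variant_id, genotypes in variants.items()
        if minor_allele_frequency(genotypes) >= threshold
    }


# Hypothetical variants, for illustration only.
variants = {
    "snp1": [0, 0, 1, 0, 2, 0, None, 0, 0, 0],  # MAF = 3/18
    "snp2": [0] * 19 + [1],                     # MAF = 1/40, rare
    "snp3": [2, 2, 2, 2, 1, 2, 2, 2, 1, 2],     # alt is major, MAF = 2/20
}
common = keep_common_variants(variants)
```

Taking the smaller of the two allele frequencies means that a variant is treated the same way whichever allele happens to be the reference.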

Product module
Apricot has high nutritional and commercial value [2]. Apricot kernel is rich in protein, crude fat, calcium, phosphorus, and iron, and has good nutritional and health effects. Apricot fruit has a pleasant aroma and is rich in nutrients such as sugar, vitamin C, and carotenes. We have developed several apricot products, including foods and cosmetics. The edible products include apricot wine, apricot kernel oil [26], apricot kernel tofu [27], apricot kernel noodles, and apricot kernel biscuits. The cosmetic products include facial masks, essential oils, shampoos, and shower gels. Detailed information on each product, such as the product image and patent number, is displayed in this module by clicking on the product name (Fig. 9).

Tools module
The BLAST tool, implemented with ViroBLAST [28], provides additional BLAST options for retrieving specific information from the gene, CDS, and protein databases of all reference genomes. Information on the genomes (Sungold, Longwangmao, and F106) and the variations of 306 accessions (reference: F106) is visualized with JBrowse [25]. Bud dormancy plays an important role in fruit yield. The chilling requirement (CR) is the amount of low-temperature exposure that deciduous fruit trees need to release their natural dormancy, and it is an important quantitative trait for measuring dormancy release. For the convenience of researchers, we developed a tool for calculating chilling requirements (CHR); the tool includes the 0-7.2 °C, Utah, and dynamic models. The CHR tool is developed with PyQt5 (https://github.com/Chilling-requirements/Chilling-software/releases), and this release also offers additional functions, such as a custom model and a growing degree hours model for heat requirements. Expression visualization [29] helps users quickly observe the expression patterns of related genes. The results of the synteny analysis for the three pairs of reference genomes (Sungold vs. F106, Sungold vs. Longwangmao, and F106 vs. Longwangmao) are visualized with SynVisio (https://github.com/kiranbandi/synvisio).
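To make the chilling models more concrete, the following sketch computes chill accumulation from hourly temperatures under the 0-7.2 °C model and the Utah model. It is not the code of the CHR tool: the function names, the inclusive boundaries, and the example temperatures are assumptions, and the Utah step values are the commonly cited ones, which may differ from those used in AprGPD. The dynamic model is omitted because it requires a more involved two-step calculation.

```python
def chilling_hours_0_7_2(hourly_temps_c):
    """0-7.2 °C model: one chilling hour per hour within 0-7.2 °C."""
    return sum(1 for t in hourly_temps_c if 0.0 <= t <= 7.2)


# (upper temperature bound in °C, chill units per hour), commonly cited
# Utah model steps; hours warmer than the last bound count as -1.0.
UTAH_STEPS = [
    (1.4, 0.0),
    (2.4, 0.5),
    (9.1, 1.0),
    (12.4, 0.5),
    (15.9, 0.0),
    (18.0, -0.5),
]


def utah_chill_units(hourly_temps_c):
    """Utah model: sum temperature-dependent chill units over all hours."""
    total = 0.0
    for t in hourly_temps_c:
        for upper_bound, units in UTAH_STEPS:
            if t <= upper_bound:
                total += units
                break
        else:
            total -= 1.0  # warmer than 18.0 °C negates accumulated chill
    return total


# Hypothetical hourly temperatures in °C, for illustration only.
temps = [-1.0, 0.5, 3.0, 6.0, 7.2, 8.0, 10.0, 14.0, 17.0, 20.0]
hours = chilling_hours_0_7_2(temps)
units = utah_chill_units(temps)
```

The two models can give quite different totals for the same record because the Utah model weights moderately cool hours and subtracts warm ones, whereas the 0-7.2 °C model simply counts hours in range.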

Utility and discussion
For the convenience of users, query tools are provided in each module. With these tools, users can obtain information on phenotypes, gene annotation, transcription factors, metabolic pathways, the co-expression network, and variation. The help module provides a detailed user tutorial. To make AprGPD easier to understand, several demonstrations are provided below.

Species and phenotype query
If users are interested in species information, they can click on the species number. For example, a user can click on No. 4 to retrieve information and images of P. holosericeae. The germplasm module addresses queries on phenotype data. Boxplots are used to show the three levels of quantitative traits. Pie charts and frequency charts are used to illustrate the distribution of qualitative and quantitative traits, respectively. For example, users can click on the name Gongfuoxing to show its image and location information, and they can obtain an overview of kernel dry weight from the frequency chart. All quantitative traits are divided into three grades. Users who want samples with a larger kernel dry weight can select the maximum level (≥ 0.39 g) to filter the samples.

Gene query
The query functions help users quickly find information on genes of interest. As an example, suppose that information on PaF106G0500018564.01 is needed. First, the user enters the gene ID in the annotation module and clicks on the search button. The expression level of the gene is shown as a histogram, with the highest expression at the end of fruit development and expression that gradually decreases and then increases during kernel development. Second, the user enters the gene ID in the Pathway module and finds that it is not annotated to any metabolic pathway. Third, the user searches for this gene in the transcription factor module and finds that it belongs to a subfamily of the WOX family. Fourth, the user searches for this gene in the Gene network module and sets the weight at 0.35, and two co-expressed genes (WRKY, PaF106G0500019874.01 and PaF106G0600024291.01.T01) are shown. Fifth, the user searches for the variation information of this gene using the comparative search tool of the variation module and finds that PaF106G0500018564.01 has 20 SNPs and 9 indels.

Fig. 9 The products involved in apricot. a The product image. b Details of the corresponding apricot-derived product

Transcription factor family query
The detailed analysis of transcription factors is of great use to researchers. The MIKC_MADS family is used as an example. First, the user can find the number of members on the MIKC_MADS family page. Second, a comparative analysis of all MIKC_MADS family genes in the three species is shown as a phylogenetic tree (Additional file 1: Figure S2) and a table, in which the genes are divided into 14 subfamilies. Third, the distributions of conserved domains of the MIKC_MADS family genes are shown (Additional file 1: Figures S3-5). For example, SOC1 contains motif17, and SVP contains motif5 and motif6. Fourth, the chromosomal collinearity analysis indicates segmental duplication of AP3, SVP, AGL6, SOC1, AGL9, and AP1/FUL members (Additional file 1: Figures S6-8). The heatmaps of the related genes are displayed by the expression visualization tool (Additional file 1: Figures S9-11).

Value and future directions
At present, AprGPD contains phenotypic, genome, and transcriptome data and variation information, together with a series of analyses based on these data, which provide valuable information for apricot research. We completed a detailed analysis of 59 transcription factor families from the three genomes in AprGPD, which allows comparisons among apricots and with other Rosaceae species. We developed a tool for calculating chilling requirements, which provides convenience for researchers. In the future, we will add new genome, transcriptome, and phenotype data of apricot and will establish and improve a multi-omics analysis platform.

Conclusions
In this study, we established the apricot genomic and phenotypic database, which provides comprehensive phenotype, genome, and transcriptome resources for apricot. AprGPD is composed of six modules and includes genomes, predicted genes and proteins, functional annotations, and gene expression profiles. In addition, the database provides various query, analysis, and visualization tools. AprGPD will become an active platform for researchers and breeders of apricot.