A high-throughput quantification of resin and rubber contents in Parthenium argentatum using near-infrared (NIR) spectroscopy

Background Guayule (Parthenium argentatum A. Gray), a plant native to semi-arid regions of northern Mexico and southern Texas in the United States, is an alternative source for natural rubber (NR). Rapid screening tools are needed to replace the current labor-intensive and cost-inefficient method for quantifying rubber and resin contents. Near-infrared (NIR) spectroscopy is a promising technique that simplifies and speeds up the quantification procedure without losing precision. In this study, two spectral instruments were used to rapidly quantify resin and rubber contents in 315 ground samples harvested from a guayule germplasm collection grown under different irrigation conditions at Maricopa, AZ. The effects of eight different pretreatment approaches on improving prediction models using partial least squares regression (PLSR) were investigated and compared. Important characteristic wavelengths that contribute to prominent absorbance peaks were identified. Results Using two different NIR devices, ASD FieldSpec®3 performed better than Polychromix Phazir™ in improving R2 and residual predicative deviation (RPD) values of PLSR models. Compared to the models based on full-range spectra (750–2500 nm), using a subset of wavelengths (1100–2400 nm) with high sensitivity to guayule rubber and resin contents could lead to better prediction accuracy. The prediction power of the models for quantifying resin content was better than rubber content. Conclusions In summary, the calibrated PLSR models for resin and rubber contents were successfully developed for a diverse guayule germplasm collection and were applied to roughly screen samples in a low-cost and efficient way. This improved efficiency could enable breeders to rapidly screen large guayule populations to identify cultivars that are high in rubber and resin contents.

guayule can generate NR latex with much lower Type I Hev-b protein, which is important to reduce allergic reactions to medical products-a major problem in the application of Hevea rubber [2,4].
Resin and rubber are the two major industrial components in guayule, which are obtained using a sequential solvent extraction protocol in two steps: treating with polar solvent (acetone, ethanol) to extract resin followed by a non-polar solvent (hexane, cyclohexane, chloroform, etc.) to extract rubber [2,[5][6][7][8]. Accelerated solvent extraction (ASE) has been used in analytical chemistry in recent years to accurately determine chemical components [9]. The application of ASE to quantify resin and rubber content in guayule has been previously published [2,10]. Compared to other solvent-based methods such as Soxhlet and a high-speed homogenizer (Polytron) [5,6], ASE shortened extraction time by using high temperature and nitrogen pressure while requiring low solvent volumes [2,8,11]. However, despite these improvements, using traditional wet chemistry methods to determine chemical contents are time-consuming, labor-intensive, and expensive [12]. The methods to quantify the chemical compositions after extraction procedures are not easily scaled-up to hundreds or thousands of samples, which is the level required for germplasm evaluation in plant breeding programs. Thus, inexpensive, high-throughput, and rapid quantification methods are needed to determine biopolymer components for guayule genetic improvement.
Near-infrared (NIR) spectroscopy, based on vibration properties of organic molecule chemical bonds and their interactions with NIR radiation, is a technique used for rapid, reliable, and non-destructive prediction of chemical components in plants, animal products, food, and pharmaceuticals [12][13][14][15][16]. In the last several decades, NIR spectroscopy has been applied to determine resin and rubber content in guayule [2,4,6,17,18]; however, these studies were either too early to use advanced multivariate data analysis approaches or the sample size of varieties/ accessions was small with a very limited range of rubber and resin contents. Moreover, no previous studies made comparisons between different NIR instruments with varying spectral ranges and resolution. Given these limitations, the objectives of this study were to (1) develop PLSR models using NIR spectroscopy to estimate rubber and resin content for a guayule germplasm containing 56 different accessions; (2) identify optimal pretreatment approaches and ranges of wavelengths for obtaining the most robust and reliable PLSR models; and (3) compare two NIR spectral instruments in prediction accuracy of PLSR models for the estimation of rubber and resin content.

Plant Materials
A total of 49 and 56 guayule accessions (49 were included in 56 accessions) from a USDA germplasm collection were planted under water-stressed and non-stressed field conditions, respectively for 2.5 years with each accession replicated three times [19]. Finally, a total of 315 guayule samples were harvested from water-stressed (147) and non-stressed (168) field plots at Maricopa, Arizona, USA. Trials were irrigated differentially to reach suitable stress levels following the soil water depletion model described by Hunsaker and Elshikha [20]. Two homogenous plants from each plot were harvested in spring of 2018. Harvested plants were dried in an open area then chipped using Troy-Bilt Model 47321 Chipper/Shredder (Garden Way, Inc., Troy, New York) with a 9.53-mm round-holed screen. After drying, the chipped samples were ground using a hammer mill with a 6-mm screen (Model W6H, Schutte-Buffalo Hammermill, LLC, Buffalo, NY). The samples were then fine-ground using a Model 4 Wiley mill to pass the material through 2-mm sieves (Thomas Scientific, Swdesboro, NJ). The dried and ground samples were stored in small sealed plastic bags at 4 °C to limit risk of oxidation.

Accelerated solvent extraction (ASE) for rubber and resin quantification
Fine dried ground samples weighing 1 ± 0.0005 g were loaded into stainless steel cells (11 mL) of an ASE (Model 200, Dionex Corp., now ThermoFisher Scientific Inc., Waltham, MA), which was equipped with an auto-sampler carousel, a solvent controller that accommodated up to four different solvents, and a collection tray that allowed up to 24 samples to be sequentially extracted [2,10]. The entire machine was connected to a nitrogen tank. All ASE extraction cells were prepared uniformly. A cellulose microfilter (20-mm diameter) was first placed at the bottom of each cell, which was then filled with dry ground samples mixed with diatomaceous earth (DE). Glass collecting vials (250 mL) were placed into the collection tray. The first cell, as a control, was only filled with DE. Extraction was performed under the following conditions (Table 1): Each sample was first extracted with acetone at 100 °C and 1500 psi of nitrogen, with a heating time of 5 min, static extraction time of 10 min, purge time of 60 s, and flush volume 100%, followed by cyclohexane extraction at 140 °C under the pressure of 1500 psi of nitrogen, heating time of 7 min, static extraction for 20 min, purge time of 60 s, and a flush volume of 100%. Three static cycles were applied to each extraction stage. Following this, the extractant was transferred into a pre-weighed glass vial (250 mL). Evaporation of the solvent from the extract was done in a fume hood at room temperature for 2 weeks and dried in an oven at 55 °C for 24 h before weighing again. Three samples were randomly selected from each ASE batch (11 samples) for moisture content estimation, which was determined by drying a 5-g sample at 105 °C in an oven for 24 h, and then kept 8 h in a desiccator before weighing. The moisture content values of each batch were averaged and used to adjust rubber and resin contents for further use with the following adjustment formula:

NIR spectroscopy analysis
Five near-infrared (NIR) spectral scans were collected for each dry ground sample using an ASD FieldSpec ® 3 spectrophotometer (Malvern Panalytical, Cambridge, UK) and a handheld Polychromix Phazir ™ model Phazir 1624 spectrophotometer (Polychromix Inc., Wilmington, MA, USA) under ambient temperature. The dry ground samples were stirred and remixed during the scanning intervals. For the ASD scans, the "Muglight" attachment was used with the spectrophotometer, which provided a light source and specialized tray for holding samples during spectral data collection. For the Polychromix device, samples were placed in a plastic laboratory boat with the instrument resting on top of the sample. Standard reference targets were scanned after scanning every fifth sample and tenth sample for ASD FieldSpec ® 3 and Polychromix Phazir ™ , respectively. The reference target for the ASD was a small 99% Spectralon disk designed to fit in the sampling tray of the Muglight attachment. For (1) Adjusted resin = % dry resin content Adjusted rubber = % dry rubber content the Polychromix Phazir ™ , the reference target covered the bottom of a weighing pan to avoid light leaking as provided by the manufacturer and used according to the manufacturer's recommendation. Spectral data of 2151 wavelengths were obtained from the ASD FieldSpec ® 3 with the reflectance ranging from 350 to 2500 nm at 1 nm interval, while spectral data of only 100 wavelengths were obtained from the Polychromix Phair ™ with reflectance ranging from 1600 to 2400 nm at 8 nm intervals. As for the reflectance spectra obtained from the ASD FieldSpec ® 3, only the wavelengths between 750 and 2500 nm were used for further analysis since this range covers the NIR region.

Chemometrics and data analysis Spectral data pretreatment
The Unscrambler X ® software (v.10.5, Camo Software AS) was used to perform data pretreatment and establish partial least squares regression (PLSR) models for rubber and resin contents. As a first step to identify and remove outliers, the spectral data was subjected to principal component analysis (PCA). PCA provided a score plot to show the degree of similarity and difference among the samples [21]. From PCA, Hotelling's T 2 and Q-residuals explained how far a projection of the sample is away from the origin, and whether the pattern of variables for a sample deviates largely from the model [22]. The samples with both high Hotelling's T 2 values and Q-residuals (if any) were detected as outliers and removed before further analysis. Spectral pretreatments were intended to suppress various adverse effects coming from physical properties of the sample, technical errors during measurements or, simply, instrument noise [21]. In our experiment, eight different types of pretreatments were applied to the spectral data to test and compare their effects on the performance of PLSR models, particularly through improvements in the signal-to-noise ratio and in the prediction accuracy. These eight different pretreatments included the following: smoothing using a median filter with segment size of 3, normalization by the mean, baseline correction, standard normal variate (SNV), de-trending (DT) with polynomial order of two, and Savitzky-Golay (SG) first and second derivative calculation. The SG 1st and 2nd derivatives with the window size of 11 (smoothing points = 23) were applied for the spectra obtained from ASD FieldSpec ® 3, and the SG 1st and 2nd derivatives with a window size of 6 (smoothing points = 13) were applied for Polychromix Phazir ™ . The functions of these pretreatments are described as follows. The median filter was a nonlinear low-pass filter that removed high-frequency noise and preserved edges in the sample spectrum [23,24]. Normalization normalized residuals by transforming data to reach a linear relationship between samples. Baseline correction removed baseline offsets from the spectral data [25]. SNV reduced scattering interferences or (physical) variabilities between samples (i.e. centers at a zero mean intensity and unified standard deviation) [21,25]. In this way, SNV corrected intensities and baseline deviations due to light scattering possibly generated by impurities or density fluctuations in the samples [25]. De-trending (DT) was a polynomial baseline correction method for suppressing the baseline shifts and curvilinearity in spectra [26]. The SG derivatives removed baseline shifts and separated broad and overlapping NIR bands without significantly increasing spectral noise [21,27].

Multivariate data analysis
Validation was used for the assessment of the PLSR results. Cross-validation (CV), or internal validation, divides a dataset into several subsets (or segments) with each one containing a certain amount of samples [28].
In one epoch, the first subset of data was used for training the model, and the remaining subsets were used for model testing. For every epoch, the training and testing data subsets were different. External validation (EV), however, divides a dataset into two different complementary subsets: one for training and another one for testing [28]. In our study, CV and EV were carried out for waterstressed (DRY), non-stressed (IRR) and combined (ALL) datasets ( Fig. 1). Each of the three datasets was divided into two subsets, calibration (CAL) and validation (VAL), comprising 80% and 20% of the original samples, respectively. The VAL subsets were constructed by selecting every fifth scan of each sample and were used as a test set to evaluate the robustness of the developed model. The VAL subsets were only used in EV as testing subsets while CAL subsets were used as training sets for both CV and EV. The stability and robustness of the models were improved by removing non-significant variables through the Martens' uncertainty test during CV [29]. For the ALL dataset, 1260 and 315 data points were assigned to the CAL and VAL subsets, respectively. For the DRY dataset, CAL and VAL subsets contained 588 and 147 data points, respectively, while the IRR dataset contained 672 and 168 data points under CAL and VAL subsets, respectively. In the CV, the CAL subset was used for model training and testing, where 20, 17 and 18 segments with each segment containing 63, 34 and 37 samples were used for ALL, DRY and IRR dataset, respectively. In the EV, the CAL to VAL subsets with a ratio of 4:1 were used for model training and testing. For all the above divisions, PCA was conducted to check the effects of different irrigation conditions and the homogeneity of sub-datasets.
The Unscrambler X ® software (v.10.5, Camo Software AS) was then used for the establishment of all the following partial least square regression (PLSR) models. The performance of the PLSR models was determined by the following statistical parameters: where y i represents the measured values and f i represents the predicted values. A R 2 closer to 1 means a better fit of the measured values (y i ) to the regression line, and root mean square error (RMSE) determines the precision of the calibration model [30].Additionally, the residual predicative deviation (RPD) was calculated as: A higher RPD value demonstrates a greater prediction power of the model [30]. In agricultural applications, especially for the materials that are more complicated in physical nature, RPD greater than 2.0 can be applied to rough screening and RPD greater than 3.0 can be interpreted as good in control quality of NIR models [31].
Finally, the optimal pretreatment approach was selected based on the above statistical parameters and used to compare the spectral data with varying wavelength ranges between two different NIR machines. An interpretation of the regression coefficients of the developed models was undertaken to determine the important chemical components contributing to rubber and resin Standard deviation of measured extracts RMSE contents. Based on this, calibration models were further upgraded using only the partial and characteristic wavelength regions from previous PLSR models.

Rubber and resin contents
Phenotypic variations were observed for adjusted rubber and resin content in guayule accessions grown under different irrigation conditions (Table 2). In general, guayule accessions grown under stressed conditions had higher resin and rubber content compared to non-stressed conditions.  [20,32].

Principal component analysis
The total of 1575 spectra obtained from ASD FieldSpec ® 3 were divided into two groups based on different scenarios: one was based on irrigation conditions (DRY and IRR) and another was based on calibration set (CAL) and validation set (VAL). The PCA results were shown in Fig. 2 to analyze the spectral variability between different sample groupings. The first, second and third PC accounted for 74.2%, 21.1% and 3.0% variations of raw spectral data, respectively. In total, the first three PC represented 98.3% variation of the raw spectral data. All samples in the DRY dataset distributed evenly in the IRR dataset (Fig. 2a). Likewise, all the samples in the CAL dataset distributed evenly in the VAL dataset (Fig. 2b). Thus, the division of the samples was homogenous and can be used for the following spectral analysis.

PLSR models based on whole wavelengths
The prediction models established for resin and rubber quantification using eight different pretreatments under two different validation methods were compared in Tables 3 and 4. In general, the pretreatments improved the power and precision for rubber and resin predictions compared to no pretreatment. Even though both CV and EV concede a considerable confidence level in suppressing overfitting problems for PLRS models [33], external validation (EV) in the current study generated better calibration models than cross validation (CV) when the same  pretreatment was used. This can be indicated by higher R 2 R 2 p > R 2 cv , smaller RMSE (RMSE p < RMSE cv ), smaller standard error (SEP < SECV), and higher RPD (RPD p > RPD cv ) (Tables 3 and 4). Similar results were also observed by previous studies in the estimation of biochemical methane potential (BMP) [30] and stem water potential (ψ stem ) for the variety-specific model [33]. When under the same validation context (e.g. EV), the pretreatment combination of standard normal variate (SNV), detrending (DT) and Savitzky-Golay 2nd derivative resulted in the most precise and robust PLSR model as compared to other pretreatments, indicating the efficiency of SNV in removing multiplicative interferences of scattering and particle size, of DT in suppressing baseline shifts and curvilinearity in diffuse reflectance spectra, and of Savitzky-Golay 2nd derivative in improving deconvolution of some overlapping spectral peaks to unveil hidden information under these peaks [30,34]. Reliable models have also been constructed when the combinations of all or part of the three pre-processing approaches (i.e. SNV, DT, 2nd derivative) were used in previous studies [30,[35][36][37]. However, exceptions occurred with CV for resin content (Table 3), where the SNV + DT and smoothing + baseline + normalization resulted in the highest R 2 cv values for ALL and IRR datasets, respectively. This indicated that variations may occur when the same pretreatment approach was used for different datasets or under different validation processes.

Acetone and cyclohexane extracts (or resin and rubber)
The PLSR models for resin and rubber content were constructed by using NIR spectra obtained from rapid measurements in dry ground guayule stems. After the combination of SNV, DT and Savitzky-Golay 2nd derivative preprocessing, the R 2 and RPD values for predicting adjusted resin were slightly higher than that for adjusted rubber (Tables 3 and 4), indicating that the models established for resin were more robust and precise than for rubber. From Tables 3 and 4, the R 2 cv values for resin content were 0.729, 0.822 and 0.688 in ALL, DRY and IRR dataset, respectively, while the R 2 p values were 0.764, 0.829 and 0.765 with RPD p values of 2.055, 2.415 and 2.065 for the three datasets, respectively. Likewise, for rubber content, the R 2 cv values after pretreatment of SNV, DT and Savitzky-Golay 2nd derivate were 0.733, 0.756, and 0.728 in ALL, DRY and IRR dataset, respectively, while the R 2 p values were 0.756, 0.78 and 0.755 with RPD p values of 2.024, 2.128 and 2.020 for the three datasets, respectively. The greater the RPD value is, the more reliable the model will be [31], indicating that models established for resin were more robust and reliable than for rubber, and the models established separately for the samples grown under different conditions (i.e. DRY and IRR) could better reflect and differentiate the predicting power for the traits of interests. To illustrate, under both CV and EV, the R 2 and RPD values from the DRY dataset were higher than putting all the dry and irrigated samples together while R 2 and RPD values from the IRR dataset were lower than the ALL dataset, meaning that putting all the samples from different growing conditions together might mitigate or weaken the predictive power and accuracy of models. Undeniably, our NIR models seem not as powerful as the ones (R 2 > 0.95) established by previous researchers [2,4,6,18,38]; however, the previous studies on rubber-producing plants were all based on a limited number of accessions and large numbers of NIR scans, and this technical strategy might lead to overestimation of the stability and accuracy in the prediction of PLSR models. In contrast, our models were based on 56 different accessions representing a USDA guayule germplasm collection and included wild and improved genetic materials that were planted under different growth conditions [19]. Thus, these models could be more representative for general use in predicting guayule resin and rubber.

Comparisons between two different NIR instruments
A comparison between two commonly used NIR instruments (ASD FieldSpec ® 3 and Polychromix Phazir ™ ) was made after the determination of the optimal pretreatment method, which was the SNV + DT + Savitzky-Golay 2nd derivate under EV context (Table 5). Not surprisingly, the ASD models with both whole wavelengths (750-2500 nm) and partial wavelengths (1100-2400 nm, 1600-2400 nm) generated significantly better predictive power than Polychromix models (1600-2400 nm), which can be seen from higher R 2 p and RPD p values in Table 5. This study is the first one that compares two commonly used NIR instruments for resin and rubber quantification. The better models established from ASD FieldSpec ® 3 than Polychromix Phazir ™ data were probably due to different signal/noise ratio, different ways that the samples were presented during measurements, the different stability of equipments or different spectral resolutions between two instruments Bangalore et al. [44].
In general, reflectance at different wavelengths depended on and were closely associated with the structures of chemical components. The original reflectance plot was provided (Fig. 3a). The second derivative of reflectance for guayule resin highlighted prominent peaks centered at 1184,1385,1668,1690,1886,1914,2248,2278,2297, and 2324 nm, while the sharp peaks and valleys for guayule rubber occurred at 1205,1389,1410,1686,1716,1736,1781,1883,1914, and 2260 nm (Fig. 3b, c). The similarity of prominent wavelengths for resin and rubber (Fig. 3d) was confirmed by the significant Pearson's correlation coefficient between resin and rubber content in guayule (p = 0.038). However, this result contradicted the previous Table 5 Comparisons between models constructed using the NIR spectra from ASD FieldSpec ®

and Polychromix Phazir ™ with various wavelengths
All the models were constructed using the optimal combination of standard normal variate (SNV)   . 3 Important characteristic wavelengths contributed to partial least squares regression (PLSR) models for predicting resin and rubber content in guayule samples grown under dry condition. a The original absorbance spectra plot for all the scanned samples using ASD FieldSpec ® 3 spectrophotometer; b Second derivative of NIR reflectance spectra from ASD FieldSpec ® 3 spectrophotometer for resin content; c Second derivative of NIR reflectance spectra from ASD FieldSpec ® 3 spectrophotometer for rubber content; d Resin and rubber calibration beta coefficients from ASD FieldSpec ® 3 spectrophotometer as a function of wavelengths results where the absence of high correlation between resin and rubber was observed despite common loadings [2]. The NIR spectrum obtained from samples grown under different irrigation conditions (i.e., DRY and IRR), even though with varying regression coefficients across wavelengths (Additional file 1: Table S1), generated similar peaks and valleys at the above spectral regions, which showed significant contributions to the calibration models for resin and rubber content in guayule (Fig. 3b, c). The wavelength regions between 1140 and 1250 nm and 1360-1450 nm were correlated to the second overtone of C-H stretching [39], where the peaks at 1184, 1385 nm for resin and 1205, 1389, and 1410 nm for rubber were located. The wavelengths from 1640 to 1800 have been described as the first overtone of C-H stretching combination bands [40], where the peaks at 1668, 1690 nm for methyl groups in resin and 1686, 1716, 1736, 1781 nm for polyisoprenes in rubber were located. Similarly, Suchat et al. [2] and Black et al. [6] also found principle absorption bands within these ranges. Meanwhile, the bands from 2200 to 2440 nm with peaks centered at 2248, 2278, 2297 and 2324 nm for resin as well as 2260 nm for rubber could be due to the C-H stretching/C-H deformation combination [40] caused by surrounding molecules of rubber particles in guayule. In accordance, previous studies [2,6] also identified prominent vibrations within the ranges of C-H stretch and deformation combination of CH 2 from lipids, and C-H/C=O stretch combination of aldehyde structure [39]. This is not surprising because lipids help form one of the major membrane components surrounding rubber particles [41] and aldehydes serve as functional groups bonded to natural rubber [42]. In addition, the prominent peaks occurring at 1886 and 1914 nm for resin as well as 1410, 1883, and 1914 nm for rubber were likely to be located within the O-H stretch first overtone within the ranges of 1400-1460 nm and 1900-1960 nm, which were associated with the absorption by water molecules.

PLSR models based on selected characteristic wavelengths
The best correlative PLSR models were developed within the range of 1100-2400 nm for resin and rubber under ALL, DRY and IRR datasets except for rubber in IRR dataset, where the range of 1600-2400 nm generated the best PLSR model ( Table 5, Fig. 4). With this selected range, the R 2 p for resin and rubber for the DRY dataset were improved to 0.846 and 0.793 with RPD p of 2.542 and 2.195, respectively. Likewise, the R 2 p for resin and rubber for the IRR dataset were improved to 0.768 and 0.757 with RPD p of 2.079 and 2.030, respectively. In general, the PLSR models for guayule resin and rubber for the DRY dataset were again better than the IRR dataset, and the models for guayule resin were again more powerful than rubber. However, the PLSR models based on the characteristic wavelengths (i.e. 1140-1250 nm, 1360-1450 nm, 2200-2440 nm) were slightly less powerful than those based on the partial range (1100-2400 nm) ( Table 5). Selecting a few characteristic wavelengths doesn't always help improve model prediction precision. Similar  were also observed by Kopicky [17], in which the best model was constructed within the range of 1100-1800 nm instead of the characteristic wavelengths for rubber content. However, more optimization techniques such as genetic algorithms (GA), stepwise elimination (SE), simulated annealing (SA), and generalized simulated annealing (GSA) can be implemented with PLS or internal PLS (iPLS) to improve the accuracy of predictions for selected characteristic wavelengths [43][44][45]. The principle behind iPLS is to split the spectra into smaller equidistant subintervals and develop PLS models on each subinterval [43,46]. Future research is needed to further optimize the selected characteristic wavelengths.

Conclusion
We have successfully constructed reliable high-throughput PLSR models for the determination of resin and rubber in dry, ground, guayule biomass using NIR spectroscopy. The prediction power of the models for resin content were better than rubber content and the increased spectral resolution of data from the ASD FieldSpec ® 3 improved the prediction accuracy as compared to data from the Polychromix Phazir ™ . Samples collected from different growing conditions are suggested to be separated for independent model establishment. In general, the established models might be used in the future to form a simple, low-cost and efficient pipeline to maximize the phenotyping efficiency in determining guayule rubber content. The established models could enable guayule breeders to efficiently screen large populations for individuals with superior traits of interests.