Application of machine learning algorithms and feature selection in rapeseed (Brassica napus L.) breeding for seed yield

Shahsavari, Masoud; Mohammadi, Valiollah; Alizadeh, Bahram; Alizadeh, Houshang

doi:10.1186/s13007-023-01035-9

Research
Open access
Published: 16 June 2023

Plant Methods volume 19, Article number: 57 (2023)


Abstract

Background

Studying the relationships between rapeseed seed yield (SY) and its yield-related traits can assist rapeseed breeders in the efficient indirect selection of high-yielding varieties. However, since the conventional and linear methods cannot interpret the complicated relations between SY and other traits, employing advanced machine learning algorithms is inevitable. Our main goal was to find the best combination of machine learning algorithms and feature selection methods to maximize the efficiency of indirect selection for rapeseed SY.

Results

To achieve that, twenty-five regression-based machine learning algorithms and six feature selection methods were employed. SY and yield-related data from twenty rapeseed genotypes were collected from field experiments over a period of 2 years (2019–2021). Root mean square error (RMSE), mean absolute error (MAE), and determination coefficient (R²) were used to evaluate the performance of the algorithms. The best performance with all fifteen measured traits as inputs was achieved by the Nu-support vector regression algorithm with quadratic polynomial kernel function (R² = 0.860, RMSE = 0.266, MAE = 0.210). The multilayer perceptron neural network algorithm with identity activation function (MLPNN-Identity) using three traits obtained from stepwise and backward selection methods appeared to be the most efficient combination of algorithms and feature selection methods (R² = 0.843, RMSE = 0.283, MAE = 0.224). Feature selection suggested that the set of pods per plant and days to physiological maturity along with plant height or first pod height from the ground are the most influential traits in predicting rapeseed SY.

Conclusion

The results of this study showed that MLPNN-Identity along with stepwise and backward selection methods can provide a robust combination to accurately predict the SY using fewer traits and therefore help optimize and accelerate SY breeding programs of rapeseed.

Background

Rapeseed (Brassica napus L.) is the second-largest source of oilseed production worldwide after soybean, producing 13% of the world's oil [1, 2]. The extensively cultivated double-low rapeseed, also known as canola, contains a very low amount of saturated fatty acids, palmitic C16:0 and stearic C18:0 (about 7% in total), and a high amount of unsaturated fatty acids, oleic C18:1 (about 62%), linoleic C18:2 (20%), linolenic C18:3 (10%) and eicosenoic C20:1 (1%), making it a healthy and nutritious edible oil for humans [3, 4]. Owing to the energy crisis, rapeseed is also increasingly considered a promising renewable green energy source with minimal air pollution [5,6,7]. Due to the growing demand for rapeseed oil in the food and industrial sectors, attempts to increase its yield have become inevitable [8,9,10,11].

Increasing seed yield (SY) has always been one of the major aims of breeding programs [12]. However, measuring SY in large breeding populations with thousands of genotypes is labor-intensive and time-consuming [13, 14]. Controlled by various genes and greatly affected by the environment, seed yield breeding is a highly complicated and nonlinear process [15, 16]. As a result, breeding strategies based on secondary traits (e.g., yield-related traits) that are highly linked to a primary trait enable plant breeders to efficiently identify promising lines at early stages of growth [17].

Thus far, conventional statistical methods, for instance, correlation coefficient analysis, principal component analysis (PCA), path analysis, and multiple linear regression (MLR), have been widely used in rapeseed to elucidate relationships between SY and other traits [18,19,20,21]. Nonetheless, these methods presume a linear relationship between the variables, so they are neither adequate nor comprehensive in displaying the interactions between traits and SY and cannot analyze the highly nonlinear and complicated relationships between SY and other traits [22].

Machine learning algorithms have been effectively applied to optimization and prediction of many complicated biological systems [23]. The use of nonlinear machine learning algorithms in yield component analysis and indirect selection research allows for a better understanding of nonlinear relations between yield and yield-related traits, and consequently, more precise yield prediction, which can efficiently improve breeding programs [24].

Lately, the multilayer perceptron neural network (MLPNN), one of the most well-known artificial neural networks (ANNs), has been widely employed for predicting and modeling complicated characteristics, such as yield, in several breeding programs and in other areas of plant sciences [17, 25]. This algorithm consists of many highly interconnected neurons that work simultaneously to solve a particular problem. MLPNN algorithms can also capture the intrinsic knowledge in datasets and determine the interaction between output and input variables without prior physical considerations [25, 26].

Support vector machine (SVM) is another advanced and popular machine learning algorithm with the ability to find both linear and nonlinear relationships in data [12, 27]. The benefits of employing SVMs are a large number of hidden units and better learning problem formulation, which leads to the quadratic optimization task [28]. Support vector regression (SVR) is the regression version of SVM and has recently been used to solve problems in agricultural and plant sciences [17, 25, 29,30,31].

Although some studies have used ANNs to predict the yield of rapeseed, they have been based on meteorological data (air temperature and precipitation) and information about mineral fertilization [4, 32, 33]. No study has applied machine learning algorithms to agronomic yield-related traits to predict the SY of rapeseed or to introduce indirect selection criteria. Furthermore, apart from MLR, ANN and SVR algorithms, there are other methods, such as ridge regression (RR), stochastic gradient descent (SGD) and Bayesian regression, which have not been widely used to predict SY and have remained relatively unknown to scientists in plant breeding. Therefore, in the present study, we aimed to (a) develop and optimize regression-based machine learning algorithms to predict the SY of rapeseed, (b) introduce the most important indirect selection criteria for SY of rapeseed through feature selection methods, and (c) maximize the efficiency of indirect selection for SY of rapeseed by finding the best combination of regression-based machine learning algorithms and feature selection methods. To the best of our knowledge, this study is the first comprehensive report on applying a diverse range of machine learning algorithms in the field of plant breeding.

Materials and methods

Plant material and field experiments

Field experiments were conducted at the research farm of the Seed and Plant Improvement Institute (SPII), Karaj, Iran, in the 2019–2020 and 2020–2021 growing seasons. Twenty genotypes were cultivated in the first year, and nineteen genotypes were cultivated in the second year (due to insufficient seed availability for one of the genotypes). The experiment was carried out in a randomized complete block design (RCBD) with three replicates. The genotypes comprised 7 lines obtained from a pedigree experiment, a restorer line (R2000), 7 hybrids obtained from crosses between the 7 lines and R2000, and 5 cultivars (Nilufar, Neptune, Nima, Okapi and Nafis). Each plot consisted of four rows 4 m long, with 30 cm between rows and 5 cm between plants within rows. The distance between two plots was 60 cm. At the end of each growing season, seed yield (kg per plot, SY) was recorded along with important yield-related traits: plant height (cm, PH), pods per main branch (number, PMB), pods per axillary branches (number, PAB), pods per plant (number, PP), branches per plant (number, BP), main branch length (cm, MBL), first pod height from the ground (cm, FPH), pod length (cm, PL), days to start of flowering (number, DSF), days to end of flowering (number, DEF), days to physiological maturity (number, DPM), flowering period (number, FP), thousand seed weight (g, TSW), seeds per pod (number, SP) and stem diameter (mm, SD). The traits were recorded on 10 randomly selected plants from the two intermediate rows of each plot (to prevent marginal effects), and their averages were used in the training and testing datasets of the algorithms.

Data preprocessing

Data normalization is an essential preprocessing step for learning from data [34]. Moreover, when the numerical input variables have very different scales, machine learning algorithms do not perform effectively because they can be dominated by the variables with large values [35]. To address these issues, data were normalized using the Yeo-Johnson method [36], and all the traits were scaled to a [0, 1] range using Eq. (1):

$${X}_{scaled}=\frac{X-{X}_{min}}{{X}_{max}-{X}_{min}}$$

(1)

where ${X}_{scaled}$ is the scaled value for input $X$, and ${X}_{max}$ and ${X}_{min}$ are the maximum and minimum values of $X$, respectively.
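The scaling of a single trait to the [0, 1] range can be illustrated with a minimal Python sketch. The function name `min_max_scale` and the plant height values are hypothetical and chosen only for illustration; the Yeo-Johnson step is not shown.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the [0, 1] range (min-max scaling)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant trait has no spread; map it to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical plant heights in cm (illustrative values, not data from the study).
plant_height_cm = [120.0, 135.0, 150.0, 165.0, 180.0]
scaled = min_max_scale(plant_height_cm)
print(scaled)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

After scaling, the smallest observed value maps to 0 and the largest to 1, so no trait dominates the others purely because of its measurement unit.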

Learning curve

A learning curve displays an algorithm's validation and training scores for different numbers of training samples. It is a fundamental technique for determining how much we would benefit from adding training data and, consequently, the optimal size of the training set [37]. To achieve this, different numbers of samples (from 25 to 90) were entered into the MLR and ridge regression algorithms as the training set. To evaluate each training set size, a five-fold cross-validation was implemented, and the mean and 95% confidence interval of the mean square errors (MSEs) were calculated for both the training and validation sets. With increasing training set size, the training and validation scores of both algorithms converge to a fairly low value (Fig. 1). The MSE of the validation sets approximately reached its lowest value at a training size of 80, where its confidence interval overlapped with that of the training set. Thus, a training size of about 80 is adequate, and more training data would bring no benefit. The dataset was randomly divided into two subsets with 81 samples (70%) and 36 samples (30%) for training and testing, respectively.
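A random split of the 117 samples into 81 training and 36 testing samples could look like the following sketch. The function name `split_train_test` and the fixed `seed` are assumptions made for reproducibility of the illustration; the study does not report how the shuffle was seeded.

```python
import random


def split_train_test(n_samples, n_train, seed=0):
    """Shuffle sample indices and return (train, test) index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seed is an illustrative assumption
    return indices[:n_train], indices[n_train:]


# 81 + 36 = 117 samples, matching the split sizes reported in the text.
train_idx, test_idx = split_train_test(n_samples=117, n_train=81)
print(len(train_idx), len(test_idx))  # 81 36
```

Splitting by index keeps each sample's traits and its seed yield together and guarantees that no sample appears in both subsets.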

Algorithm development

Multiple linear regression

Multiple linear regression (MLR) is a predictive technique based on linear and additive relationships of explanatory variables. MLR aims to describe the relationship between two or more explanatory variables and a dependent variable by assuming a linear relationship [38]. MLR algorithm was developed according to Eq. (2).

$$\widehat{y}={\theta }_{0}+ {\theta }_{1}{x}_{1}+{\theta }_{2}{x}_{2}+\dots + {\theta }_{n}{x}_{n}+\varepsilon$$

(2)

where $\widehat{y}$ is the predicted SY, ${\theta }_{0}$ is the bias term, ${\theta }_{1}$ to ${\theta }_{n}$ are the coefficients of regression (aka feature weights), ${x}_{1}$ to ${x}_{n}$ are the input features (traits), and $\varepsilon$ is the error associated with the ${i}^{th}$ observation. Equation (2) can be concisely written in a vectorized form:

$$\widehat{y}={h}_{\theta }\left(x\right)=\theta \cdot X={\theta }^{T}X$$

(3)

where ${\theta }^{T}$ is the transpose of the algorithm’s parameter vector ($\theta$), containing the bias term ${\theta }_{0}$ and the feature weights ${\theta }_{1}$ to ${\theta }_{n}$; $X$ is the feature vector, containing ${x}_{0}$ to ${x}_{n}$, with ${x}_{0}$ always equal to 1; and ${h}_{\theta }$ is the hypothesis function, using the algorithm parameters $\theta$. The error of the algorithm is:

$$E\left(X, {h}_{\theta }\right)=\frac{1}{m}\sum_{i=1}^{m}{\left( {\theta }^{T}{X}^{(i)}-{y}^{(i)} \right)}^{2}$$

(4)

where $E\left(X, {h}_{\theta }\right)$ is the error, $m$ is the number of samples, and ${\theta }^{T}{X}^{(i)}$ and ${y}^{(i)}$ denote the predicted and actual amounts of SY for the ${i}^{th}$ sample, respectively.
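The vectorized prediction and the mean squared error of Eq. (4) can be sketched in a few lines of Python. The parameter values and the three samples below are hypothetical; each feature vector starts with 1 so that the first parameter acts as the bias term.

```python
def predict(theta, x):
    """Return theta^T x for one sample; x[0] must be 1 so theta[0] is the bias."""
    return sum(t * xi for t, xi in zip(theta, x))


def mean_squared_error(theta, X, y):
    """Mean of squared differences between predicted and actual values (Eq. 4)."""
    m = len(y)
    return sum((predict(theta, xi) - yi) ** 2 for xi, yi in zip(X, y)) / m


# Hypothetical parameters (bias 1.0, one feature weight 2.0) and samples.
theta = [1.0, 2.0]
X = [[1.0, 0.0], [1.0, 0.5], [1.0, 1.0]]
y = [1.0, 2.0, 4.0]

# Predictions are 1.0, 2.0 and 3.0, so only the last sample contributes an error of 1.
mse = mean_squared_error(theta, X, y)
print(mse)
```

Fitting an MLR model amounts to choosing the parameter vector that makes this error as small as possible on the training data.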

Ridge regression

Ridge regression (RR) is a regularized version of MLR. Compared to MLR, RR algorithm has an additional L2 regularization term equal to $\alpha \frac{1}{2}\sum_{j=1}^{n}{\theta }_{j}^{2}$ where $\alpha$ is a non-negative hyperparameter that controls the regularization strength. The L2 regularization term is added to the error function and forces the learning algorithm to not only fit the data but also keep the algorithm weights as small as possible [35].
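To make the effect of the L2 term concrete, the sketch below adds the penalty to a mean squared error. The function name `ridge_cost`, the parameter values, and the data are hypothetical; the bias term is left out of the penalty because the sum in the text starts at $j=1$.

```python
def ridge_cost(theta, X, y, alpha):
    """Mean squared error plus the L2 penalty alpha * 1/2 * sum(theta_j^2), j >= 1."""
    m = len(y)
    mse = sum(
        (sum(t * v for t, v in zip(theta, xi)) - yi) ** 2 for xi, yi in zip(X, y)
    ) / m
    penalty = alpha * 0.5 * sum(t ** 2 for t in theta[1:])  # bias theta[0] not penalized
    return mse + penalty


# Hypothetical parameters and samples; the first entry of each row is the constant 1.
theta = [1.0, 2.0]
X = [[1.0, 0.0], [1.0, 0.5], [1.0, 1.0]]
y = [1.0, 2.0, 4.0]

unregularized = ridge_cost(theta, X, y, alpha=0.0)  # plain mean squared error
regularized = ridge_cost(theta, X, y, alpha=0.1)    # adds 0.1 * 0.5 * 2.0**2 = 0.2
print(unregularized, regularized)
```

With a larger `alpha`, large weights become more expensive, which pushes the fitted weights toward zero.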

Stochastic gradient descent

Stochastic gradient descent (SGD) employs approximate gradients computed from subsets of the training dataset to update the parameters in real time. The major advantage of this strategy is that many of the feature weights will become zero throughout training. Another benefit is that it enables us to apply L1 regularization, bypassing the need to update the weights of features that are not used in the current sample, resulting in substantially quicker training when the feature space dimension is large [39]. The SGD algorithm minimizes the error given in Eq. (5):

$$E\left(X, {h}_{\theta }\right)=\frac{1}{m}\sum_{i=1}^{m}L\left({y}_{i} , f({x}_{i})\right)+ \alpha R\left(\theta \right)$$

(5)

where ${y}_{i}$ and $f({x}_{i})$ are the actual and predicted amounts of SY, respectively, $L$ is a loss function that measures how well or poorly the algorithm fits, and $\alpha R\left(\theta \right)$ is a regularization term that penalizes the algorithm complexity. Squared error (Eq. (6)), Huber (Eq. (7)), epsilon insensitive (Eq. (8)), and the squared form of epsilon insensitive are the loss functions that can be applied to the SGD algorithm.

$$\mathrm{Squared\,Error}: L\left({y}_{i}, f({x}_{i})\right)=\frac{1}{2}{\left({y}_{i}-f({x}_{i})\right)}^{2}$$

(6)

$$\mathrm{Huber}: L\left({y}_{i}, f({x}_{i})\right)=\begin{cases}\frac{1}{2}{\left({y}_{i}-f({x}_{i})\right)}^{2} & \text{if } \left|{y}_{i}-f({x}_{i})\right|\le \varepsilon \\ \varepsilon \left|{y}_{i}-f({x}_{i})\right|-\frac{1}{2}{\varepsilon }^{2} & \text{otherwise}\end{cases}$$

(7)

$$\mathrm{Epsilon\,Insensitive}: L\left({y}_{i}, f({x}_{i})\right)=\max\left(0, \left|{y}_{i}-f({x}_{i})\right|-\varepsilon \right)$$

(8)
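The four loss functions can be compared on a single residual with the following Python sketch. It assumes that the quadratic branch of the Huber loss is the squared error of Eq. (6); the function names and the example values are hypothetical.

```python
def squared_error(y, f):
    """Half the squared residual (Eq. 6)."""
    return 0.5 * (y - f) ** 2


def huber(y, f, eps):
    """Quadratic for small residuals, linear beyond eps (Eq. 7)."""
    r = abs(y - f)
    if r <= eps:
        return 0.5 * r ** 2  # assumed to match the squared error of Eq. 6
    return eps * r - 0.5 * eps ** 2


def epsilon_insensitive(y, f, eps):
    """Zero inside the eps tube, linear outside it (Eq. 8)."""
    return max(0.0, abs(y - f) - eps)


def squared_epsilon_insensitive(y, f, eps):
    """Squared form of the epsilon-insensitive loss."""
    return epsilon_insensitive(y, f, eps) ** 2


# Hypothetical actual value 3.0, predicted value 1.0, and eps = 0.5 (residual of 2.0).
print(squared_error(3.0, 1.0))                     # 2.0
print(huber(3.0, 1.0, 0.5))                        # 0.875
print(epsilon_insensitive(3.0, 1.0, 0.5))          # 1.5
print(squared_epsilon_insensitive(3.0, 1.0, 0.5))  # 2.25
```

For this large residual the Huber and epsilon-insensitive losses grow more slowly than the squared error, which is why they are less sensitive to outlying observations.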

Generalized linear model

Generalized Linear Model (GLM) is an extended form of MLR which uses a link function, and also its loss function can be differently computed based on the given distribution [40,41,42]. $\widehat{y}$ is calculated through $\widehat{y}=f({\theta }^{T}X+{\theta }_{0})$, where $f$ is the link function.

Bayesian ridge regression

Using Bayesian theory in linear regression helps an algorithm avoid overfitting and also leads to automatic methods of determining algorithm complexity using the training dataset alone [42]. Bayesian ridge regression (BRR) is similar to the RR method, except that BRR has an additional noise precision parameter ($\lambda$) other than $\alpha$. Both $\alpha$ and $\lambda$ are estimated concurrently when the algorithm is fitting, and their priors are selected from the gamma distribution. The probabilistic model of $y$ is:

$$p\left({y}^{(i)}|{X}^{(i)}, \theta , \alpha \right)=N\left({y}^{\left(i\right)}|{\theta }^{T}{X}^{\left(i\right)}, \alpha \right)$$

(9)

and Gaussian prior of coefficients $\theta$ is:

$$p\left(\theta , \lambda \right)=N(\theta |0, {\lambda }^{-1}I)$$

(10)

A comprehensive description of Bayesian regression can be found in [42, 43].

Automatic relevance determination

Automatic relevance determination (ARD) (aka relevance vector machine) was first introduced by [44] and typically results in algorithms that are sparser, which allows for quicker performance on testing dataset while preserving the same generalization error. Similar to BRR, ARD is also based on Bayesian theory with the difference that each coefficient ${\theta }_{i}$ can itself be obtained from a Gaussian distribution, centered on zero and with a precision ${\lambda }_{i}$:

$$p\left(\theta , \lambda \right)=N\left(\theta |0, {A}^{-1}\right)$$

where $A$ is a positive definite diagonal matrix with a diagonal equal to: $\lambda =\left\{{\lambda }_{1}, \dots , {\lambda }_{n}\right\}$. More information on developing an ARD algorithm is available in [44, 45].

Support vector regression

In linear support vector regression (LSVR), we aim to minimize Eq. (11):

$$\underset{\theta ,b}{\min}\frac{1}{2}{\theta }^{T}\theta +C\sum_{i=1}^{m}\max\left(0,\left|{y}^{(i)}-\left({\theta }^{T}\varnothing \left({x}^{(i)}\right)+b\right)\right|-\epsilon \right)$$

(11)

where $b$ represents the bias, $C$ is the regularization parameter, $\varnothing$ is the function that maps the input features (the identity function for LSVR), and the summed term is the epsilon-insensitive loss (its squared form can also be applied).

Epsilon support vector regression (ESVR) is another form of SVR employed in this study. ESVR is trained so that the following objective is minimized:

$$\underset{\theta ,b,\zeta ,{\zeta }^{*}}{\min}\frac{1}{2}{\theta }^{T}\theta +C\sum_{i=1}^{m}\left({\zeta }_{i}+{\zeta }_{i}^{*}\right)$$

$$\mathrm{subject\,to}\;{y}^{(i)}-{\theta }^{T}\varnothing \left({x}^{(i)}\right)-b\le \epsilon +{\zeta }_{i},$$

$$-\left({y}^{(i)}-{\theta }^{T}\varnothing \left({x}^{(i)}\right)-b\right)\le \epsilon +{\zeta }_{i}^{*},$$

$${\zeta }_{i}, {\zeta }_{i}^{*}\ge 0,\;i=1,\dots ,m$$

(12)

In this case, we penalize samples whose predictions are at least $\epsilon$ away from their true target. Depending on whether their predictions lie above or below the $\epsilon$ tube, these samples penalize the objective by ${\zeta }_{i}$ or ${\zeta }_{i}^{*}$ (Fig. 2A). As high-dimensional data make the computation complex, it is usually more advantageous to solve the dual problem, which reduces the features from N to S. The dual problem is:

$$\underset{\alpha ,{\alpha }^{*}}{\mathrm{min}}\frac{1}{2} {\left(\alpha -{\alpha }^{*}\right)}^{T}Q\left(\alpha -{\alpha }^{*}\right)+\epsilon \sum_{i=1}^{m}\left({\alpha }_{i}+{\alpha }_{i}^{*}\right)+\sum_{i=1}^{m}{y}^{(i)}\left({\alpha }_{i}-{\alpha }_{i}^{*}\right)$$

$$\mathrm{subject\,to}\;{e}^{T}\left(\alpha -{\alpha }^{*}\right)=0$$

$$0\le {\alpha }_{i},{\alpha }_{i}^{*}\le C,\;i=1,\dots ,m$$

(13)

where $e$ is the vector of all ones, $Q$ is an $m$ by $m$ positive semidefinite matrix, and ${Q}_{is}=K\left({x}_{i},{x}_{s}\right)$ is the kernel function. Here, training vectors are implicitly mapped into a higher (possibly infinite) dimensional space by the function $\varnothing$. Equation (14) shows the estimation function of the ESVR algorithm.

$$\sum_{i=1}^{m}\left({\alpha }_{i}^{*}-{\alpha }_{i}\right)K\left({x}_{i},x\right)+b$$

(14)

Different kernel functions (Eqs. (15), (16), (17), and (18)) can be employed in the ESVR algorithm.

$$\mathrm{Linear}: K\left({x}_{i},{x}_{s}\right)={x}_{i}^{T}{x}_{s}$$

(15)

$$\mathrm{Radial\,Basis\,Function}\,(\mathrm{RBF}): K\left({x}_{i},{x}_{s}\right)=\exp\left(-\gamma {\left|\left|{x}_{i}-{x}_{s}\right|\right|}^{2}\right)$$

(16)

$$\mathrm{Sigmoid}: K\left({x}_{i},{x}_{s}\right)=\tanh\left(\gamma {x}_{i}^{T}{x}_{s}+r\right)$$

(17)

$$\mathrm{Polynomial}: K\left({x}_{i},{x}_{s}\right)={(\gamma {x}_{i}^{T}{x}_{s}+r)}^{d}$$

(18)

where $\gamma$ and $r$ are hyperparameters, and $d$ specifies the degree of the polynomial kernel function. Nu-support vector regression (NuSVR) adopts a similar approach to ESVR with an additional Nu hyperparameter, which controls the number of support vectors.
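The four kernel functions of Eqs. (15) to (18) can be written as short Python functions. This is only an illustrative sketch: the function names, the two feature vectors, and the hyperparameter values are hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def linear_kernel(a, b):
    return dot(a, b)


def rbf_kernel(a, b, gamma):
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * squared_distance)


def sigmoid_kernel(a, b, gamma, r):
    return math.tanh(gamma * dot(a, b) + r)


def polynomial_kernel(a, b, gamma, r, d):
    return (gamma * dot(a, b) + r) ** d


# Two hypothetical samples with two scaled traits each.
x_i, x_s = [1.0, 2.0], [3.0, 4.0]

print(linear_kernel(x_i, x_s))                             # 11.0
print(rbf_kernel(x_i, x_s, gamma=0.5))                     # exp(-0.5 * 8)
print(sigmoid_kernel(x_i, x_s, gamma=0.1, r=0.0))          # tanh(1.1)
print(polynomial_kernel(x_i, x_s, gamma=1.0, r=1.0, d=2))  # 144.0
```

Setting `d=2` or `d=3` in the polynomial kernel gives the quadratic and cubic variants referred to as QP and CP in the results.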

Multilayer perceptron neural network

The MLPNNs, one of the most well-known forms of ANNs, comprise an input layer, one or more hidden layers, and an output layer (Fig. 2B). An MLPNN algorithm uses Eq. (19) as its loss function, which is minimized during training.

$$\mathrm{Loss}\left(\widehat{y},y,\theta \right)=\frac{1}{2m}\sum_{i=1}^{m}{({\widehat{y}}^{(i)}-{y}^{(i)})}^{2}+\frac{\alpha }{2m}\sum_{j=1}^{n}{\theta }_{j}^{2}$$

(19)

To compute $\widehat{y}$ in an MLP with $u$ neurons in the hidden layer and $z$ input features, Eq. (20) is used:

$$\widehat{y}=\sum_{j=1}^{u}{w}_{j}\cdot g\left(\sum_{i=1}^{z}{w}_{ji}{x}_{i}+{w}_{j0}\right)+{w}_{0}$$

(20)

where ${x}_{i}$ denotes the ${i}^{th}$ input feature, ${w}_{j}$ is the weight connecting the ${j}^{th}$ hidden neuron to the output neuron, ${w}_{ji}$ is the weight of the direct connection between input neuron $i$ and hidden neuron $j$, ${w}_{j0}$ represents the bias of the ${j}^{th}$ hidden neuron, ${w}_{0}$ denotes the bias of the output neuron, and $g$ is the activation function, which can be one of the following:

$$\mathrm{Identity}: g\left(x\right)=x$$

(21)

$$\mathrm{Logistic}: g\left(x\right)=\frac{1}{1+\exp(-x)}$$

(22)

$$\mathrm{Tanh}: g\left(x\right)=\tanh(x)$$

(23)

$$\mathrm{Relu}: g\left(x\right)=\max(0,x)$$

(24)
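The forward pass of Eq. (20) with the activation functions of Eqs. (21) to (24) can be sketched as follows for a single hidden layer. The function and argument names, the weights, and the input values are hypothetical and serve only to trace the computation.

```python
import math

# Activation functions of Eqs. (21)-(24).
ACTIVATIONS = {
    "identity": lambda x: x,
    "logistic": lambda x: 1.0 / (1.0 + math.exp(-x)),
    "tanh": math.tanh,
    "relu": lambda x: max(0.0, x),
}


def mlp_predict(x, hidden_weights, hidden_biases, output_weights, output_bias, activation):
    """Single-hidden-layer forward pass: weighted sum, activation, weighted sum."""
    g = ACTIVATIONS[activation]
    hidden = [
        g(sum(w * xi for w, xi in zip(w_j, x)) + b_j)
        for w_j, b_j in zip(hidden_weights, hidden_biases)
    ]
    return sum(w * h for w, h in zip(output_weights, hidden)) + output_bias


# Hypothetical network: two input features, two hidden neurons, one output.
x = [1.0, 2.0]
hidden_weights = [[0.5, -0.5], [1.0, 1.0]]  # one row of input weights per hidden neuron
hidden_biases = [0.0, -1.0]
output_weights = [2.0, 1.0]
output_bias = 0.5

# Hidden pre-activations are -0.5 and 2.0.
print(mlp_predict(x, hidden_weights, hidden_biases, output_weights, output_bias, "identity"))  # 1.5
print(mlp_predict(x, hidden_weights, hidden_biases, output_weights, output_bias, "relu"))      # 2.5
```

With the identity activation the network reduces to a linear function of the inputs, whereas ReLU zeroes the negative pre-activation and changes the output.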

Hyperparameter optimization

In order to find the optimized values of the hyperparameters, a cross-validation method was implemented. The training dataset was first shuffled and then randomly split into train (70%), and validation (30%) sets with 150 replications, and as a result, 150 independent train-validation sets were developed. To find the optimized value of a hyperparameter in an algorithm, we first set aside the validation sets. Then we trained algorithms on train sets using a range of values for a specific hyperparameter. The trained algorithms were applied to validation sets, and the average error of each hyperparameter value was calculated. Finally, the value with the minimum amount of error was considered as the optimized value of the hyperparameter.

As hyperparameter optimization of MLPNN algorithms is computationally intensive, a five-fold cross-validation was used to optimize the hyperparameters and also the numbers of hidden layers and neurons in each hidden layer of MLPNN algorithms. We first divided the training dataset into five groups (folds). We then fitted MLPNN algorithms using four folds and then applied the algorithm to the remaining fold, and measured the error. We repeated this procedure for each of the five folds in turn. Over the 5 folds, the optimized hyperparameters were selected based on the minimum average of error.
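The five-fold procedure and the final selection step can be sketched in plain Python. The helper names `k_fold_splits` and `select_best`, the seed, and the error values are hypothetical; in practice the mean errors would come from fitting an algorithm on each train part and scoring it on the matching validation part.

```python
import random


def k_fold_splits(n_samples, k=5, seed=0):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seed is an illustrative assumption
    folds = [indices[i::k] for i in range(k)]  # k folds of nearly equal size
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


def select_best(mean_errors):
    """Return the hyperparameter value with the lowest mean validation error."""
    return min(mean_errors, key=mean_errors.get)


# The training set described in the text contains 81 samples.
splits = list(k_fold_splits(n_samples=81, k=5))
print([len(validation) for _, validation in splits])  # [17, 16, 16, 16, 16]

# Hypothetical mean validation errors for three candidate values of one hyperparameter.
mean_errors = {0.01: 0.31, 0.1: 0.27, 1.0: 0.29}
print(select_best(mean_errors))  # 0.1
```

The same selection rule applies to the repeated random train-validation splits used for the other algorithms; only the way the splits are generated differs.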

Algorithm performance

The algorithm performance to predict desired output was calculated using three statistical quality parameters, including root mean square error (RMSE), mean absolute error (MAE), and determination coefficient (R²) as follows:

$$RMSE=\sqrt{\frac{\sum_{i=1}^{m}{({O}_{i}-{P}_{i})}^{2}}{m}}$$

(25)

$$MAE=\frac{1}{m} \sum_{i=1}^{m}\left|{O}_{i}-{P}_{i}\right|$$

(26)

$${R}^{2}={\left(\frac{\sum_{i=1}^{m}\left({O}_{i}-\overline{O }\right)\left({P}_{i}-\overline{P }\right)}{\sqrt{\sum_{i=1}^{m}{({O}_{i}-\overline{O })}^{2} \sum_{i=1}^{m}{({P}_{i}-\overline{P })}^{2}}}\right)}^{2}$$

(27)

where $m$ is the number of samples, ${O}_{i}$ are the observed values, ${P}_{i}$ are the predicted values, and the bar denotes the mean of the corresponding values.
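The three criteria can be computed with the short Python sketch below. R² is computed here as the squared Pearson correlation between observed and predicted values, which is one common definition and an assumption about the intended formula; the function names and the example values are hypothetical.

```python
import math


def rmse(observed, predicted):
    """Root mean square error (Eq. 25)."""
    m = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / m)


def mae(observed, predicted):
    """Mean absolute error (Eq. 26)."""
    m = len(observed)
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / m


def r_squared(observed, predicted):
    """Squared Pearson correlation between observed and predicted values (assumed form)."""
    m = len(observed)
    mean_o = sum(observed) / m
    mean_p = sum(predicted) / m
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    ss_o = sum((o - mean_o) ** 2 for o in observed)
    ss_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov ** 2 / (ss_o * ss_p)


# Hypothetical observed and predicted seed yields (illustrative values only).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]

print(rmse(observed, predicted))
print(mae(observed, predicted))  # 0.25
print(r_squared(observed, predicted))
```

Lower RMSE and MAE and a higher R² indicate better agreement between predicted and observed seed yield.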

Feature selection and sensitivity analysis of input features

Different methods, including principal component analysis (PCA), forward selection (FS), backward selection (BS), stepwise selection (SS) [46], Pearson correlation coefficient, and lasso [47], were used to reduce the number of yield-related traits and find the most effective traits that can explain the SY variance. Figure 3 presents a general illustration of the connection between the different stages of this study. A sensitivity analysis was also performed to study the effects of the independent traits on the output and to provide insight into the usefulness of individual traits. FS, BS, and SS were conducted using the caret (version 6.090) and leaps (version 3.1) packages in R (version 4.1), and the other feature selection methods, algorithm development, sensitivity analysis, and visualization were implemented in Python (version 3.7.7). Trait clustering was carried out via the cluster package (version 2.1.4) in R.

Results

Seed yield prediction using all measured traits

A total of 25 algorithms were developed and optimized to predict the SY of rapeseed. All measured yield-related traits were entered into the algorithms as inputs, and their performances were evaluated using R², RMSE, and MAE values (Tables 1, 2). The lowest RMSE and the highest R² values were achieved using the NuSVR algorithm with quadratic polynomial kernel function (NuSVR-QP) in both training and testing stages (Fig. 4A, B), followed by the MLPNN algorithm with tanh activation function (MLPNN-Tanh) in the training dataset and the NuSVR algorithm with cubic polynomial kernel function (NuSVR-CP) in the testing dataset. The lowest training MAE values were seen in the MLPNN algorithm with tanh and relu activation functions, in that order. The MLPNN algorithm with logistic activation function (MLPNN-Logistic) had the lowest testing MAE value (Fig. 4D), ahead of NuSVR-QP. The lowest accuracy was obtained by the ESVR algorithm with sigmoid kernel function (ESVR-Sigmoid) in all statistical criteria and in both training and testing datasets (Fig. 4E, F), followed by MLPNN-Logistic in the training stage and MLR in the testing stage. The predicted and measured values of SY in both training and testing datasets are presented as box plots to give a better view of the data distribution and of how effectively the algorithms predict SY (Fig. 5).

Table 1 The performance of the algorithms to predict the SY of rapeseed using all measured traits


Table 2 The performance of the MLPNNs to predict the SY of rapeseed using all measured traits


In the present study, the reduction in R² and the increase in RMSE and MAE from the training to the testing dataset of MLR (R²_Test–R²_Train = − 0.07, RMSE_Test–RMSE_Train = 0.082, MAE_Test–MAE_Train = 0.063) demonstrated that MLR is the most overfitted algorithm, followed by the GLM algorithm (R²_Test–R²_Train = − 0.04, RMSE_Test–RMSE_Train = 0.058, MAE_Test–MAE_Train = 0.049). The scatter plots of the MLR and GLM algorithms (Fig. 6A, B, E, F) also show that they fit very well in the training stage but could not repeat the same performance in the testing stage.

Feature selection and SY prediction using selected traits

In order to reduce the dimensions of the data and find the most important variables in predicting SY in rapeseed genotypes, 6 different feature selection methods including Pearson correlation coefficient, principal component analysis (PCA), stepwise selection (SS), forward selection (FS), backward selection (BS), and lasso were used in this study. To avoid overfitting in the SS, FS, and BS methods, leaps and caret packages in R with a five-fold cross-validation were employed to create 10 trait subsets. The first subset included the first trait selected by each method, and in the following subsets, one trait was added to the previous trait(s). Based on the R², RMSE and MAE values of the cross-validation stage, the best subsets were achieved using PP, FPH, and DPM in the SS and BS methods and PP, PH, and DPM in the FS method (Table 3).

Table 3 The output of stepwise selection, forward selection, and backward selection methods


The lasso method effectively reduces the number of features by giving zero coefficients to less important variables; applying it led to Eq. (28):

$$SY = 0.736 + 0.608\,PH + 2.055\,PP + 0.409\,DPM$$

(28)

where SY is seed yield, PH is plant height, PP is pods per plant, and DPM is days to physiological maturity. The FS and lasso methods therefore selected the same traits.
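As a worked illustration, Eq. (28) can be evaluated for one genotype with a few lines of Python. The function name `predict_sy_lasso` and the input values are hypothetical, and the sketch assumes that the three traits are supplied on the same preprocessed scale that was used to fit the equation, so the result is on the corresponding SY scale.

```python
def predict_sy_lasso(ph, pp, dpm):
    """Evaluate Eq. (28): SY = 0.736 + 0.608 PH + 2.055 PP + 0.409 DPM."""
    return 0.736 + 0.608 * ph + 2.055 * pp + 0.409 * dpm


# Hypothetical preprocessed trait values for one genotype (not data from the study).
sy = predict_sy_lasso(ph=0.5, pp=0.5, dpm=0.5)
print(round(sy, 3))  # 2.272
```

Because the coefficient of PP is the largest, a given change in pods per plant shifts the predicted SY more than the same change in PH or DPM on this scale.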

Since having 3 traits in all variable selection methods could enable us to compare the methods with the same number of variable subsets, three traits were also selected in Pearson correlation coefficient and PCA methods. The results of the Pearson correlation coefficient showed that PP, PAB, and SD had the highest positive correlations with SY of rapeseed genotypes (Fig. 7). PP, PAB, and BP were the selected traits based on PCA results (Table 4).

Table 4 Principal component analysis of yield-related traits in rapeseed genotypes


The traits given by the feature selection methods were applied as inputs to the algorithms developed in the “Seed yield prediction using all measured traits” section to estimate the power of the feature selection methods and find the most compatible algorithms for predicting the SY of rapeseed genotypes using fewer traits. Additional file 1 displays the performance of the algorithms using the traits obtained from each feature selection method, and a summary is presented in Table 5. The best training performance was seen in the NuSVR algorithm with RBF kernel function and SS/BS methods (NuSVR-RBF-SS/BS) (Fig. 8C). Also, using the same algorithm with lasso/FS methods (NuSVR-RBF-lasso/FS) resulted in the lowest MAE in the testing dataset (Fig. 8D). The highest R² value of the testing dataset was seen in the MLPNN algorithm with identity activation function and SS/BS methods (MLPNN-Identity-SS/BS) (Fig. 8B). Using SS/BS methods along with three algorithms, namely GLM and MLPNN with tanh and identity activation functions, gave the same lowest testing RMSE (Table 5). The ESVR algorithm with cubic polynomial kernel function and SS/BS methods (ESVR-CP-SS/BS) had the worst performance in all three statistical criteria of both training and testing datasets (Fig. 8E, F). A comparative box plot in Fig. 9 shows the clear difference between the performance of the algorithms.

Table 5 The performance of machine learning algorithms using selected traits by feature selection methods as inputs


Some algorithms performed differently depending on whether all measured traits or selected traits were used as inputs. For instance, NuSVR and ESVR algorithms with QP and CP kernel functions performed well when all measured traits were used as inputs; however, applying traits selected by feature selection methods led to lower performance (Fig. 10). Nevertheless, there was no noticeable difference in the performance of NuSVR and ESVR algorithms with linear kernel function, nor in LSVR algorithms, when all measured traits or selected traits were applied as inputs (Fig. 11). Likewise, using all measured traits or traits selected by feature selection methods as inputs did not significantly affect the performance of the regularized linear algorithms (ridge, BRR, ARD, and SGD) (Fig. 12). Compared to using all measured traits as inputs, the MLPNN algorithm with identity, tanh, and relu activation functions demonstrated better testing performance when traits selected by the SS, FS, BS, and lasso methods were entered as inputs (Fig. 13).

To assess the efficiency of the feature selection methods and compare them with using all measured traits as inputs, the mean performance of the algorithms using all measured traits and using traits selected by each feature selection method was calculated in both training and testing stages (Table 6). According to the results, using all measured traits as inputs to predict the SY of rapeseed genotypes resulted in the highest R² value and the lowest RMSE and MAE. Among the feature selection methods, the best performance in all 3 statistical criteria was achieved using the lasso and FS methods in both training and testing datasets, while PCA exhibited the worst. Moreover, based on the testing R² and RMSE values, the most efficient algorithms with traits selected by correlation and PCA as inputs ranked thirty-fifth and forty-fifth among all combinations of the algorithms and feature selection methods, respectively (Additional file 1).

Table 6 The mean of R², RMSE and MAE values of machine learning algorithms with different inputs

Sensitivity analysis

To find the most important input traits affecting the SY of rapeseed, sensitivity analysis was conducted using the MLPNN algorithm with identity activation function, the NuSVR algorithm with quadratic kernel function, and the MLR algorithm. In all three algorithms, the highest RMSE and MAE and the lowest R² were obtained when DPM was excluded (Table 7). PP was also among the top four traits whose elimination from the three algorithms increased RMSE and MAE and reduced R². Figure 14 shows the status of high- and low-yielding genotypes with respect to the DPM and PP traits.
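A leave-one-trait-out sensitivity analysis of this kind can be sketched as below. The sketch rests on several assumptions that are not taken from the study: the data are synthetic, the trait names are placeholders, the model is ordinary least squares fitted with a small hand-written solver, and the error is computed on the fitting data for brevity.

```python
import math
import random


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(rows, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    x = [[1.0] + list(r) for r in rows]
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(x, y)) for i in range(k)]
    return solve(xtx, xty)


def predict(coef, rows):
    return [coef[0] + sum(c * v for c, v in zip(coef[1:], r)) for r in rows]


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def sensitivity(rows, y, names):
    """RMSE of the model refitted without each trait in turn."""
    result = {}
    for j, name in enumerate(names):
        reduced = [[v for i, v in enumerate(r) if i != j] for r in rows]
        result[name] = rmse(y, predict(fit_ols(reduced, y), reduced))
    return result


# Synthetic data: trait_a matters most, trait_b a little, trait_c not at all.
rng = random.Random(0)
names = ["trait_a", "trait_b", "trait_c"]
rows = [[rng.uniform(-1, 1) for _ in names] for _ in range(60)]
y = [3.0 * a + 0.5 * b + rng.gauss(0, 0.1) for a, b, c in rows]

result = sensitivity(rows, y, names)
most_influential = max(result, key=result.get)  # largest error when removed
```

The trait whose removal produces the largest error is read as the most influential one, which is the logic applied to DPM and PP above.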

Table 7 Sensitivity analysis of the input features on the seed yield of rapeseed

Discussion

Increasing SY has always been a central objective in breeding programs [12]. However, assessing SY in large populations of diverse genotypes is a laborious and time-consuming task [13, 14]. Due to the intricate interaction of genetic and environmental factors, seed yield breeding is a complex and nonlinear process [15, 16]. Consequently, breeders have adopted strategies that employ secondary traits closely associated with the primary trait to efficiently identify promising genotypes at early growth stages [17]. While conventional statistical methods have been widely used in rapeseed research to explore the relationships between SY and other traits, their assumption of linear relationships falls short in capturing the interactions and highly nonlinear associations between SY and other traits [18,19,20,21,22]. In contrast, the application of machine learning algorithms has proven effective in optimizing and predicting complex biological systems and, therefore, can be employed to facilitate more precise yield prediction and enhance the efficiency of breeding programs [23, 24].

Polynomial kernels of SVR algorithms: efficient tools for SY prediction using all traits as inputs

SY is a quantitative and complex trait with a nonlinear and complicated relationship with other yield-related traits [9, 22]. Linear algorithms cannot fully capture the relationship between SY and other traits. Nonlinear methods such as polynomial regression can be a solution to this issue. Polynomial regression includes polynomial terms (quadratic, cubic, etc.) in the regression equation, thereby creating new combinatorial features and allowing nonlinear models to be learned [48]. However, polynomial regression can be slow and computationally intensive [35]. To address that, polynomial kernel functions can be employed in the SVR algorithms; these perform their operations in the original dimension without adding any combinatorial features and are consequently much more computationally efficient [35]. In the present study, the NuSVR and ESVR algorithms with the QP and CP kernel functions were the four most efficient algorithms in the testing stage based on R² and RMSE values (Table 1, Fig. 5B), which demonstrates the high capability of SVR algorithms combined with polynomial kernel functions to predict a complex trait such as SY in rapeseed.
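The difference between adding polynomial features and using a polynomial kernel can be made concrete with a small sketch. It assumes a quadratic kernel of the form (x · z + 1)², which is one common parameterization and not necessarily the exact setting used in the study; the input vectors are arbitrary.

```python
import math
from itertools import combinations


def quadratic_kernel(x, z):
    """Quadratic polynomial kernel with unit offset: (x . z + 1) ** 2."""
    return (sum(a * b for a, b in zip(x, z)) + 1.0) ** 2


def quadratic_features(x):
    """Explicit feature map whose dot product reproduces quadratic_kernel."""
    root2 = math.sqrt(2.0)
    feats = [1.0]                                   # constant term
    feats += [root2 * a for a in x]                 # linear terms
    feats += [a * a for a in x]                     # squared terms
    feats += [root2 * x[i] * x[j]                   # pairwise cross terms
              for i, j in combinations(range(len(x)), 2)]
    return feats


x = [0.5, -1.0, 2.0]
z = [1.5, 0.25, -0.5]

# Kernel evaluated in the original three dimensions.
k_implicit = quadratic_kernel(x, z)

# Same value obtained by first expanding both vectors into ten features.
k_explicit = sum(a * b for a, b in zip(quadratic_features(x), quadratic_features(z)))

# Size of the explicit quadratic expansion for fifteen input traits.
n_expanded_15_traits = len(quadratic_features([0.0] * 15))
```

Both routes give the same similarity value, but the kernel never builds the expanded vector. With fifteen traits the explicit quadratic expansion already has 136 columns, and the count grows quickly for the cubic case.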

Hyperparameter optimization: the first approach to avoid overfitting

Overfitting is one of the major issues in machine learning; it occurs when an algorithm fails to generalize from observed data to new data. An overfitted algorithm performs almost flawlessly on the training set while fitting poorly on the testing set [49]. The MLR and GLM algorithms with all measured traits as inputs appeared to be the most overfitted algorithms in this study (Table 1, Fig. 6A, B, E, F). Training an algorithm involves hyperparameter optimization. Well-optimized hyperparameters strike a good balance between training accuracy and regularization and thereby limit overfitting. Regularization-based algorithms help to distinguish noise, meaningful features, and meaningless features, and assign different weights to them [49,50,51]. In this study, MLR was the only algorithm without any hyperparameters, and hyperparameter optimization led to better performance in the rest of the algorithms. Consequently, using regularization-based algorithms with hyperparameter optimization can be a solution to overcome overfitting in rapeseed SY prediction. An important practical advantage of these results is that they reduce the time required to optimize predictive algorithms and can therefore expedite rapeseed breeding programs.
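The trade-off between training accuracy and regularization can be illustrated with a deliberately small example: a one-trait ridge regression without intercept, tuned by a grid search on a hold-out set. The data, the penalty grid, and the hold-out design are all assumptions made for illustration; the tuning procedure of the study is not described in this section.

```python
import math


def fit_ridge_1d(x, y, lam):
    """Slope of a no-intercept ridge fit: sum(x*y) / (sum(x*x) + lam)."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + lam)


def rmse(x, y, w):
    """RMSE of the predictions w * x against y."""
    return math.sqrt(sum((b - w * a) ** 2 for a, b in zip(x, y)) / len(x))


# Made-up centred data; the training set is deliberately tiny and noisy.
x_train, y_train = [-1.0, 0.0, 1.0], [-1.6, 0.3, 1.3]
x_val, y_val = [-2.0, -0.5, 0.5, 2.0], [-2.0, -0.4, 0.5, 2.1]

grid = [0.0, 0.1, 0.5, 1.0, 5.0]  # hypothetical candidate penalty strengths

results = []
for lam in grid:
    w = fit_ridge_1d(x_train, y_train, lam)
    results.append((lam, w, rmse(x_train, y_train, w), rmse(x_val, y_val, w)))

# Keep the penalty with the lowest validation error, not the lowest training error.
best_lam = min(results, key=lambda r: r[3])[0]
```

With a zero penalty the fit is ordinary least squares and has the lowest training error. Increasing the penalty shrinks the slope and raises the training error, but here a moderate penalty gives the lowest validation error, which is the balance the paragraph above describes.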

Feature selection

Stepwise selection is widely used to find the most important traits related to SY in plant breeding. However, discovering the best subset of traits is problematic because all subset regression methods (SS, FS, and BS) are in-sample methods for assessing and tuning models. Consequently, model selection may suffer from overfitting (fitting the noise in the data) and may not perform as well on new data [48]. To avoid this, we validated the models using cross-validation. In accordance with the results of the SS, BS, and FS methods (Table 3), previous studies using stepwise regression have demonstrated that pods per plant, growth duration, and pods on the main raceme [52], as well as pods per plant, number of branches, and duration of flowering [21], had significant effects on SY in rapeseed genotypes. The results of the correlation analysis (Fig. 7) are similar to those of other studies that reported a positive and significant correlation between SY and pods per plant [18,19,20, 53,54,55,56,57], branch number [18, 55, 58, 59], and plant height [18, 54, 58] in rapeseed genotypes. Branches per plant and pods per plant were also reported as the effective traits in the first principal component associated with the yield of rapeseed accessions [19]. TSW and SP were not selected by any feature selection method and also showed a negative correlation with SY (Fig. 7), which indicates that they are not suitable indirect criteria for rapeseed SY breeding. Similar to our results, some studies reported a negative correlation between SY and TSW [15, 52, 54, 57, 59] and SP [15, 55].
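One way to keep subset selection from being purely in-sample is to score every candidate subset by cross-validated error. The sketch below shows forward selection done this way. It is an illustration under stated assumptions rather than the procedure of the study: the data are synthetic, the model is ordinary least squares, the folds are interleaved, and classical stepwise regression would typically use significance tests or information criteria instead.

```python
import math
import random


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(rows, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    x = [[1.0] + list(r) for r in rows]
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(x, y)) for i in range(k)]
    return solve(xtx, xty)


def predict(coef, rows):
    return [coef[0] + sum(c * v for c, v in zip(coef[1:], r)) for r in rows]


def cv_rmse(rows, y, cols, k=5):
    """k-fold cross-validated RMSE of an OLS model restricted to columns cols."""
    n = len(y)
    sq_err = 0.0
    for fold in range(k):
        test = list(range(fold, n, k))          # interleaved held-out fold
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        coef = fit_ols([[rows[i][c] for c in cols] for i in train],
                       [y[i] for i in train])
        pred = predict(coef, [[rows[i][c] for c in cols] for i in test])
        sq_err += sum((y[i] - p) ** 2 for i, p in zip(test, pred))
    return math.sqrt(sq_err / n)


def forward_select(rows, y, n_keep, k=5):
    """Greedily add the column that most reduces the cross-validated RMSE."""
    chosen, remaining = [], list(range(len(rows[0])))
    while len(chosen) < n_keep:
        best = min(remaining, key=lambda c: cv_rmse(rows, y, chosen + [c], k))
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Synthetic data: only columns 0 and 3 carry signal.
rng = random.Random(1)
rows = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(100)]
y = [2.0 * r[0] - 1.5 * r[3] + rng.gauss(0, 0.3) for r in rows]

selected = forward_select(rows, y, n_keep=2)
```

Because each candidate is judged on held-out folds, a column that only fits noise in the training rows gains little, which is the protection against overfitting that the paragraph above refers to.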

Our findings suggest that correlation and PCA are not efficient methods for finding proper indirect selection criteria for the SY of rapeseed (Table 6). To provide a better understanding of how the traits were selected by the feature selection methods, the measured traits were clustered using the Euclidean distance and Ward's method (Fig. 15). The results showed that all traits selected by the correlation and PCA methods were in the first cluster, while SS, BS, FS, and lasso chose traits from three different clusters, which resulted in more efficient performance. Failing to consider the combined effects of the traits could be one of the factors behind the inefficiency of the correlation and PCA methods. Unlike these two methods, the SS, BS, and lasso methods take the combined effect of features into account and choose the combination with the best fit [35, 46].

Feature selection: the second approach to avoid overfitting

Comparing the results in Additional file 1 and Table 5 with the data in Table 1 shows that feature selection methods could positively affect the overfitted algorithms. Compared to using all measured traits as inputs, applying the traits selected by feature selection methods reduced the amount of overfitting in the MLR algorithm and dramatically improved the testing performance of the GLM algorithm, which became one of the best testing results. This indicates that the performance of these algorithms improves when fewer inputs are used (Fig. 6C, D, G, H).

Evaluating algorithms with all and selected traits: the influence of feature selection

Although using all measured traits as inputs in the NuSVR and ESVR algorithms with QP and CP kernel functions led to efficient performance (Table 1, Fig. 5), applying traits selected by feature selection methods reduced their performance (Fig. 10). This suggests that the complexity of polynomial algorithms is helpful when the data are multidimensional and a nonlinear, complex relationship exists between the dependent and independent variables. Nonetheless, the RBF kernel function in NuSVR and the linear kernel function in ESVR performed effectively with traits selected by feature selection (Table 5). Therefore, one of the benefits of the NuSVR and ESVR algorithms is their ability to work with different kernel functions, which makes them flexible with different inputs. In contrast to the polynomial kernel functions, no considerable difference was seen in the performance of the NuSVR and ESVR algorithms with the linear kernel function, or of the LSVR algorithms, between all measured traits and selected traits as inputs (Fig. 11). Similarly, the performance of the other regularized linear algorithms (ridge, BRR, ARD, and SGD) did not change significantly between all measured traits and traits selected by feature selection methods (Fig. 12). One of the major advantages of regularized linear algorithms is their ability to systematically weight the more important features during training [60]; they therefore show relatively similar performance with or without feature selection.

Using all measured traits as inputs to the MLPNN algorithms with identity, tanh, and relu activation functions caused overfitting, while reducing the inputs by applying feature selection methods prevented or significantly reduced it (Fig. 13). Furthermore, these algorithms showed better testing performance with traits selected by the SS, FS, BS, and lasso methods than with all measured traits (Fig. 13). Previous studies [61, 62] have also noted the crucial role of feature selection in the performance of neural networks and in removing the overfitting effect. Comparing the MLPNNs with the other algorithms when traits selected by feature selection methods were used indicated that the MLPNNs were more efficient than the other algorithms with a smaller number of traits (Table 5). Moreover, the performance of MLPNN-Identity with the traits obtained from the SS and BS methods was only insignificantly lower than that of the most efficient algorithm using all measured traits as inputs (NuSVR-QP) (Tables 1, 5), which shows that the combination of MLPNN-Identity with the SS and BS methods is an efficient approach for precise SY prediction using a much smaller number of traits (three instead of fifteen). This can greatly help breeders to select high-performing plants simply and effectively in rapeseed SY breeding programs, since direct selection, or indirect selection via many traits, is practically impossible when a breeding program involves thousands of genotypes. While this paper focuses on one specific type of artificial neural network, the MLPNN, a diverse range of ANN algorithms exists beyond those presented here. Deep neural network genomic prediction (DNNGP) is a notable example, particularly in plant genomic prediction, where it has recently been used with great success [63].
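As a side note on what an identity activation implies in general: a multilayer perceptron whose hidden units apply the identity function computes W2 (W1 x + b1) + b2, which is the same as a single linear map with weights W2 W1 and bias W2 b1 + b2. The sketch below checks this with hypothetical weights for three inputs, two hidden units, and one output. It says nothing about how the networks in the study were trained or regularized, which can still make their fitted coefficients differ from those of MLR.

```python
def matvec(w, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in w]


def matmul(a, b):
    """Matrix-matrix product for matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in a]


def mlp_identity(x, w1, b1, w2, b2):
    """One hidden layer with identity activation: W2 (W1 x + b1) + b2."""
    hidden = [h + b for h, b in zip(matvec(w1, x), b1)]
    return [o + b for o, b in zip(matvec(w2, hidden), b2)]


# Hypothetical weights: 3 inputs -> 2 hidden units -> 1 output.
w1 = [[0.5, -1.0, 0.25], [1.5, 0.0, -0.5]]
b1 = [0.1, -0.2]
w2 = [[2.0, -1.0]]
b2 = [0.3]

# Equivalent single linear layer.
w_eff = matmul(w2, w1)                                  # W2 W1
b_eff = [v + b for v, b in zip(matvec(w2, b1), b2)]     # W2 b1 + b2

x = [1.0, 2.0, -1.0]
y_network = mlp_identity(x, w1, b1, w2, b2)
y_linear = [v + b for v, b in zip(matvec(w_eff, x), b_eff)]
```

Both computations return the same prediction for any input, so an identity-activation network has the representational capacity of a linear model.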

Indirect selection criteria

The results of the sensitivity analysis (Table 7) were fully consistent with the results of feature selection, since DPM and PP were the traits common to SS, FS, BS, and lasso, the efficient feature selection methods. Rapeseed genotypes can be divided into two almost distinct groups, with high-yielding genotypes having more pods per plant and a longer time to physiological maturity than low-yielding genotypes (Fig. 14), which is another indication that selection based on these traits can be effective in developing rapeseed varieties with higher SY. Comparing the results of sensitivity analysis and feature selection also indicated that DPM and PP, along with PH or FPH, form the most important combination of traits affecting the SY of rapeseed and can therefore be considered the most important indirect indicators in breeding programs to increase rapeseed SY. Many studies have noted the direct positive effect of pods per plant on SY [19,20,21, 54, 59]. Increasing the number of pods per plant is the strategy that rapeseed plants employ to enhance SY, rather than increasing the number or weight of seeds per pod [15]. Likewise, nitrogen availability increases the SY of rapeseed by producing more pods rather than by influencing seed or pod weight [15, 64]. The direct positive effect of plant height on SY has also been reported [20, 59]. Plant height may additionally contribute to SY indirectly through PP, because a taller plant usually has more pods and thus a higher yield [18]. Habekotté [65] reported that delayed maturity contributed to increased SY and that crops with high SY potential matured late. Similarly, Luo et al. [18] observed a direct connection between maturity time and SY in some of their experiments.

Conclusion

Nonlinear and complex relations between SY and yield-related traits are one of the main issues that have limited the application of conventional multivariate models for finding the most effective traits for indirect selection. Regression-based machine learning algorithms, along with feature selection methods, can provide a robust solution for accurate SY prediction and for identifying effective indirect selection criteria. To that end, different regression-based machine learning algorithms and feature selection methods were used in the present study. The NuSVR and ESVR algorithms with polynomial kernel functions had the best performance when all the measured yield-related traits were used as inputs to predict the SY of rapeseed, revealing the high potential of SVR algorithms for interpreting the nonlinear relations of multidimensional data in complex biological processes. Although polynomial kernels are not suitable when fewer features enter the SVR algorithms as inputs, the RBF (with NuSVR) and linear (with ESVR) kernel functions performed effectively with traits selected by feature selection, which shows the flexibility of NuSVR and ESVR to work efficiently with different inputs. Employing feature selection methods to find the traits most affecting SY and using the selected features as inputs showed that the MLPNN algorithm with identity activation function is the most efficient algorithm and the one most compatible with traits selected by the SS and BS methods. MLPNNs are well-known and powerful algorithms; however, they are sensitive to the input variables, and employing them together with proper feature selection methods results in efficient performance. Regularized linear algorithms are effective in overcoming overfitting, one of the main issues in regression, and are also capable of maintaining stable performance with either numerous or selected features as inputs.
According to the results of the feature selection methods and sensitivity analysis, DPM, PP, and PH or FPH were the most important traits affecting the SY of rapeseed. Because optimizing and finding the most efficient algorithms for predicting complex biological processes is a time-consuming and challenging procedure, the optimized algorithms of this study can be used to make SY breeding programs for rapeseed, one of the most important oil crops, quicker and more efficient.

Availability of data and materials

The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.

Abbreviations

SY: Seed yield
PH: Plant height
PMB: Pods per main branch
PAB: Pods per axillary branches
PP: Pods per plant
BP: Branches per plant
MBL: Main branch length
FPH: First pod height from the ground
PL: Pod length
DSF: Days to start of flowering
DEF: Days to end of flowering
DPM: Days to physiological maturity
FP: Flowering period
TSW: Thousand seed weight
SP: Seeds per pod
SD: Stem diameter
RMSE: Root mean square error
MAE: Mean absolute error
MLR: Multiple linear regression
RR: Ridge regression
BRR: Bayesian ridge regression
ARD: Automatic relevance determination
GLM: Generalized linear model
SGD: Stochastic gradient descent
NuSVR: Nu-support vector regression
ESVR: Epsilon support vector regression
LSVR: Linear support vector regression
MLPNN: Multilayer perceptron neural network
ANN: Artificial neural network
SVM: Support vector machine
SVR: Support vector regression
RBF: Radial basis function
QP: Quadratic polynomial
CP: Cubic polynomial
PCA: Principal component analysis
SS: Stepwise selection
FS: Forward selection
BS: Backward selection

References

Raza A. Eco-physiological and biochemical responses of rapeseed (Brassica napus L) to abiotic stresses: consequences and mitigation strategies. J Plant Growth Regul. 2021;40(4):1368–88. https://doi.org/10.1007/s00344-020-10231-z.
Hu D, Jing J, Snowdon RJ, Mason AS, Shen J, Meng J, et al. Exploring the gene pool of Brassica napus by genomics-based approaches. Plant Biotechnol J. 2021;19(9):1693–712. https://doi.org/10.1111/pbi.13636.
Spasibionek S, Mikołajczyk K, Ćwiek-Kupczyńska H, Piętka T, Krótka K, Matuszczak M, et al. Marker assisted selection of new high oleic and low linolenic winter oilseed rape (Brassica napus L) inbred lines revealing good agricultural value. PLoS One. 2020;15(6):e0233959. https://doi.org/10.1371/journal.pone.0233959.
Niedbała G. Application of artificial neural networks for multi-criteria yield prediction of winter rapeseed. Sustainability. 2019;11(2):533. https://doi.org/10.3390/su11020533.
Tian HY, Channa SA, Hu SW. Relationships between genetic distance, combining ability and heterosis in rapeseed (Brassica napus L). Euphytica. 2017;213(1):1–11. https://doi.org/10.1007/s10681-016-1788-x.
Wang T, Wei L, Wang J, Xie L, Li YY, Ran S, et al. Integrating GWAS, linkage mapping and gene expression analyses reveals the genetic control of growth period traits in rapeseed (Brassica napus L). Biotechnol Biofuel. 2020;13(1):1–19. https://doi.org/10.1186/s13068-020-01774-0.
Hossain Z, Johnson EN, Wang L, Blackshaw RE, Cutforth H, Gan Y. Plant establishment, yield and yield components of Brassicaceae oilseeds as potential biofuel feedstock. Indust Crops Product. 2019;141:111800. https://doi.org/10.1016/j.indcrop.2019.111800.
Abbadi A, Leckband G. Rapeseed breeding for oil content, quality, and sustainability. Eur J Lipid Sci Technol. 2011;113(10):1198–206. https://doi.org/10.1002/ejlt.201100063.
Raboanatahiry N, Chao H, Dalin H, Pu S, Yan W, Yu L, et al. QTL alignment for seed yield and yield related traits in Brassica napus. Front Plant Sci. 2018;9:1127. https://doi.org/10.3389/fpls.2018.01127.
D-h F, L-y J, Mason AS, M-l X, L-r Z, L-z L, et al. Research progress and strategies for multifunctional rapeseed: a case study of China. J Integ Agric. 2016;15(8):1673–84. https://doi.org/10.1016/S2095-3119(16)61384-9.
Yahya M, Dutta A, Bouri E, Wadström C, Uddin GS. Dependence structure between the international crude oil market and the European markets of biodiesel and rapeseed oil. Renewable Energy. 2022;197:594–605. https://doi.org/10.1016/j.renene.2022.07.112.
Niazian M, Niedbała G. Machine learning for plant breeding and biotechnology. Agriculture. 2020;10(10):436. https://doi.org/10.3390/agriculture10100436.
Cai G, Yang Q, Chen H, Yang Q, Zhang C, Fan C, et al. Genetic dissection of plant architecture and yield-related traits in Brassica napus. Sci Rep. 2016;6(1):1–16. https://doi.org/10.1038/srep21625.
Xiong Q, Tang G, Zhong L, He H, Chen X. Response to nitrogen deficiency and compensation on physiological characteristics, yield formation, and nitrogen utilization of rice. Front Plant Sci. 2018;9:1075. https://doi.org/10.3389/fpls.2018.01075.
Bennett EJ, Brignell CJ, Carion PW, Cook SM, Eastmond PJ, Teakle GR, et al. Development of a statistical crop model to explain the relationship between seed yield and phenotypic diversity within the Brassica napus genepool. Agronomy. 2017;7(2):31. https://doi.org/10.3390/agronomy7020031.
Collins NC, Tardieu F, Tuberosa R. Quantitative trait loci and crop performance under abiotic stress: where do we stand? Plant Physiol. 2008;147(2):469–86. https://doi.org/10.1104/pp.108.118117.
Yoosefzadeh-Najafabadi M, Earl HJ, Tulpan D, Sulik J, Eskandari M. Application of machine learning algorithms in plant breeding: predicting yield from hyperspectral reflectance in soybean. Frontiers Plant Sci. 2021;11:624273. https://doi.org/10.3389/fpls.2020.624273.
Luo Z, Wang M, Long Y, Huang Y, Shi L, Zhang C, et al. Incorporating pleiotropic quantitative trait loci in dissection of complex traits: seed yield in rapeseed as an example. Theor Appl Genet. 2017;130(8):1569–85. https://doi.org/10.1007/s00122-017-2911-7.
Chen B, Xu K, Li J, Li F, Qiao J, Li H, et al. Evaluation of yield and agronomic traits and their genetic variation in 488 global collections of Brassica napus L. Genet Resour Crop Evol. 2014;61(5):979–99. https://doi.org/10.1007/s10722-014-0091-8.
Kumar S, Ali B, Khaldun A, Islam SS, Uddin MS, Akanda ML, et al. Genetic diversity, correlations and path coefficient analysis among the F5 populations of Brassica Species. Asian J Adv Agric Res. 2021;16(2):20–31. https://doi.org/10.9734/ajaar/2021/v16i230170.
Rameeh V. Multivariate analysis of some important quantitative traits in rapeseed (Brassica napus) advanced lines. J Oilseed Brassica. 2016;1(2):162–9.
Abdipour M, Younessi-Hmazekhanlu M, Ramazani SHR. Artificial neural networks and multiple linear regression as potential methods for modeling seed yield of safflower (Carthamus tinctorius L). Indust Crops Products. 2019;127:185–94. https://doi.org/10.1016/j.indcrop.2018.10.050.
Hesami M, Pepe M, Monthony AS, Baiton A, Jones AMP. Modeling and optimizing in vitro seed germination of industrial hemp (Cannabis sativa L). Indust Crops Product. 2021;170:113753. https://doi.org/10.1016/j.indcrop.2021.113753.
Niazian M, Sadat-Noori SA, Abdipour M. Modeling the seed yield of Ajowan (Trachyspermum ammi L) using artificial neural network and multiple linear regression models. Indust Crops Product. 2018;117:224–34. https://doi.org/10.1016/j.indcrop.2018.03.013.
Hesami M, Naderi R, Tohidfar M, Yoosefzadeh-Najafabadi M. Development of support vector machine-based model and comparative analysis with artificial neural network for modeling the plant tissue culture procedures: effect of plant growth regulators on somatic embryogenesis of chrysanthemum, as a case study. Plant Methods. 2020;16(1):1–15. https://doi.org/10.1186/s13007-020-00655-9.
Sheikhi A, Mirdehghan SH, Arab MM, Eftekhari M, Ahmadi H, Jamshidi S, et al. Novel organic-based postharvest sanitizer formulation using box Behnken design and mathematical modeling approach: a case study of fresh pistachio storage under modified atmosphere packaging. Postharvest Biol Technol. 2020;160:111047.
Chang C-C, Lin C-J. LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol (TIST). 2011;2(3):1–27. https://doi.org/10.1145/1961189.1961199.
Moravej M, Amani P, Hosseini-Moghari S-M. Groundwater level simulation and forecasting using interior search algorithm-least square support vector regression (ISA-LSSVR). Groundwater Sustain Develop. 2020;11:100447. https://doi.org/10.1016/j.gsd.2020.100447.
Zhou Z, Morel J, Parsons D, Kucheryavskiy SV, Gustavsson A-M. Estimation of yield and quality of legume and grass mixtures using partial least squares and support vector machine analysis of spectral data. Computers Electron Agric. 2019;162:246–53. https://doi.org/10.1016/j.compag.2019.03.038.
Khosla E, Dharavath R, Priya R. Crop yield prediction using aggregated rainfall-based modular artificial neural networks and support vector regression. Environ Dev Sustain. 2020;22(6):5687–708. https://doi.org/10.1007/s10668-019-00445-x.
Fan J, Zheng J, Wu L, Zhang F. Estimation of daily maize transpiration using support vector machines, extreme gradient boosting, artificial and deep neural networks models. Agric Water Manag. 2021;245:106547. https://doi.org/10.1016/j.agwat.2020.106547.
Niedbała G, Piekutowska M, Weres J, Korzeniewicz R, Witaszek K, Adamski M, et al. Application of artificial neural networks for yield modeling of winter rapeseed based on combined quantitative and qualitative data. Agronomy. 2019;9(12):781. https://doi.org/10.3390/agronomy9120781.
Niedbała G. Simple model based on artificial neural network for early prediction and simulation winter rapeseed yield. J Integr Agric. 2019;18(1):54–61. https://doi.org/10.1016/S2095-3119(18)62110-0.
Ogasawara E, Martinez LC, De Oliveira D, Zimbrão G, Pappa GL, Mattoso M, editors. Adaptive normalization: A novel data normalization approach for non-stationary time series. The 2010 International Joint Conference on Neural Networks (IJCNN); 2010: IEEE.
Géron A. Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow: Concepts, tools, and techniques to build intelligent systems. O'Reilly Media; 2019.
Yeo IK, Johnson RA. A new family of power transformations to improve normality or symmetry. Biometrika. 2000;87(4):954–9.
Perlich C, Provost F, Simonoff J. Tree induction vs. logistic regression: A learning-curve analysis. J Machine Learning Res. 2003.
Darlington RB, Hayes AF. Regression analysis and linear models: concepts, applications, and implementation. Guilford: Guilford Publications; 2016.
Tsuruoka Y, Tsujii Ji, Ananiadou S, editors. Stochastic gradient descent training for l1-regularized log-linear models with cumulative penalty. Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP. 2009.
McCullagh P, Nelder J. Generalized linear models. 2nd ed. London: Chapman & Hall; 1989.
Jørgensen B. The theory of exponential dispersion models and analysis of deviance. IMPA. 1992.
Bishop CM, Nasrabadi NM. Pattern recognition and machine learning. Berlin: Springer; 2006.
MacKay DJ. Bayesian interpolation. Neural Comput. 1992;4(3):415–47. https://doi.org/10.1162/neco.1992.4.3.415.
Tipping ME. Sparse Bayesian learning and the relevance vector machine. J Machine Learn Res. 2001;1:211–44.
Wipf D, Nagarajan S. A new view of automatic relevance determination. Adv Neural Inf Process Syst. 2007;49(1):641.
Andersen CM, Bro R. Variable selection in regression—a tutorial. J Chemometrics. 2010;24(11–12):728–37. https://doi.org/10.1002/cem.1360.
Tibshirani R. Regression shrinkage and selection via the lasso. J Roy Stat Soc: Ser B (Methodol). 1996;58(1):267–88. https://doi.org/10.1111/j.2517-6161.1996.tb02080.x.
Bruce P, Bruce A, Gedeck P. Practical statistics for data scientists: 50+ essential concepts using R and Python. O'Reilly Media; 2020.
Ying X, editor. An overview of overfitting and its solutions. Journal of Physics: Conference Series; 2019: IOP Publishing.
Hutter F, Hoos H, Leyton-Brown K, editors. An efficient approach for assessing hyperparameter importance. International conference on machine learning; 2014: PMLR.
Weerts HJ, Mueller AC, Vanschoren J. Importance of tuning hyperparameters of machine learning algorithms. arXiv. 2020. https://doi.org/10.48550/arXiv.2007.07588.
G-y LU, Zhang F, P-y Z, Cheng Y, Feng-Ian L, G-p FU, et al. Relationship among yield components and selection criteria for yield improvement in early rapeseed (Brassica napus L). Agric Sci Chin. 2011;10(7):997–1003. https://doi.org/10.1016/S1671-2927(11)60086-2.
Özer H, Oral E, DOĞRU Ü. Relationships between yield and yield components on currently improved spring rapeseed cultivars. Tur J Agric Forest. 1999;23(6):603–8.
Marjanović-Jeromela A, Marinković R, Ivanovska S, Jankulovska M, Mijić A, Hristov N. Variability of yield determining components in winter rapeseed (Brassica napus L) and their correlation with seed yield. Genetika-Belgrade. 2011;43(1):51–66.
Hashemi AS, Ghorban AN, Nadali BJ, Omid GC. Genetic evaluation of yield and yield components at advanced generations in rapeseed (Brassica napus L). Afr J Agri Res. 2010;5(15):1958–64.
Rameeh V. Correlation and path analysis in advanced lines of rapeseed (Brassica napus) for yield components. J Oilseed Brassica. 2016;1(2):56–60.
Szała L, Kaczmarek Z, Wójtowicz M, Sosnowska K, Cegielska-Taras T. Genetic variability in oilseed rape DH line population developed from F1 hybrids obtained by crossing black- and yellow-seeded DH lines. I. Yield and yield components. Euphytica. 2021;217(6):1–11. https://doi.org/10.1007/s10681-020-02745-0.
Aftab M, Fatima S, Qamar H, Hassan M, Zubair M, Arshad A, et al. Study of morphological characters give an insight into the genetic variation present in Brassica napus L germplasm. Life Sci J. 2020;17(3):56–61.
Tariq H, Tanveer SK, Qamar M, Javaid RA, Vaseer SG, Jhanzab HM, et al. Correlation and path analysis of Brassica napus genotypes for yield related traits. Life Sci J. 2020;17(8):22–34. https://doi.org/10.7537/marslsj170820.05.
Wang Y, Ding Y, Wei W, Cao Y, Davidson DF, Hanson RK. On estimating physical and chemical properties of hydrocarbon fuels using mid-infrared FTIR spectra and regularized linear models. Fuel. 2019;255:115715. https://doi.org/10.1016/j.fuel.2019.115715.
Punitha S, Al-Turjman F, Stephan T. An automated breast cancer diagnosis using feature selection and parameter optimization in ANN. Computer Electr Eng. 2021;90:106958. https://doi.org/10.1016/j.compeleceng.2020.106958.
May R, Dandy G, Maier H. Review of input variable selection methods for artificial neural networks. Artif Neural Networks-Methodol Adv Biomed Appl. 2011;10(1):19–45.
Wang K, Abid MA, Rasheed A, Crossa J, Hearne S, Li H. DNNGP, a deep neural network-based method for genomic prediction using multi-omics data in plants. Mol Plant. 2023;16(1):279–93. https://doi.org/10.1016/j.molp.2022.11.004.
Allen E, Morgan D. A quantitative analysis of the effects of nitrogen on the growth, development and yield of oilseed rape. J Agric Sci. 1972;78(2):315–24. https://doi.org/10.1017/S0021859600069161.
Habekotté B. Options for increasing seed yield of winter oilseed rape (Brassica napus L): a simulation study. Field Crops Res. 1997;54(2–3):109–26. https://doi.org/10.1016/S0378-4290(97)00041-5.

Acknowledgements

The authors sincerely appreciate Mrs. Leili Alizadeh for her valuable support during the field experiments of this study.

Funding

This study has not been funded by any organization or institute in the public, commercial, or not-for-profit sectors.

Author information

Authors and Affiliations

Department of Agronomy and Plant Breeding, College of Agriculture and Natural Resources, University of Tehran, Karaj, Iran
Masoud Shahsavari, Valiollah Mohammadi & Houshang Alizadeh
Seed and Plant Improvement Institute, Agricultural Research, Education and Extension Organization (AREEO), Karaj, Iran
Bahram Alizadeh

Authors

Masoud Shahsavari
Valiollah Mohammadi
Bahram Alizadeh
Houshang Alizadeh

Contributions

MSH conducted the field experiments, designed and carried out the computational section, and prepared the main manuscript text. VM and BA designed the field experiments, provided the plant materials and revised the manuscript. HA revised the manuscript.

Corresponding author

Correspondence to Valiollah Mohammadi.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1: The performance of regression-based machine learning algorithms using selected traits by feature selection methods as inputs to predict the seed yield of rapeseed.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article

Cite this article

Shahsavari, M., Mohammadi, V., Alizadeh, B. et al. Application of machine learning algorithms and feature selection in rapeseed (Brassica napus L.) breeding for seed yield. Plant Methods 19, 57 (2023). https://doi.org/10.1186/s13007-023-01035-9


Received: 04 March 2023
Accepted: 05 June 2023
Published: 16 June 2023
DOI: https://doi.org/10.1186/s13007-023-01035-9
