Application of machine learning algorithms and feature selection in rapeseed (Brassica napus L.) breeding for seed yield
Plant Methods volume 19, Article number: 57 (2023)
Abstract
Background
Studying the relationships between rapeseed seed yield (SY) and its yield-related traits can assist rapeseed breeders in the efficient indirect selection of high-yielding varieties. However, since conventional, linear methods cannot interpret the complicated relations between SY and other traits, employing advanced machine learning algorithms is inevitable. Our main goal was to find the best combination of machine learning algorithms and feature selection methods to maximize the efficiency of indirect selection for rapeseed SY.
Results
To achieve that, twenty-five regression-based machine learning algorithms and six feature selection methods were employed. SY and yield-related data from twenty rapeseed genotypes were collected from field experiments over a period of 2 years (2019–2021). Root mean square error (RMSE), mean absolute error (MAE), and the coefficient of determination (R²) were used to evaluate the performance of the algorithms. The best performance with all fifteen measured traits as inputs was achieved by the Nu-support vector regression algorithm with quadratic polynomial kernel function (R² = 0.860, RMSE = 0.266, MAE = 0.210). The multilayer perceptron neural network algorithm with identity activation function (MLPNN-Identity) using the three traits obtained from the stepwise and backward selection methods appeared to be the most efficient combination of algorithm and feature selection method (R² = 0.843, RMSE = 0.283, MAE = 0.224). Feature selection suggested that pods per plant and days to physiological maturity, together with plant height or first pod height from the ground, are the most influential traits in predicting rapeseed SY.
Conclusion
The results of this study showed that MLPNN-Identity combined with the stepwise and backward selection methods provides a robust approach to accurately predict SY using fewer traits, and can therefore help optimize and accelerate rapeseed SY breeding programs.
Background
Rapeseed (Brassica napus L.) is the second global oilseed production source after soybean, producing 13% of worldwide oil [1, 2]. The extensively cultivated double-low rapeseed, also known as canola, contains a very low amount of saturated fatty acids, palmitic C16:0 and stearic C18:0 (about 7% in total), and a rich amount of unsaturated fatty acids, oleic C18:1 (about 62%), linoleic C18:2 (20%), linolenic C18:3 (10%), and eicosenoic C20:1 (1%), making it a healthy and nutritious edible oil for humans [3, 4]. Owing to the energy crisis, rapeseed is also increasingly considered a promising green energy source with minimal air pollution and renewability [5,6,7]. Due to the growing demand for rapeseed oil in the food and industrial sectors, attempts to increase its yield have become inevitable [8,9,10,11].
Increasing seed yield (SY) has always been one of the major aims of breeding programs [12]. However, measuring SY in large breeding populations with thousands of genotypes is labor-intensive and time-consuming [13, 14]. Because SY is controlled by numerous genes and strongly affected by the environment, breeding for it is a highly complicated and nonlinear process [15, 16]. As a result, breeding strategies based on secondary traits (e.g., yield-related traits) that are highly linked to a primary trait enable plant breeders to efficiently identify promising lines at early stages of growth [17].
Thus far, conventional statistical methods, for instance, correlation coefficient analysis, principal component analysis (PCA), path analysis, and multiple linear regression (MLR), have been widely used in rapeseed to elucidate relationships between SY and other traits [18,19,20,21]. Nonetheless, these methods presume a linear relationship between the variables; they are neither adequate nor comprehensive in displaying the interactions between traits and SY, and they are incapable of analyzing highly nonlinear and complicated relationships between SY and other traits [22].
Machine learning algorithms have been effectively applied to the optimization and prediction of many complicated biological systems [23]. The use of nonlinear machine learning algorithms in yield component analysis and indirect selection research allows for a better understanding of nonlinear relations between yield and yield-related traits and, consequently, more precise yield prediction, which can efficiently improve breeding programs [24].
Lately, multilayer perceptron neural networks (MLPNNs), one of the most well-known classes of artificial neural networks (ANNs), have been widely employed for predicting and modeling complicated characteristics, such as yield, in several breeding programs and also in other areas of plant science [17, 25]. This algorithm consists of numerous highly interconnected neurons that can be employed simultaneously to solve a particular problem. MLPNN algorithms can also capture the intrinsic knowledge in datasets and determine the interaction between output and input variables without prior physical considerations [25, 26].
Support vector machine (SVM) is another advanced and popular machine learning algorithm with the ability to find both linear and nonlinear relationships in data [12, 27]. The benefits of employing SVMs are a large number of hidden units and a better formulation of the learning problem, which leads to a quadratic optimization task [28]. Support vector regression (SVR) is the regression version of SVM and has recently been used to solve problems in the agricultural and plant sciences [17, 25, 29,30,31].
Although some studies have used ANNs to predict the yield of rapeseed, they have been based on meteorological data (air temperature and precipitation) and information about mineral fertilization [4, 32, 33]. No study has applied machine learning algorithms to agronomic yield-related traits to predict the SY of rapeseed or to introduce indirect selection criteria. Furthermore, apart from the MLR, ANN, and SVR algorithms, there are other methods, such as ridge regression (RR), stochastic gradient descent (SGD), and Bayesian regression, which have not been widely used to predict SY and have remained relatively unknown to plant breeding scientists. Therefore, in the present study, we aimed to (a) develop and optimize regression-based machine learning algorithms to predict the SY of rapeseed, (b) introduce the most important indirect selection criteria for rapeseed SY through feature selection methods, and (c) maximize the efficiency of indirect selection for rapeseed SY by finding the best combination of regression-based machine learning algorithms and feature selection methods. To the best of our knowledge, this study is the first comprehensive report on applying a diverse range of machine learning algorithms in the field of plant breeding.
Materials and methods
Plant material and field experiments
Field experiments were conducted at the research farm of the Seed and Plant Improvement Institute (SPII), Karaj, Iran, in the 2019–2020 and 2020–2021 growing seasons. Twenty genotypes were cultivated in the first year, and nineteen genotypes were cultivated in the second year (due to insufficient seed availability for one of the genotypes). The experiment was carried out in a randomized complete block design (RCBD) with three replicates. The genotypes comprised 7 lines obtained from a pedigree experiment, a restorer line (R2000), 7 hybrids obtained from crosses between the 7 lines and R2000, and 5 cultivars (Nilufar, Neptune, Nima, Okapi and Nafis). Each plot consisted of four rows 4 m in length, with 30 cm between rows and 5 cm between plants within rows; the distance between two plots was 60 cm. At the end of each growing season, seed yield (kg per plot, SY) along with important yield-related traits, namely plant height (cm, PH), pods per main branch (number, PMB), pods per axillary branches (number, PAB), pods per plant (number, PP), branches per plant (number, BP), main branch length (cm, MBL), first pod height from the ground (cm, FPH), pod length (cm, PL), days to start of flowering (number, DSF), days to end of flowering (number, DEF), days to physiological maturity (number, DPM), flowering period (number, FP), thousand seed weight (g, TSW), seeds per pod (number, SP), and stem diameter (mm, SD), were recorded using 10 randomly selected plants from the two intermediate rows in each plot (to prevent marginal effects), and their averages were used for the training and testing datasets of the algorithms.
Data preprocessing
Data normalization is an essential preprocessing step for learning from data [34]. Moreover, when the numerical input variables have very different scales, machine learning algorithms do not perform effectively because they can be dominated by the variables with large values [35]. To address these issues, data were normalized using the Yeo-Johnson normalization method [36], and all the traits were scaled to a [0, 1] range using Eq. (1):

\({X}_{scaled}=\frac{X-{X}_{min}}{{X}_{max}-{X}_{min}}\)

where \({X}_{scaled}\) is the scaled value for input \(X\), and \({X}_{max}\) and \({X}_{min}\) are the maximum and minimum values of \(X\), respectively.
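As a sketch of this preprocessing step, the Yeo-Johnson transform and the min-max scaling of Eq. (1) can be chained on a synthetic trait matrix. The paper only states that Python was used; scikit-learn and the synthetic data below are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, PowerTransformer

rng = np.random.default_rng(0)
X = rng.exponential(scale=2.0, size=(100, 3))  # skewed synthetic "trait" columns

# Yeo-Johnson normalization, then min-max scaling to [0, 1] as in Eq. (1)
X_norm = PowerTransformer(method="yeo-johnson").fit_transform(X)
X_scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X_norm)

print(X_scaled.min(), X_scaled.max())  # → 0.0 1.0
```

Fitting both transformers on the training set only, and reusing them on the test set, avoids information leaking from the test data into the preprocessing.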
Learning curve
A learning curve displays an algorithm's validation and training scores for different numbers of training samples. It is a fundamental technique for determining how much we would benefit from additional training data and, consequently, the optimal size of the training set [37]. To achieve this, different numbers of samples (from 25 to 90) were entered into the MLR and ridge regression algorithms as the training set. To evaluate each training sample number, five-fold cross-validation was implemented, and the mean and 95% confidence interval of the mean square errors (MSEs) were calculated on both the training and validation sets. For both algorithms, the training and validation scores converge to a quite low value as the size of the training set increases (Fig. 1). The MSE of the validation sets approximately reached its lowest value at a training size of 80, with a confidence interval overlapping that of the training set. Thus, a training size of 80 is adequate, and there is no benefit to more training data. The dataset was randomly divided into two subsets with 81 samples (70%) and 36 samples (30%) for training and testing data, respectively.
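The learning-curve procedure above can be sketched as follows, here with only MLR and synthetic data standing in for the field measurements (scikit-learn assumed, not stated in the paper):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

rng = np.random.default_rng(1)
X = rng.normal(size=(117, 15))  # 117 samples x 15 traits, mimicking the study's shape
y = X @ rng.normal(size=15) + rng.normal(scale=0.1, size=117)

# Training/validation MSE for increasing training-set sizes, 5-fold CV
sizes, train_scores, val_scores = learning_curve(
    LinearRegression(), X, y,
    train_sizes=np.arange(25, 91, 5),
    cv=5, scoring="neg_mean_squared_error",
)
val_mse = -val_scores.mean(axis=1)  # mean validation MSE per training size
```

Plotting `val_mse` against `sizes` (with a confidence band from the per-fold scores) reproduces the kind of curve used to justify the training size of about 80.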
Algorithm development
Multiple linear regression
Multiple linear regression (MLR) is a predictive technique based on linear and additive relationships of explanatory variables. MLR aims to describe the relationship between two or more explanatory variables and a dependent variable by assuming a linear relationship [38]. The MLR algorithm was developed according to Eq. (2):

\(\widehat{y}={\theta }_{0}+{\theta }_{1}{x}_{1}+{\theta }_{2}{x}_{2}+\dots +{\theta }_{n}{x}_{n}+\varepsilon\)

where \(\widehat{y}\) is the predicted SY, \({\theta }_{0}\) is the bias term, \({\theta }_{1}\)–\({\theta }_{n}\) are the regression coefficients (aka feature weights), \({x}_{1}\)–\({x}_{n}\) are the input features (traits), and \(\varepsilon\) is the error associated with the \({i}^{th}\) observation. Equation (2) can be concisely written in a vectorized form:

\(\widehat{y}={h}_{\theta }\left(X\right)={\theta }^{T}X\)

where \({\theta }^{T}\) is the transpose of the algorithm's parameter vector (\(\theta\)), containing the bias term \({\theta }_{0}\) and the feature weights \({\theta }_{1}\) to \({\theta }_{n}\); \(X\) is the feature vector, containing \({x}_{0}\) to \({x}_{n}\), with \({x}_{0}\) always equal to 1; and \({h}_{\theta }\) is the hypothesis function, using the algorithm parameters \(\theta\). The error of the algorithm is:

\(E\left(X,{h}_{\theta }\right)=\frac{1}{m}\sum_{i=1}^{m}{\left({\theta }^{T}{X}^{(i)}-{y}^{(i)}\right)}^{2}\)

where \(E\left(X,{h}_{\theta }\right)\) is the error, \(m\) is the number of samples, and \({\theta }^{T}{X}^{(i)}\) and \({y}^{(i)}\) denote the predicted and actual amounts of SY for the \({i}^{th}\) sample, respectively.
Ridge regression
Ridge regression (RR) is a regularized version of MLR. Compared to MLR, RR algorithm has an additional L2 regularization term equal to \(\alpha \frac{1}{2}\sum_{j=1}^{n}{\theta }_{j}^{2}\) where \(\alpha\)Ā is a nonnegative hyperparameter that controls the regularization strength. The L2 regularization term is added to the error function and forces the learning algorithm to not only fit the data but also keep the algorithm weights as small as possible [35].
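A minimal illustration of this shrinkage effect, comparing MLR and RR weight vectors on synthetic data (scikit-learn assumed):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(2)
X = rng.normal(size=(80, 10))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=80)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)  # alpha is the non-negative regularization strength

# The L2 penalty shrinks the weight vector toward zero relative to plain MLR
ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)  # smaller than ols_norm
```

Increasing `alpha` shrinks the weights further; `alpha=0` recovers ordinary MLR.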
Stochastic gradient descent
Stochastic gradient descent (SGD) uses approximate gradients computed from subsets of the training dataset to update the parameters on the fly. A major advantage of this strategy, when combined with L1 regularization, is that many of the feature weights become zero during training. Another benefit is that it bypasses the need to update the weights of features that are absent from the current sample, resulting in substantially quicker training when the feature space is high-dimensional [39]. The SGD algorithm is trained by minimizing Eq. (5):

\(E=\frac{1}{m}\sum_{i=1}^{m}L\left({y}_{i},f\left({x}_{i}\right)\right)+\alpha R\left(\theta \right)\)

where \({y}_{i}\) and \(f({x}_{i})\) are the actual and predicted amounts of SY, respectively; \(L\) is a loss function that measures how well the algorithm fits the data; and \(\alpha R\left(\theta \right)\) is a regularization term that penalizes model complexity. Squared error (Eq. (6)), Huber (Eq. (7)), epsilon-insensitive (Eq. (8)), and the squared form of epsilon-insensitive are the loss functions that can be applied to the SGD algorithm.
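A sketch of fitting SGD with several of these loss functions and an L1 penalty (scikit-learn's `SGDRegressor` assumed; the squared-error loss is also available, but its option name differs between scikit-learn versions, so only the version-stable names are used here):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(80, 15))  # 15 standardized traits
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=80)

# The Huber and epsilon-insensitive losses map to the `loss` option;
# penalty="l1" supplies the sparsity-inducing regularization term alpha*R(theta)
models = {}
for loss in ["huber", "epsilon_insensitive", "squared_epsilon_insensitive"]:
    models[loss] = SGDRegressor(loss=loss, penalty="l1", alpha=0.01,
                                max_iter=2000, random_state=0).fit(X, y)
```

Because SGD is sensitive to feature scale, the normalization step described earlier is a practical prerequisite for this algorithm.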
Generalized linear model
The generalized linear model (GLM) is an extended form of MLR that uses a link function, and its loss function can be computed differently depending on the assumed distribution [40,41,42]. Here, \(\widehat{y}\) is calculated as \(\widehat{y}=f({\theta }^{T}X+{\theta }_{0})\), where \(f\) is the link function.
Bayesian ridge regression
Using Bayesian theory in linear regression helps an algorithm avoid overfitting and also leads to automatic methods of determining model complexity using the training dataset alone [42]. Bayesian ridge regression (BRR) is similar to the RR method, except that BRR has an additional noise precision parameter (\(\lambda\)) besides \(\alpha\). Both \(\alpha\) and \(\lambda\) are estimated concurrently while the algorithm is being fitted, with priors selected from the gamma distribution. The probabilistic model of \(y\) is:

\(p\left(y|X,\theta ,\lambda \right)=\mathcal{N}\left(y|{\theta }^{T}X,{\lambda }^{-1}\right)\)

and the Gaussian prior of the coefficients \(\theta\) is:

\(p\left(\theta |\alpha \right)=\mathcal{N}\left(\theta |0,{\alpha }^{-1}I\right)\)
A comprehensive description of Bayesian regression can be found in [42, 43].
Automatic relevance determination
Automatic relevance determination (ARD), also known as the relevance vector machine, was first introduced by [44] and typically results in sparser models, which allows quicker evaluation on the testing dataset while preserving the same generalization error. Like BRR, ARD is based on Bayesian theory, with the difference that each coefficient \({\theta }_{i}\) is drawn from its own Gaussian distribution, centered on zero and with a precision \({\lambda }_{i}\):

\(p\left(\theta |\lambda \right)=\mathcal{N}\left(\theta |0,{A}^{-1}\right)\)

where \(A\) is a positive definite diagonal matrix with diagonal \(\lambda =\left\{{\lambda }_{1}, \dots , {\lambda }_{n}\right\}\). More information on developing an ARD algorithm is available in [44, 45].
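The contrast between the two Bayesian approaches can be sketched with scikit-learn's `BayesianRidge` and `ARDRegression` (an assumption; also note that scikit-learn's `alpha_`/`lambda_` naming is swapped relative to the notation above):

```python
import numpy as np
from sklearn.linear_model import ARDRegression, BayesianRidge

rng = np.random.default_rng(4)
X = rng.normal(size=(80, 10))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=80)  # only the first feature is relevant

# Both estimators infer their precision hyperparameters from the training data
brr = BayesianRidge().fit(X, y)
ard = ARDRegression().fit(X, y)  # one precision per coefficient -> sparser weights

n_large = int(np.sum(np.abs(ard.coef_) > 0.5))  # effectively retained coefficients
```

With one informative feature, ARD concentrates the weight on that coefficient and drives the others toward zero, which is the sparsity behavior described above.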
Support vector regression
In linear support vector regression (LSVR), we aim to minimize Eq. (11):

where \(b\) represents the bias, \(C\) is the regularization parameter, and \(\varnothing\) is the loss function (epsilon-insensitive or squared epsilon-insensitive can be applied).
Epsilon support vector regression (ESVR) is another form of SVR employed in this study. ESVR is trained by minimizing the following expression:

In this case, we penalize samples whose predictions are at least \(\epsilon\) away from their true targets. Depending on whether their predictions lie above or below the \(\epsilon\)-tube, these samples penalize the objective by \({\upzeta }_{\mathrm{i}}\) or \({\upzeta }_{\mathrm{i}}^{*}\) (Fig. 2A). Because high-dimensional data lead to complex computational processes, it is usually more advantageous to solve the dual problem, which reduces the features from N to S. The dual problem is:

where \(e\) is the vector of all ones, \(Q\) is an \(n\times n\) positive semidefinite matrix, and \({Q}_{is}=K\left({x}_{i},{x}_{s}\right)\) is the kernel function. Here, the training vectors are implicitly mapped into a higher (possibly infinite) dimensional space by the function \(\varnothing\). Equation (14) gives the estimation function of the ESVR algorithm.
Different kernel functions (Eqs. (15), (16), (17), and (18)) can be applied to the ESVR algorithm.

where \(\upgamma\) and \(r\) are hyperparameters, and \(d\) specifies the degree of the polynomial kernel function. Nu-support vector regression (NuSVR) adopts a similar approach to ESVR, with an additional Nu hyperparameter that controls the number of support vectors.
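A sketch of the study's best-performing configuration, NuSVR with a quadratic polynomial kernel, on synthetic data with a quadratic signal (scikit-learn assumed; the hyperparameter values below are illustrative, not those tuned in the paper):

```python
import numpy as np
from sklearn.svm import NuSVR

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 5))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.05, size=100)  # quadratic signal

# Polynomial kernel K(x, x') = (gamma*<x, x'> + r)^d with d = 2 (quadratic);
# `nu` bounds the fraction of support vectors, replacing ESVR's fixed epsilon
model = NuSVR(kernel="poly", degree=2, gamma="scale", coef0=1.0, nu=0.5, C=10.0)
model.fit(X, y)
train_r2 = model.score(X, y)
```

Because the kernel evaluates the polynomial expansion implicitly, no combinatorial features are ever materialized, which is the computational advantage discussed later for polynomial kernels.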
Multilayer perceptron neural network
MLPNNs, one of the most well-known forms of ANNs, comprise an input layer, one or more hidden layers, and an output layer (Fig. 2B). An MLPNN algorithm uses Eq. (19) as the loss function, which is minimized through the training process.

To compute \(\widehat{y}\) in an MLP with \(u\) neurons in the hidden layer and \(z\) output features, Eq. (20) is implemented:

\(\widehat{y}={w}_{0}+\sum_{j=1}^{u}{w}_{j}\,g\left({w}_{j0}+\sum_{i=1}^{n}{w}_{ij}{x}_{i}\right)\)

where \({x}_{i}\) denotes the \({i}^{th}\) input feature, \({w}_{j}\) indicates the weight on the output of the \({j}^{th}\) hidden neuron, \({w}_{ij}\) is the weight of the direct connection between input neuron \(i\) and hidden neuron \(j\), \({w}_{j0}\) represents the bias for node \(j\), \({w}_{0}\) denotes the bias of the output neuron, and \(g\) is the activation function (identity, logistic, tanh, or relu in this study).
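A sketch of a single-hidden-layer MLPNN with the identity activation (the configuration that later proves most efficient with selected traits), using scikit-learn's `MLPRegressor` on synthetic scaled data (both assumptions):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(6)
X = rng.uniform(size=(100, 3))       # three input traits, already in [0, 1]
y = X @ np.array([1.0, -0.5, 0.25])  # noiseless linear target

# One hidden layer of u = 8 neurons; `activation` selects g from Eq. (20)
mlp = MLPRegressor(hidden_layer_sizes=(8,), activation="identity",
                   solver="lbfgs", max_iter=5000, random_state=0).fit(X, y)
train_r2 = mlp.score(X, y)
```

With the identity activation, the network composes linear maps and thus fits this linear target almost exactly; swapping `activation` to `"tanh"`, `"relu"`, or `"logistic"` yields the other variants compared in the Results.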
Hyperparameter optimization
To find the optimized values of the hyperparameters, a cross-validation method was implemented. The training dataset was first shuffled and then randomly split into train (70%) and validation (30%) sets with 150 replications, resulting in 150 independent train–validation sets. To find the optimized value of a hyperparameter in an algorithm, we first set aside the validation sets. We then trained the algorithm on the train sets using a range of values for the specific hyperparameter. The trained algorithms were applied to the validation sets, and the average error for each hyperparameter value was calculated. Finally, the value with the minimum error was taken as the optimized value of the hyperparameter.
As hyperparameter optimization of MLPNN algorithms is computationally intensive, five-fold cross-validation was used to optimize their hyperparameters, including the number of hidden layers and the number of neurons in each hidden layer. We first divided the training dataset into five groups (folds). We then fitted MLPNN algorithms on four folds, applied the resulting algorithm to the remaining fold, and measured the error. This procedure was repeated for each of the five folds in turn, and the optimized hyperparameters were selected based on the minimum average error over the five folds.
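The repeated random-split search described above can be sketched as follows, with ridge regression's `alpha` as a stand-in hyperparameter and synthetic data (scikit-learn's `ShuffleSplit` assumed; the candidate grid is illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import ShuffleSplit

rng = np.random.default_rng(7)
X = rng.normal(size=(81, 15))  # 81 training samples, 15 traits
y = X @ rng.normal(size=15) + rng.normal(scale=0.5, size=81)

# 150 repeated random 70/30 train/validation splits, as in the paper
splitter = ShuffleSplit(n_splits=150, test_size=0.3, random_state=0)
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
mean_error = {}
for alpha in alphas:
    errors = [
        mean_squared_error(
            y[val], Ridge(alpha=alpha).fit(X[tr], y[tr]).predict(X[val])
        )
        for tr, val in splitter.split(X)
    ]
    mean_error[alpha] = float(np.mean(errors))

best_alpha = min(mean_error, key=mean_error.get)  # lowest mean validation MSE
```

Re-using the same `splitter` for every candidate value keeps the 150 splits identical across candidates, so the comparison is fair.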
Algorithm performance
Algorithm performance in predicting the desired output was assessed using three statistical quality parameters: root mean square error (RMSE), mean absolute error (MAE), and the coefficient of determination (R²), computed as follows:

\(RMSE=\sqrt{\frac{1}{m}\sum_{i=1}^{m}{\left({O}_{i}-{P}_{i}\right)}^{2}}\)

\(MAE=\frac{1}{m}\sum_{i=1}^{m}\left|{O}_{i}-{P}_{i}\right|\)

\({R}^{2}=1-\frac{\sum_{i=1}^{m}{\left({O}_{i}-{P}_{i}\right)}^{2}}{\sum_{i=1}^{m}{\left({O}_{i}-\overline{O}\right)}^{2}}\)

where \(m\) is the number of data points, \({O}_{i}\) is the observed value, \({P}_{i}\) is the predicted value, and the bar denotes the mean of the feature.
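The three criteria can be written directly in NumPy (a minimal sketch; scikit-learn's `mean_squared_error`, `mean_absolute_error`, and `r2_score` compute the same quantities):

```python
import numpy as np

def rmse(obs, pred):
    """Root mean square error."""
    obs, pred = np.asarray(obs), np.asarray(pred)
    return float(np.sqrt(np.mean((obs - pred) ** 2)))

def mae(obs, pred):
    """Mean absolute error."""
    obs, pred = np.asarray(obs), np.asarray(pred)
    return float(np.mean(np.abs(obs - pred)))

def r2(obs, pred):
    """Coefficient of determination."""
    obs, pred = np.asarray(obs), np.asarray(pred)
    ss_res = np.sum((obs - pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((obs - obs.mean()) ** 2)    # total sum of squares
    return float(1.0 - ss_res / ss_tot)

observed = np.array([1.0, 2.0, 3.0, 4.0])
predicted = np.array([1.1, 1.9, 3.2, 3.8])
```

On this toy pair, MAE is 0.15 and R² is 0.98, illustrating how a near-perfect fit drives R² toward 1 while the error measures shrink toward 0.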
Feature selection and sensitivity analysis of input features
Different methods, including principal component analysis (PCA), forward selection (FS), backward selection (BS), stepwise selection (SS) [46], the Pearson correlation coefficient, and lasso [47], were used to reduce the number of yield-related traits and find the most effective traits for explaining SY variance. Figure 3 presents a general illustration of the connections between the different stages of this study. A sensitivity analysis was also performed to study the effects of the various independent traits on the output and to provide insight into the usefulness of individual traits. FS, BS, and SS were conducted using the caret (version 6.090) and leaps (version 3.1) packages in R (version 4.1); the other feature selection methods, algorithm development, sensitivity analysis, and visualization were implemented in Python (version 3.7.7). Trait clustering was carried out via the cluster package (version 2.1.4) in R.
Results
Seed yield prediction using all measured traits
A total of 25 algorithms were developed and optimized to predict the SY of rapeseed. All measured yield-related traits were entered into the algorithms as inputs, and their performances were evaluated using R², RMSE, and MAE values (Tables 1, 2). According to the results, the lowest RMSE and the highest R² values were achieved by the NuSVR algorithm with quadratic polynomial kernel function (NuSVR-QP) in both the training and testing stages (Fig. 4A, B), followed by the MLPNN algorithm with tanh activation function (MLPNN-Tanh) and the NuSVR algorithm with cubic polynomial kernel function (NuSVR-CP) in the training and testing datasets, respectively. The lowest training MAE values were seen in the MLPNN algorithms with tanh and relu activation functions, respectively. The MLPNN algorithm with logistic activation function (MLPNN-Logistic) had the lowest testing MAE (Fig. 4D), ahead of NuSVR-QP. The lowest accuracy was achieved by the ESVR algorithm with sigmoid kernel function (ESVR-Sigmoid) on all statistical criteria in both the training and testing datasets (Fig. 4E, F), followed by MLPNN-Logistic in the training stage and MLR in the testing stage. The predicted and measured values of SY in both the training and testing datasets are presented and contrasted as box plots to provide a better understanding of the data distribution and the effectiveness of the algorithms in predicting SY (Fig. 5).
In the present study, the decrease in R² and the increases in RMSE and MAE between the testing and training datasets of MLR (R²_{Test} − R²_{Train} = −0.07, RMSE_{Test} − RMSE_{Train} = 0.082, MAE_{Test} − MAE_{Train} = 0.063) demonstrated that MLR was the most overfitted algorithm, followed by the GLM algorithm (R²_{Test} − R²_{Train} = −0.04, RMSE_{Test} − RMSE_{Train} = 0.058, MAE_{Test} − MAE_{Train} = 0.049). The scatter plots of the MLR and GLM algorithms (Fig. 6A, B, E, F) also show that they fit very well in the training stage but could not repeat the same performance in the testing stage.
Feature selection and SY prediction using selected traits
To reduce the dimensions of the data and find the most important variables for predicting SY in rapeseed genotypes, 6 different feature selection methods, including the Pearson correlation coefficient, principal component analysis (PCA), stepwise selection (SS), forward selection (FS), backward selection (BS), and lasso, were used in this study. To avoid overfitting in the SS, FS, and BS methods, the leaps and caret packages in R with five-fold cross-validation were employed to create 10 trait subsets. The first subset included the first trait selected by each method, and each following subset added one trait to the previous trait(s). Based on the R², RMSE, and MAE values of the cross-validation stage, the best subsets were achieved using PP, FPH, and DPM in the SS and BS methods and PP, PH, and DPM in the FS method (Table 3).
Using the ability of the lasso method to effectively reduce the number of features by assigning zero coefficients to less important variables led to Eq. (28):

where SY is seed yield, PH is plant height, PP is pods per plant, and DPM is days to physiological maturity. As can be seen, the FS and lasso methods selected the same traits.
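This zeroing behavior can be sketched with scikit-learn's `Lasso` on synthetic data, where hypothetical feature indices 0–2 stand in for PH, PP, and DPM (all names, data, and the `alpha` value are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(8)
X = rng.normal(size=(100, 15))  # 15 standardized candidate traits
# Hypothetical signal carried by three traits only (stand-ins for PH, PP, DPM)
y = 0.8 * X[:, 0] + 0.6 * X[:, 1] + 0.4 * X[:, 2] + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.05).fit(X, y)
selected = np.flatnonzero(lasso.coef_)  # features with non-zero coefficients
```

The L1 penalty drives the coefficients of the twelve uninformative features to exactly zero, so reading off the non-zero entries performs feature selection as a by-product of fitting.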
Since having 3 traits in all variable selection methods enabled us to compare the methods with the same number of variables, three traits were also selected with the Pearson correlation coefficient and PCA methods. The results of the Pearson correlation coefficient showed that PP, PAB, and SD had the highest positive correlations with the SY of the rapeseed genotypes (Fig. 7). PP, PAB, and BP were the traits selected based on the PCA results (Table 4).
The traits given by the feature selection methods were applied as inputs to the algorithms developed in the “Seed yield prediction using all measured traits” section to estimate the power of the feature selection methods and find the algorithms most compatible with predicting the SY of rapeseed genotypes using fewer traits. Additional file 1 displays the performance of the algorithms using the traits obtained from each feature selection method, and a summary is presented in Table 5. The best training performance was seen in the NuSVR algorithm with RBF kernel function and the SS/BS methods (NuSVR-RBF-SS/BS) (Fig. 8C). Also, the same algorithm with the lasso/FS methods (NuSVR-RBF-lasso/FS) yielded the lowest MAE in the testing dataset (Fig. 8D). The highest R² value for the testing dataset was seen in the MLPNN algorithm with identity activation function and the SS/BS methods (MLPNN-Identity-SS/BS) (Fig. 8B). The SS/BS methods combined with 3 algorithms, GLM and MLPNN with tanh and identity activation functions, simultaneously showed the lowest testing RMSE (Table 5). The ESVR algorithm with cubic polynomial kernel function and the SS/BS methods (ESVR-CP-SS/BS) had the worst performance on all three statistical criteria in both the training and testing datasets (Fig. 8E, F). A comparative box plot showing the clear differences between the performance of the algorithms is presented in Fig. 9.
Some algorithms performed differently depending on whether all measured traits or only the selected traits were used as inputs. For instance, the NuSVR and ESVR algorithms with QP and CP kernel functions performed well when all measured traits were used as inputs; however, applying the traits selected by the feature selection methods led to lower performance (Fig. 10). Nevertheless, there was no noticeable difference in the performance of the NuSVR and ESVR algorithms with linear kernel function, nor in the LSVR algorithms, when all measured traits or selected traits were applied as inputs (Fig. 11). Likewise, using all measured traits or the traits chosen by feature selection methods as inputs did not significantly affect the performance of the regularized linear algorithms (ridge, BRR, ARD, and SGD) (Fig. 12). Compared to using all measured traits as inputs, the MLPNN algorithms with identity, tanh, and relu activation functions demonstrated better testing performance when the traits selected by the SS, FS, BS, and lasso methods were entered as inputs (Fig. 13).
To assess the efficiency of the feature selection methods and compare them with using all measured traits as inputs, the mean performance of the algorithms using all measured traits and using the traits selected by each feature selection method was calculated for both the training and testing stages (Table 6). According to the results, using all measured traits as inputs to predict the SY of rapeseed genotypes resulted in the highest R² value and the lowest RMSE and MAE. Among the feature selection methods, the best performance on all 3 statistical criteria was achieved by the lasso and FS methods in both the training and testing datasets, while PCA exhibited the worst. Moreover, based on the testing R² and RMSE values, the most efficient algorithms with the traits selected by correlation and PCA as inputs ranked thirty-fifth and forty-fifth, respectively, among all combinations of algorithms and feature selection methods (Additional file 1).
Sensitivity analysis
To find the most important input traits affecting the SY of rapeseed, a sensitivity analysis was conducted using the MLPNN algorithm with identity activation function, the NuSVR algorithm with quadratic kernel function, and the MLR algorithm. The results showed that the highest RMSE and MAE and the lowest R² were obtained when DPM was removed, in all 3 algorithms (Table 7). PP was also among the first 4 traits whose elimination from the 3 algorithms caused an increase in RMSE and MAE, as well as a reduction in the R² value. Figure 14 shows the status of high- and low-yielding genotypes from the perspective of the DPM and PP traits.
Discussion
Increasing SY has always been a central objective in breeding programs [12]. However, assessing SY in large populations of diverse genotypes is a laborious and timeconsuming task [13, 14]. Due to the intricate interaction of genetic and environmental factors, seed yield breeding is a complex and nonlinear process [15, 16]. Consequently, breeders have adopted strategies that employ secondary traits closely associated with the primary trait to efficiently identify promising genotypes at early growth stages [17]. While conventional statistical methods have been widely used in rapeseed research to explore the relationships between SY and other traits, their assumption of linear relationships falls short in capturing the interactions and highly nonlinear associations between SY and other traits [18,19,20,21,22]. In contrast, the application of machine learning algorithms has proven effective in optimizing and predicting complex biological systems and, therefore, can be employed to facilitate more precise yield prediction and enhance the efficiency of breeding programs [23, 24].
Polynomial kernels of SVR algorithms: efficient tools for SY prediction using all traits as inputs
SY is a quantitative and complex trait with a nonlinear and complicated relationship to other yield-related traits [9, 22]. Linear algorithms cannot fully capture the relationship between SY and other traits. Using nonlinear methods such as polynomial regression can be a solution to this issue. Polynomial regression involves including polynomial terms (quadratic, cubic, etc.) in a regression equation, thereby creating new combinatorial features and allowing nonlinear models to be learned [48]. However, polynomial regression has a drawback: it is slow and computationally intensive [35]. To address this, polynomial kernel functions in the SVR algorithms can be employed; they perform the equivalent operations in the original dimension without adding any combinatorial features and are consequently much more computationally efficient [35]. In the present study, the NuSVR and ESVR algorithms with the QP and CP kernel functions were the four most efficient algorithms in the testing stage based on R² and RMSE values (Table 1, Fig. 5B), which demonstrates the high capability of SVR algorithms combined with polynomial kernel functions to predict a complex trait such as SY in rapeseed.
Hyperparameter optimization: the first approach to avoid overfitting
Overfitting is one of the major issues in machine learning; it occurs when an algorithm fails to generalize successfully from observed data to new data. In the presence of overfitting, the algorithm performs flawlessly on the training set while fitting badly on the testing set [49]. The MLR and GLM algorithms with all measured traits as inputs appeared to be the most overfitted algorithms in this study (Table 1, Fig. 6A, B, E, F). Algorithm training is essentially a process of hyperparameter optimization: well-optimized hyperparameters strike a good balance between training accuracy and regularity, thereby inhibiting overfitting. Regularization-based algorithms help distinguish noise and meaningful from meaningless features, and assign them different weights [49,50,51]. In this study, MLR was the only algorithm without any hyperparameter, and hyperparameter optimization led to better performance in all the other algorithms. Consequently, using regularization-based algorithms with hyperparameter optimization can be a way to overcome overfitting in rapeseed SY prediction. An important practical benefit of these results is a reduction in the time required to optimize predictive algorithms, thereby expediting rapeseed breeding programs.
Feature selection
Stepwise selection is widely used to find the most important traits related to SY in plant breeding. However, discovering the best subset of traits is an issue because all subset regression methods (SS, FS and BS) are in-sample methods for assessing and tuning models. Consequently, model selection may suffer from overfitting (fitting the noise in the data) and may not perform as well on new data [48]. To avoid this, we validated the models using cross-validation. Consistent with the results of the SS, BS, and FS methods (Table 3), previous studies using stepwise regression have demonstrated that pods per plant, growth duration, and pods on the main raceme [52], and pods per plant, number of branches, and duration of flowering [21] had significant effects on SY in rapeseed genotypes. The results of the correlation analysis (Fig. 7) agree with other studies that have reported a positive and significant correlation between SY and pods per plant [18,19,20, 53,54,55,56,57], branch number [18, 55, 58, 59] and plant height [18, 54, 58] in rapeseed genotypes. Branches per plant and pods per plant were also reported as effective traits in the first principal component associated with the yield of rapeseed accessions [19]. TSW and SP were not selected by any feature selection method and also showed a negative correlation with SY (Fig. 7), indicating that they are not suitable indirect criteria for rapeseed SY breeding. Similar to our results, some studies have reported a negative correlation between SY and TSW [15, 52, 54, 57, 59] and SP [15, 55].
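Cross-validated subset selection of the kind described above can be sketched with scikit-learn's `SequentialFeatureSelector`, which scores each candidate subset on held-out folds instead of in-sample. The data and trait indices below are synthetic stand-ins.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 6))                      # six candidate traits
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.1, size=100)

# Forward selection with 5-fold cross-validation: each candidate subset is
# evaluated out-of-sample, guarding against fitting the noise in the data
sfs = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=2, direction="forward", cv=5
)
sfs.fit(X, y)
selected = sorted(np.flatnonzero(sfs.get_support()))
print(selected)  # the two informative columns
```

Setting `direction="backward"` gives the backward-elimination analogue; the `cv` argument is what distinguishes this from the purely in-sample stepwise procedures the paragraph cautions against.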
Our findings suggest that correlation and PCA are not efficient methods for finding proper indirect selection criteria for the SY of rapeseed (Table 6). To better understand how the traits were chosen by the feature selection methods, the measured traits were clustered using Euclidean distance and the Ward method (Fig. 15). All traits selected by the correlation and PCA methods fell in the first cluster, while SS, BS, FS, and lasso chose traits from three different clusters, which resulted in more efficient performance. Failure to consider the combined effects of the traits could be one factor behind the inefficiency of the correlation and PCA methods. Unlike these two methods, SS, BS, and lasso take the combined effect of features into account and choose the combination with the best fit [35, 46].
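Trait clustering of the kind behind Fig. 15 can be reproduced in outline with SciPy: cluster the columns (traits), not the rows (plants), under Euclidean distance and Ward linkage. The data here is synthetic and the two correlated columns are purely illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(3)
base = rng.normal(size=(50, 1))
traits = np.hstack([
    base + rng.normal(scale=0.1, size=(50, 2)),   # two highly correlated traits
    rng.normal(size=(50, 2)),                     # two unrelated traits
])

# Transpose so each row is a trait, then build the Ward dendrogram and cut
# it into two clusters
tree = linkage(traits.T, method="ward", metric="euclidean")
labels = fcluster(tree, t=2, criterion="maxclust")
print(labels)  # cluster label per trait
```

Correlated traits fall into the same cluster, so picking several traits from one cluster (as correlation and PCA effectively did) adds little new information, whereas drawing from different clusters does.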
Feature selection: the second approach to avoid overfitting
Comparing the results in Additional file 1 and Table 5 with the data in Table 1 shows that feature selection methods positively affected the overfitted algorithms. Compared to using all measured traits as inputs, applying the traits selected by feature selection methods reduced the amount of overfitting in the MLR algorithm and dramatically improved the testing performance of the GLM algorithm, which became one of the best testing performances; this indicates that the performance of these algorithms improves when fewer inputs are used (Fig. 6C, D, G, H).
Evaluating algorithms with all and selected traits: the influence of feature selection
Although using all measured traits as inputs in the NuSVR and ESVR algorithms with QP and CP kernel functions led to efficient performance (Table 1, Fig. 5), applying the traits selected by feature selection methods reduced their performance (Fig. 10). This reveals that the complex nature of polynomial algorithms is helpful when the data is high-dimensional and a non-linear, complex relationship exists between the dependent and independent variables. Nonetheless, the RBF kernel function in NuSVR and the linear kernel function in ESVR performed effectively with the traits selected by feature selection (Table 5). One of the benefits of the NuSVR and ESVR algorithms is therefore their ability to work with different kernel functions, which makes them flexible with respect to different inputs. In contrast to the polynomial kernel functions, no considerable difference was seen in the performance of the NuSVR and ESVR algorithms with the linear kernel function, or of the LSVR algorithm, using all measured traits or selected traits as inputs (Fig. 11). Similarly, the performance of the other regularized linear algorithms (ridge, BRR, ARD and SGD) did not change significantly between all measured traits and the traits selected by feature selection methods (Fig. 12). One major advantage of regularized linear algorithms is their ability to systematically weigh the more important features during training [60], and they therefore show relatively similar performance with or without feature selection.
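The implicit weighting performed by regularized linear algorithms can be seen directly in a lasso fit, where coefficients of uninformative features are shrunk toward zero. This is a generic illustration on synthetic data, not the study's fitted model.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)   # only column 0 matters

# The L1 penalty drives irrelevant coefficients to (near) zero, so the model
# performs its own feature weighting during training
coef = Lasso(alpha=0.1).fit(X, y).coef_
print(np.round(coef, 2))  # large weight on column 0, ~0 elsewhere
```

This built-in down-weighting of irrelevant inputs is why such algorithms behave similarly whether they receive all fifteen traits or only the pre-selected ones.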
Using all measured traits as inputs to the MLPNN algorithm with identity, tanh, and relu activation functions caused these algorithms to overfit, while reducing the inputs through feature selection prevented or significantly reduced overfitting (Fig. 13). Furthermore, they showed better testing performance with the traits selected by the SS, FS, BS, and lasso methods than with all measured traits (Fig. 13). [61, 62] have also noted the crucial role of feature selection in the performance of neural networks and in removing the effect of overfitting. Comparing the MLPNNs with the other algorithms when the traits selected by feature selection methods were used showed that the MLPNNs were more efficient than the other algorithms despite using fewer traits (Table 5). Moreover, the insignificant reduction in the performance of MLPNN-Identity with the traits obtained from the SS and BS methods, relative to the most efficient algorithm using all measured traits as inputs (NuSVR-QP) (Tables 1, 5), shows that combining MLPNN-Identity with the SS and BS methods is an efficient approach for precise SY prediction using a much smaller number of traits (three instead of fifteen). This can greatly help breeders select high-performance plants effectively and simply in rapeseed SY breeding programs, since direct selection, or indirect selection via many traits, is practically impossible when thousands of genotypes are involved. While this paper focuses on a specific type of artificial neural network, the MLPNN, it is important to mention that there is a diverse range of ANN algorithms beyond those presented here. Deep neural network genomic prediction (DNNGP) is a notable example, particularly in the field of plant genomic prediction, where it has recently been utilized with great success [63].
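An MLP with the identity activation on a reduced trait set, analogous to the MLPNN-Identity plus SS/BS combination, can be sketched with scikit-learn. The architecture, solver, and data below are illustrative assumptions, not the hyperparameters tuned in the study.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(5)
X = rng.normal(size=(150, 3))          # e.g. the three selected traits
y = X @ np.array([1.0, 0.8, 0.5]) + rng.normal(scale=0.1, size=150)

# Identity activation makes every hidden unit linear; inputs are scaled
# first, as is standard practice for neural networks
mlp = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(8,), activation="identity",
                 solver="lbfgs", max_iter=2000, random_state=0),
)
mlp.fit(X, y)
r2 = mlp.score(X, y)
print(round(r2, 3))  # in-sample R^2
```

With only three well-chosen inputs, even a small network of this form fits the signal closely, which mirrors the paper's finding that MLPNN-Identity keeps near-top accuracy on the reduced trait set.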
Indirect selection criteria
The results of the sensitivity analysis (Table 7) were fully consistent with the results of feature selection, since DPM and PP were the traits shared by SS, FS, BS and lasso, the efficient feature selection methods. The rapeseed genotypes can be divided into two almost distinct groups such that high-yielding genotypes have a greater number of pods per plant and a longer physiological maturity time than low-yielding genotypes (Fig. 14), another indication that selection based on these traits can be effective in developing rapeseed varieties with higher SY. Comparing the results of the sensitivity analysis and feature selection also indicated that DPM and PP, along with PH or FPH, are the most important combination of traits affecting the SY of rapeseed and can consequently be considered the most important indirect indicators in breeding programs to increase rapeseed SY. Many studies have noted the direct positive effect of pods per plant on SY [19,20,21, 54, 59]. Increasing the number of pods per plant, rather than improving the number or weight of seeds per pod, is the strategy rapeseed plants employ to enhance SY [15]. Likewise, nitrogen availability increases the SY of rapeseed by producing more pods rather than by influencing seed or pod weight [15, 64]. The direct positive effect of plant height on SY was reported by [20, 59]. This is also an indirect contribution of PP to increasing SY, because a taller plant usually has more pods and thus a higher yield [18]. [65] reported that delayed maturity was a contributing factor to increased SY, and that crops with high SY potential had late maturity. Similarly, [18] observed a direct connection between maturity time and SY in some of their experiments.
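One common way to run a sensitivity analysis of this kind is permutation importance: shuffle one input at a time and measure how much the model's score drops. The sketch below uses a stand-in model and synthetic traits; it illustrates the technique, not the exact procedure behind Table 7.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(6)
X = rng.normal(size=(120, 4))              # e.g. PP, DPM, PH, TSW (synthetic)
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.1, size=120)

model = LinearRegression().fit(X, y)

# Shuffling an influential column destroys its relationship with y, so the
# mean drop in R^2 across repeats ranks the inputs by importance
imp = permutation_importance(model, X, y, n_repeats=20,
                             random_state=0).importances_mean
print(imp.argmax())  # the column whose permutation hurts the score most
```

Here the first column dominates, just as PP and DPM dominated the paper's sensitivity ranking.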
Conclusion
The non-linear and complex relations between SY and yield-related traits are among the main issues that have limited the application of conventional multivariate models to finding the most effective traits for indirect selection. Regression-based machine learning algorithms, along with feature selection methods, can provide a robust solution for accurate SY prediction and for identifying effective indirect selection criteria. To achieve that, different regression-based machine learning algorithms and feature selection methods were used in the present study. The NuSVR and ESVR algorithms with polynomial kernel functions had the best performance when all the measured yield-related traits were used as inputs to predict the SY of rapeseed, revealing the high potential of SVR algorithms for interpreting the non-linear relations of high-dimensional data in complex biological processes. Although polynomial kernels are not suitable options when fewer features enter the SVR algorithms as inputs, the RBF (with NuSVR) and linear (with ESVR) kernel functions performed effectively with the traits selected by feature selection, demonstrating the flexibility of NuSVR and ESVR to work efficiently with different inputs. Employing feature selection methods to find the traits most affecting SY, and using the selected features as inputs to the algorithms, showed that the MLPNN algorithm with the identity activation function is the most efficient algorithm, and the most compatible with the traits selected by the SS and BS methods. MLPNNs are well-known and powerful algorithms; however, they are sensitive to the input variables, and employing them together with proper feature selection methods yields efficient performance. Regularized linear algorithms are effective at overcoming overfitting, one of the main issues in regression, and are capable of maintaining stable performance using either numerous or selected features as inputs.
According to the results of the feature selection methods and the sensitivity analysis, DPM and PP, together with PH or FPH, were the most important traits affecting the SY of rapeseed. Since optimizing and finding the most efficient algorithms for predicting complex biological processes is a time-consuming and challenging procedure, the optimized algorithms of this study can be used for quicker and more efficient SY breeding programs of rapeseed, one of the most important oil crops.
Availability of data and materials
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.
Abbreviations
SY: Seed yield
PH: Plant height
PMB: Pods per main branch
PAB: Pods per axillary branches
PP: Pods per plant
BP: Branches per plant
MBL: Main branch length
FPH: First pod height from the ground
PL: Pod length
DSF: Days to start of flowering
DEF: Days to end of flowering
DPM: Days to physiological maturity
FP: Flowering period
TSW: Thousand seed weight
SP: Seeds per pod
SD: Stem diameter
RMSE: Root mean square error
MAE: Mean absolute error
MLR: Multiple linear regression
RR: Ridge regression
BRR: Bayesian ridge regression
ARD: Automatic relevance determination
GLM: Generalized linear model
SGD: Stochastic gradient descent
NuSVR: Nu-support vector regression
ESVR: Epsilon-support vector regression
LSVR: Linear support vector regression
MLPNN: Multilayer perceptron neural network
ANN: Artificial neural network
SVM: Support vector machine
SVR: Support vector regression
RBF: Radial basis function
QP: Quadratic polynomial
CP: Cubic polynomial
PCA: Principal component analysis
SS: Stepwise selection
FS: Forward selection
BS: Backward selection
References
Raza A. Ecophysiological and biochemical responses of rapeseed (Brassica napus L) to abiotic stresses: consequences and mitigation strategies. J Plant Growth Regul. 2021;40(4):1368ā88. https://doi.org/10.1007/s0034402010231z.
Hu D, Jing J, Snowdon RJ, Mason AS, Shen J, Meng J, et al. Exploring the gene pool of Brassica napus by genomicsbased approaches. Plant Biotechnol J. 2021;19(9):1693ā712. https://doi.org/10.1111/pbi.13636.
Spasibionek S, Mikołajczyk K, Ćwiek-Kupczyńska H, Piętka T, Krótka K, Matuszczak M, et al. Marker assisted selection of new high oleic and low linolenic winter oilseed rape (Brassica napus L) inbred lines revealing good agricultural value. PLoS One. 2020;15(6):e0233959. https://doi.org/10.1371/journal.pone.0233959.
Niedbała G. Application of artificial neural networks for multicriteria yield prediction of winter rapeseed. Sustainability. 2019;11(2):533. https://doi.org/10.3390/su11020533.
Tian HY, Channa SA, Hu SW. Relationships between genetic distance, combining ability and heterosis in rapeseed (Brassica napus L). Euphytica. 2017;213(1):1ā11. https://doi.org/10.1007/s106810161788x.
Wang T, Wei L, Wang J, Xie L, Li YY, Ran S, et al. Integrating GWAS, linkage mapping and gene expression analyses reveals the genetic control of growth period traits in rapeseed (Brassica napus L). Biotechnol Biofuel. 2020;13(1):1ā19. https://doi.org/10.1186/s13068020017740.
Hossain Z, Johnson EN, Wang L, Blackshaw RE, Cutforth H, Gan Y. Plant establishment, yield and yield components of Brassicaceae oilseeds as potential biofuel feedstock. Indust Crops Product. 2019;141:111800. https://doi.org/10.1016/j.indcrop.2019.111800.
Abbadi A, Leckband G. Rapeseed breeding for oil content, quality, and sustainability. Eur J Lipid Sci Technol. 2011;113(10):1198ā206. https://doi.org/10.1002/ejlt.201100063.
Raboanatahiry N, Chao H, Dalin H, Pu S, Yan W, Yu L, et al. QTL alignment for seed yield and yield related traits in Brassica napus. Front Plant Sci. 2018;9:1127. https://doi.org/10.3389/fpls.2018.01127.
Dh F, Ly J, Mason AS, Ml X, Lr Z, Lz L, et al. Research progress and strategies for multifunctional rapeseed: a case study of China. J Integ Agric. 2016;15(8):1673ā84. https://doi.org/10.1016/S20953119(16)613849.
Yahya M, Dutta A, Bouri E, WadstrĆ¶m C, Uddin GS. Dependence structure between the international crude oil market and the European markets of biodiesel and rapeseed oil. Renewable Energy. 2022;197:594ā605. https://doi.org/10.1016/j.renene.2022.07.112.
Niazian M, Niedbała G. Machine learning for plant breeding and biotechnology. Agriculture. 2020;10(10):436. https://doi.org/10.3390/agriculture10100436.
Cai G, Yang Q, Chen H, Yang Q, Zhang C, Fan C, et al. Genetic dissection of plant architecture and yieldrelated traits in Brassica napus. Sci Rep. 2016;6(1):1ā16. https://doi.org/10.1038/srep21625.
Xiong Q, Tang G, Zhong L, He H, Chen X. Response to nitrogen deficiency and compensation on physiological characteristics, yield formation, and nitrogen utilization of rice. Front Plant Sci. 2018;9:1075. https://doi.org/10.3389/fpls.2018.01075.
Bennett EJ, Brignell CJ, Carion PW, Cook SM, Eastmond PJ, Teakle GR, et al. Development of a statistical crop model to explain the relationship between seed yield and phenotypic diversity within the Brassica napus genepool. Agronomy. 2017;7(2):31. https://doi.org/10.3390/agronomy7020031.
Collins NC, Tardieu F, Tuberosa R. Quantitative trait loci and crop performance under abiotic stress: where do we stand? Plant Physiol. 2008;147(2):469ā86. https://doi.org/10.1104/pp.108.118117.
YoosefzadehNajafabadi M, Earl HJ, Tulpan D, Sulik J, Eskandari M. Application of machine learning algorithms in plant breeding: predicting yield from hyperspectral reflectance in soybean. Frontiers Plant Sci. 2021;11:624273. https://doi.org/10.3389/fpls.2020.624273.
Luo Z, Wang M, Long Y, Huang Y, Shi L, Zhang C, et al. Incorporating pleiotropic quantitative trait loci in dissection of complex traits: seed yield in rapeseed as an example. Theor Appl Genet. 2017;130(8):1569ā85. https://doi.org/10.1007/s0012201729117.
Chen B, Xu K, Li J, Li F, Qiao J, Li H, et al. Evaluation of yield and agronomic traits and their genetic variation in 488 global collections of Brassica napus L. Genet Resour Crop Evol. 2014;61(5):979ā99. https://doi.org/10.1007/s1072201400918.
Kumar S, Ali B, Khaldun A, Islam SS, Uddin MS, Akanda ML, et al. Genetic diversity, correlations and path coefficient analysis among the F5 populations of Brassica Species. Asian J Adv Agric Res. 2021;16(2):20ā31. https://doi.org/10.9734/ajaar/2021/v16i230170.
Rameeh V. Multivariate analysis of some important quantitative traits in rapeseed (Brassica napus) advanced lines. J Oilseed Brassica. 2016;1(2):162ā9.
Abdipour M, YounessiHmazekhanlu M, Ramazani SHR. Artificial neural networks and multiple linear regression as potential methods for modeling seed yield of safflower (Carthamus tinctorius L). Indust Crops Products. 2019;127:185ā94. https://doi.org/10.1016/j.indcrop.2018.10.050.
Hesami M, Pepe M, Monthony AS, Baiton A, Jones AMP. Modeling and optimizing in vitro seed germination of industrial hemp (Cannabis sativa L). Indust Crops Product. 2021;170:113753. https://doi.org/10.1016/j.indcrop.2021.113753.
Niazian M, SadatNoori SA, Abdipour M. Modeling the seed yield of Ajowan (Trachyspermum ammi L) using artificial neural network and multiple linear regression models. Indust Crops Product. 2018;117:224ā34. https://doi.org/10.1016/j.indcrop.2018.03.013.
Hesami M, Naderi R, Tohidfar M, YoosefzadehNajafabadi M. Development of support vector machinebased model and comparative analysis with artificial neural network for modeling the plant tissue culture procedures: effect of plant growth regulators on somatic embryogenesis of chrysanthemum, as a case study. Plant Methods. 2020;16(1):1ā15. https://doi.org/10.1186/s13007020006559.
Sheikhi A, Mirdehghan SH, Arab MM, Eftekhari M, Ahmadi H, Jamshidi S, et al. Novel organicbased postharvest sanitizer formulation using box Behnken design and mathematical modeling approach: a case study of fresh pistachio storage under modified atmosphere packaging. Postharvest Biol Technol. 2020;160: 111047.
Chang CC, Lin CJ. LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol (TIST). 2011;2(3):1ā27. https://doi.org/10.1145/1961189.1961199.
Moravej M, Amani P, HosseiniMoghari SM. Groundwater level simulation and forecasting using interior search algorithmleast square support vector regression (ISALSSVR). Groundwater Sustain Develop. 2020;11:100447. https://doi.org/10.1016/j.gsd.2020.100447.
Zhou Z, Morel J, Parsons D, Kucheryavskiy SV, Gustavsson AM. Estimation of yield and quality of legume and grass mixtures using partial least squares and support vector machine analysis of spectral data. Computers Electron Agric. 2019;162:246ā53. https://doi.org/10.1016/j.compag.2019.03.038.
Khosla E, Dharavath R, Priya R. Crop yield prediction using aggregated rainfallbased modular artificial neural networks and support vector regression. Environ Dev Sustain. 2020;22(6):5687ā708. https://doi.org/10.1007/s1066801900445x.
Fan J, Zheng J, Wu L, Zhang F. Estimation of daily maize transpiration using support vector machines, extreme gradient boosting, artificial and deep neural networks models. Agric Water Manag. 2021;245:106547. https://doi.org/10.1016/j.agwat.2020.106547.
Niedbała G, Piekutowska M, Weres J, Korzeniewicz R, Witaszek K, Adamski M, et al. Application of artificial neural networks for yield modeling of winter rapeseed based on combined quantitative and qualitative data. Agronomy. 2019;9(12):781. https://doi.org/10.3390/agronomy9120781.
Niedbała G. Simple model based on artificial neural network for early prediction and simulation winter rapeseed yield. J Integr Agric. 2019;18(1):54–61.
Ogasawara E, Martinez LC, De Oliveira D, ZimbrĆ£o G, Pappa GL, Mattoso M, editors. Adaptive normalization: A novel data normalization approach for nonstationary time series. The 2010 International Joint Conference on Neural Networks (IJCNN); 2010: IEEE.
Géron A. Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow: concepts, tools, and techniques to build intelligent systems. O'Reilly Media; 2019.
Yeo IK, Johnson RA. A new family of power transformations to improve normality or symmetry. Biometrika. 2000;87(4):954ā9.
Perlich C, Provost F, Simonoff J. Tree induction vs. logistic regression: A learningcurve analysis. J Machine Learning Res. 2003.
Darlington RB, Hayes AF. Regression analysis and linear models: concepts, applications, and implementation. Guilford: Guilford Publications; 2016.
Tsuruoka Y, Tsujii Ji, Ananiadou S, editors. Stochastic gradient descent training for l1regularized loglinear models with cumulative penalty. Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP. 2009.
McCullagh P, Nelder J. Generalized linear models. 2nd ed. London: Chapman & Hall; 1989.
Jørgensen B. The theory of exponential dispersion models and analysis of deviance. IMPA. 1992.
Bishop CM, Nasrabadi NM. Pattern recognition and machine learning. Berlin: Springer; 2006.
MacKay DJ. Bayesian interpolation. Neural Comput. 1992;4(3):415ā47. https://doi.org/10.1162/neco.1992.4.3.415.
Tipping ME. Sparse Bayesian learning and the relevance vector machine. J Machine Learn Res. 2001;1:211ā44.
Wipf D, Nagarajan S. A new view of automatic relevance determination. Adv Neural Inf Process Syst. 2007;49(1):641.
Andersen CM, Bro R. Variable selection in regressionāa tutorial. J Chemometrics. 2010;24(11ā12):728ā37. https://doi.org/10.1002/cem.1360.
Tibshirani R. Regression shrinkage and selection via the lasso. J Roy Stat Soc: Ser B (Methodol). 1996;58(1):267ā88. https://doi.org/10.1111/j.25176161.1996.tb02080.x.
Bruce P, Bruce A, Gedeck P. Practical statistics for data scientists: 50+ essential concepts using R and Python. O'Reilly Media; 2020.
Ying X, editor. An overview of overfitting and its solutions. Journal of Physics: Conference Series; 2019: IOP Publishing.
Hutter F, Hoos H, LeytonBrown K, editors. An efficient approach for assessing hyperparameter importance. International conference on machine learning; 2014: PMLR.
Weerts HJ, Mueller AC, Vanschoren J. Importance of tuning hyperparameters of machine learning algorithms. arXiv. 2020. https://doi.org/10.48550/arXiv.2007.07588.
Gy LU, Zhang F, Py Z, Cheng Y, FengIan L, Gp FU, et al. Relationship among yield components and selection criteria for yield improvement in early rapeseed (Brassica napus L). Agric Sci Chin. 2011;10(7):997ā1003. https://doi.org/10.1016/S16712927(11)600862.
Özer H, Oral E, Doğru Ü. Relationships between yield and yield components on currently improved spring rapeseed cultivars. Tur J Agric Forest. 1999;23(6):603–8.
Marjanović-Jeromela A, Marinković R, Ivanovska S, Jankulovska M, Mijić A, Hristov N. Variability of yield determining components in winter rapeseed (Brassica napus L) and their correlation with seed yield. Genetika-Belgrade. 2011;43(1):51–66.
Hashemi AS, Ghorban AN, Nadali BJ, Omid GC. Genetic evaluation of yield and yield components at advanced generations in rapeseed (Brassica napus L). Afr J Agri Res. 2010;5(15):1958ā64.
Rameeh V. Correlation and path analysis in advanced lines of rapeseed (Brassica napus) for yield components. J Oilseed Brassica. 2016;1(2):56ā60.
Szała L, Kaczmarek Z, Wójtowicz M, Sosnowska K, Cegielska-Taras T. Genetic variability in oilseed rape DH line population developed from F1 hybrids obtained by crossing black- and yellow-seeded DH lines. I. Yield and yield components. Euphytica. 2021;217(6):1–11. https://doi.org/10.1007/s10681020027450.
Aftab M, Fatima S, Qamar H, Hassan M, Zubair M, Arshad A, et al. Study of morphological characters give an insight into the genetic variation present in Brassica napus L germplasm. Life Sci J. 2020;17(3):56ā61.
Tariq H, Tanveer SK, Qamar M, Javaid RA, Vaseer SG, Jhanzab HM, et al. Correlation and path analysis of Brassica napus genotypes for yield related traits. Life Sci J. 2020;17(8):22ā34. https://doi.org/10.7537/marslsj170820.05.
Wang Y, Ding Y, Wei W, Cao Y, Davidson DF, Hanson RK. On estimating physical and chemical properties of hydrocarbon fuels using midinfrared FTIR spectra and regularized linear models. Fuel. 2019;255:115715. https://doi.org/10.1016/j.fuel.2019.115715.
Punitha S, AlTurjman F, Stephan T. An automated breast cancer diagnosis using feature selection and parameter optimization in ANN. Computer Electr Eng. 2021;90:106958. https://doi.org/10.1016/j.compeleceng.2020.106958.
May R, Dandy G, Maier H. Review of input variable selection methods for artificial neural networks. Artif Neural NetworksMethodol Adv Biomed Appl. 2011;10(1):19ā45.
Wang K, Abid MA, Rasheed A, Crossa J, Hearne S, Li H. DNNGP, a deep neural networkbased method for genomic prediction using multiomics data in plants. Mol Plant. 2023;16(1):279ā93. https://doi.org/10.1016/j.molp.2022.11.004.
Allen E, Morgan D. A quantitative analysis of the effects of nitrogen on the growth, development and yield of oilseed rape. J Agric Sci. 1972;78(2):315ā24. https://doi.org/10.1017/S0021859600069161.
Habekotté B. Options for increasing seed yield of winter oilseed rape (Brassica napus L): a simulation study. Field Crops Res. 1997;54(2–3):109–26. https://doi.org/10.1016/S03784290(97)000415.
Acknowledgements
The authors sincerely appreciate Mrs. Leili Alizadeh for her valuable support during the field experiments of this study.
Funding
This study has not been funded by any organization or institute in the public, commercial, or notforprofit sectors.
Author information
Authors and Affiliations
Contributions
MSH conducted the field experiments, designed and carried out the computational section, and prepared the main manuscript text. VM and BA designed the field experiments, provided the plant materials and revised the manuscript. HA revised the manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1:
The performance of regressionbased machine learning algorithms using selected traits by feature selection methods as inputs to predict the seed yield of rapeseed.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Shahsavari, M., Mohammadi, V., Alizadeh, B. et al. Application of machine learning algorithms and feature selection in rapeseedĀ (Brassica napus L.) breeding for seed yield. Plant Methods 19, 57 (2023). https://doi.org/10.1186/s13007023010359
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s13007023010359
Keywords
 Rapeseed
 Machine learning
 Feature selection
 Selection criteria
 Seed yield prediction