Deep convolutional neural network for automatic discrimination between Fragaria × Ananassa flowers and other similar white wild flowers in fields

Background: Images of different flower species show small inter-class variation across classes as well as large intra-class variation within a class. Flower classification techniques are mainly based on color, shape and texture features; however, the procedure involves many heuristics as well as manual labor to tweak parameters, which often leads to results with poor qualitative and quantitative measures. The current study proposes a deep convolutional neural network (CNN) architecture to improve the accuracy of distinguishing the white flowers of Fragaria × ananassa from three other wild flower species, Androsace umbellata (Lour.) Merr., Bidens pilosa L. and Trifolium repens L., in fields.
Results: The explored CNN architecture consisted of eight layers with learnable weights, namely 5 convolutional layers and 3 fully connected layers, and received a true color 227 × 227 pixel flower image as its input. The developed CNN detector classified the flower instances at overall average accuracies of 99.2 and 95.0% in the training and test procedures, respectively. The CNN model was compared with classical models that combine scale-invariant feature transform (SIFT) features or pyramid histogram of oriented gradients (PHOG) features with the multi-class support vector machine (SVM) algorithm. The proposed model turned out to be much more accurate than the traditional models: SIFT + SVM reached overall average accuracies of 82.9 and 55.6% in the training and test procedures, and PHOG + SVM reached 78.3 and 63.1%, respectively.
Conclusions: The proposed CNN method demonstrates that artificial intelligence is capable of precise classification of white flower images, with an accuracy exceeding that of the traditional algorithms. The presented algorithm can be further used for the discrimination of white wild flowers in fields.


Open Access. Plant Methods. *Correspondence: billrange007@gmail.com. 1 College of Electrical Engineering, Yancheng Institute of Technology, No.1 Middle Road Hope Avenue, Yancheng 224051, Jiangsu Province, People's Republic of China. Full list of author information is available at the end of the article.

Background
The distribution and yield of flowers in fields are of significant agronomic importance, being the precursor of the quality of fruits and seeds [8,24,33]. Although several systems have been exploited to manage them in the past decade, the development of fine flower detection systems is still one of the important issues in modern smart agriculture [13,15,30]. Discrimination of flower species is a difficult task for current detection algorithms, because typical flower images show great variation in viewpoint, scale and illumination, as well as partial occlusions, multiple instances, etc. [6,22,34]. Complex backgrounds make the discrimination task more difficult still, because the classifier risks discriminating background scenes rather than the object itself [18,23]. Perhaps the greatest challenge originates from intra-category versus inter-category variation, i.e. images of different categories can differ less than images within a single category, and yet subtle variations between instances determine their species [5,6,19].
Traditional flower classification is mainly based on three features: color, shape and texture. This requires people to select the features used for classification. An approach using color, shape and texture features was proposed to distinguish flower categories [25]. However, Nilsback's approach used only a single scale to extract the flower features. Multiple-scale features such as the scale-invariant feature transform (SIFT) and Gabor-based descriptors were proposed to improve the identification accuracy. A method using multiple color SIFT features was proposed to improve the performance of flower image classification [32]. Guru et al. [14] presented a model that extracts the grey-level co-occurrence matrix, color texture moments and Gabor descriptors from flower images for flower classification. In order to fuse multiple features from one image, the visual vocabulary method maps each feature through a clustering process, so that the image can be represented by a histogram built from independent features. Hu et al. [16] explored a visual vocabulary method to describe four kinds of color-SIFT features for the discrimination of flower images. In addition to improving the feature extraction algorithms, scholars also attempted to improve recognition performance through the feature recognition algorithms. A marginalized kernel algorithm was developed by utilizing the responses of a logistic regression-based fusion model for detecting flower images [11]. Those models have demonstrated effectiveness for image classification to a certain degree. However, plenty of parameters of the feature extraction algorithms needed to be tuned, and many different types of features needed to be reshaped to species semantics. The spatial information and correlations were sometimes neglected when considering the local features. Besides, the encoding of local features causes some information loss, which also hinders the final image classification performance. These algorithms involve many heuristics as well as manual labor to tweak parameters according to the domain in order to reach a decent level of accuracy.
Recently, the biologically inspired two-dimensional convolutional neural network (CNN) has been used as an effective tool for extracting image features, giving superior accuracy on classification, segmentation and retrieval tasks [21]. The basic idea of the CNN is to build invariance properties into neural networks by creating models that are invariant to certain transformations of the input [35]. The CNN architecture consists of alternately stacked convolutional layers and spatial pooling layers. The convolutional layer extracts feature maps by linear convolutional filters followed by nonlinear activation functions such as rectified linear units. Spatial pooling groups the local features from spatially adjacent pixels, which is typically done to improve the robustness to slight deformations of objects [10]. Our network consists of eight layers with learnable weights, similar to the AlexNet network structure [27]: 5 convolutional layers and 3 fully connected layers. The convolutional layers and the max pooling layers in the whole CNN are used to capture the deep-level information of flower images. The over-fitting that arises when determining the characteristic parameters of the network is addressed by the stochastic gradient descent method. The classical algorithms of SIFT and pyramid histogram of oriented gradients (PHOG) [4], combined with the multi-class support vector machine (SVM) [3], are compared with the state-of-the-art multi-level convolutional CNN architecture on the flower dataset to exhibit the advantage of the proposed architecture.
One of our main goals is to build an artificial intelligence flower recognition system that accurately and automatically distinguishes different species of flowers in Fragaria × ananassa fields. The presented system passes true color 227 × 227 pixel white flower images through 8 layers with learnable weights, comprising 5 convolutional layers and 3 fully connected layers. The input level therefore has 51,529 (227 × 227) neuron units per color channel, and the first convolutional layer has a set of 96 filters. The subsampling stages contain rectified linear unit layers and pooling layers. The final level is the fully connected layer with 4 neurons. The over-fitting problem in determining the characteristic parameters of the network is addressed by the stochastic gradient descent method. To this end, our team has set up a CNN architecture to recognize a flower dataset that consists of one cultivated species, Fragaria × ananassa, and three wild flower species, Androsace umbellata (Lour.) Merr., Bidens pilosa L. and Trifolium repens L. The experimental image dataset contains blurred, scale-variant, intra-class variant and inter-class similar objects. The photographs of flowers were all captured in natural settings with rich and complex backgrounds. Although the background usually serves as a distractor to the detection model, it can sometimes supply useful information, so background content is also considered as feature information for the detection target. The rest of the paper is organized as follows: first, the experimental data and devices are presented; then the experimental methods are introduced; next, the experimental results are analyzed and discussed; finally, the conclusions are drawn.

Experiment data
The experimental database is composed of four distinct flower species: Androsace umbellata (Lour.) Merr., Bidens pilosa L., Trifolium repens L. and Fragaria × ananassa. The photos of white flowers were taken with digital cameras in the wild. The flower objects with petals and sepals were cropped individually from the raw digital photos by hand. The experimental image dataset contains blurred, scale-variant, intra-class variant and inter-class similar objects. The photographs of flowers were all captured in natural settings with rich and complex backgrounds. Although the background usually serves as a distractor to the detection model, it can sometimes supply useful information, so background content is also considered as feature information for the detection target. Some primary properties of these white flowers are summarized in Table 1. There are a total of 400 flower images in the database, with 100 images per species. For modeling the relationship between the flower features and the corresponding logical attributes, the experiment employed 60 images for training and 40 images for testing for each species.

Experimental devices
The CNN classification algorithm was trained on the flower image dataset using an Alienware 17 R4 laptop (DELL, USA) with an NVIDIA GeForce GTX 1070 integrated RAMDAC 16 GB graphics card and an Intel Core(TM) i7-6700H CPU. The algorithms were run in Matlab R2017a (The MathWorks, Natick, USA) on the Windows 10 (Microsoft, USA) operating system. Caffe, originally developed by the Berkeley Vision and Learning Center, was used as the deep learning framework [17]. Caffe supported general-purpose computing on NVIDIA graphics processing units through the parallel computing platform and application programming interface CUDA 8 together with the deep neural network library cuDNN 7. In our experiment, we took advantage of the NVIDIA GTX 1070 graphics card with 8 GB memory and 1024 kernels.

Scale invariant feature transform (SIFT) descriptor
The SIFT algorithm extracts distinctive invariant features to represent the image. It uses the difference-of-Gaussian function DoG(x, y, σ) in scale space to discover potential interest points:

$$\mathrm{DoG}(x, y, \sigma) = \left(G(x, y, \xi\sigma) - G(x, y, \sigma)\right) \otimes I(x, y) \tag{1}$$

where the symbol ⊗ denotes the convolution operator, ξ is a constant, σ is the scale factor, I(x, y) is the given input image and

$$G(x, y, \sigma) = \frac{1}{2\pi\sigma^{2}}\, e^{-(x^{2}+y^{2})/(2\sigma^{2})}$$

The gradient magnitudes and orientations of the adjacent pixels around a candidate interest point are used to construct the gradient-orientation histogram. In the experiments, a 4 × 4 array of 8-bin histograms is used, giving a 128-dimensional SIFT descriptor for each key point [32].
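As a minimal sketch of the difference-of-Gaussian response in Eq. (1), the following pure-Python example samples the kernel G(x, y, ξσ) − G(x, y, σ) on an integer grid and evaluates its response at one pixel. The values ξ = √2 and σ = 1.6, the kernel radius, the zero padding and the helper names are illustrative assumptions; they are not taken from the paper.

```python
import math

def gaussian(x, y, sigma):
    """2-D Gaussian G(x, y, sigma) evaluated at integer offsets (x, y)."""
    return math.exp(-(x * x + y * y) / (2.0 * sigma ** 2)) / (2.0 * math.pi * sigma ** 2)

def dog_kernel(sigma, xi=math.sqrt(2.0), radius=None):
    """Sampled kernel G(x, y, xi*sigma) - G(x, y, sigma) on a square grid.

    xi and the default radius are assumed values chosen for illustration.
    """
    if radius is None:
        radius = int(math.ceil(4 * xi * sigma))
    offsets = range(-radius, radius + 1)
    return [[gaussian(x, y, xi * sigma) - gaussian(x, y, sigma) for x in offsets]
            for y in offsets]

def response_at(image, kernel, row, col):
    """Convolution of the kernel with the image at one pixel (zero padding)."""
    radius = len(kernel) // 2
    total = 0.0
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            r, c = row - dy, col - dx
            if 0 <= r < len(image) and 0 <= c < len(image[0]):
                total += kernel[dy + radius][dx + radius] * image[r][c]
    return total

# Descriptor length stated in the text: 4 x 4 histograms with 8 bins each.
SIFT_DESCRIPTOR_LENGTH = 4 * 4 * 8

kernel = dog_kernel(sigma=1.6)
flat_image = [[1.0] * 25 for _ in range(25)]
spot_image = [[0.0] * 25 for _ in range(25)]
spot_image[12][12] = 1.0
flat_response = response_at(flat_image, kernel, 12, 12)  # close to zero
spot_response = response_at(spot_image, kernel, 12, 12)  # negative for a bright spot
```

The kernel sums to approximately zero, so a uniform region gives almost no response, while an isolated bright pixel gives a clearly non-zero one. This is why extrema of the DoG response are used as candidate interest points.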

Pyramid histogram of oriented gradients (PHOG) descriptor
PHOG is a spatial pyramid extension of the histogram of oriented gradients (HOG) descriptor. HOG is an effective method to characterize the target edge or gradient orientation by extracting the magnitude and orientation of the gradient distribution in a localized area of an image I(x, y). The magnitude M(x, y) and orientation φ(x, y) of the gradient at a pixel are computed as:

$$M(x, y) = \sqrt{I_x(x, y)^{2} + I_y(x, y)^{2}}, \qquad \varphi(x, y) = \arctan\frac{I_y(x, y)}{I_x(x, y)}$$

where I_x and I_y denote the horizontal and vertical derivatives of I(x, y). Nevertheless, the HOG descriptor does not take into account the division of the image at different spatial scales. The PHOG descriptor is computed by weighting each edge orientation according to its magnitude at different spatial levels. The PHOG descriptor thus extends the HOG descriptor to describe both the global shape and the local details of an image [4].
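The pyramid idea can be sketched as follows: the gradient magnitude and orientation are computed per pixel, accumulated into a magnitude-weighted orientation histogram for each cell, and the cell histograms of all pyramid levels are concatenated. This is a simplified illustration, not the authors' implementation: the central-difference gradient, the unsigned orientation range [0, π), the 8 bins and the 3 pyramid levels are all assumptions.

```python
import math

def gradient(image, row, col):
    """Central-difference gradient (gx, gy); border pixels are clamped."""
    rows, cols = len(image), len(image[0])
    gx = image[row][min(col + 1, cols - 1)] - image[row][max(col - 1, 0)]
    gy = image[min(row + 1, rows - 1)][col] - image[max(row - 1, 0)][col]
    return gx, gy

def cell_histogram(image, r0, r1, c0, c1, bins=8):
    """Orientation histogram of one cell, each vote weighted by gradient magnitude."""
    hist = [0.0] * bins
    for r in range(r0, r1):
        for c in range(c0, c1):
            gx, gy = gradient(image, r, c)
            magnitude = math.hypot(gx, gy)
            orientation = math.atan2(gy, gx) % math.pi  # unsigned angle in [0, pi)
            index = min(int(orientation / math.pi * bins), bins - 1)
            hist[index] += magnitude
    return hist

def phog(image, levels=2, bins=8):
    """Concatenate cell histograms over a 1x1, 2x2, ... spatial pyramid."""
    rows, cols = len(image), len(image[0])
    descriptor = []
    for level in range(levels + 1):
        cells = 2 ** level
        for i in range(cells):
            for j in range(cells):
                descriptor += cell_histogram(
                    image,
                    i * rows // cells, (i + 1) * rows // cells,
                    j * cols // cells, (j + 1) * cols // cells,
                    bins,
                )
    return descriptor

# Toy 8 x 8 image with a single vertical edge between columns 3 and 4.
edge_image = [[0.0] * 4 + [1.0] * 4 for _ in range(8)]
edge_descriptor = phog(edge_image)
```

For this toy edge image all gradient energy falls into the first orientation bin, and the descriptor has (1 + 4 + 16) × 8 = 168 entries.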

Support vector machine (SVM)
The multi-class SVM assigns labels to instances by building on the binary SVM, where the labels are drawn from a finite set of several elements. The training dataset, a set of N points, is given as:

$$T = \{(x_j, y_j)\}_{j=1}^{N}, \qquad y_j \in \{1, \ldots, M\}$$

where y_j indicates the class to which the point x_j belongs. The multi-class method builds binary classifiers, each of which distinguishes one of the labels from the rest (one-versus-all). The ith classifier is trained with all the training instances of the ith class as positive examples and all the rest as negative examples. The one-versus-all approach uses the decision hyperplane f_i(x) = ω_i^T ϕ(x) + b_i to evaluate the class by solving the following optimization problem:

$$\min_{\omega_i,\, b_i,\, \zeta^{i}} \; \frac{1}{2}\,\omega_i^{T}\omega_i + C \sum_{j=1}^{N} \zeta_j^{i} \quad \text{s.t.} \quad \hat{y}_j\left(\omega_i^{T}\varphi(x_j) + b_i\right) \ge 1 - \zeta_j^{i}, \quad \zeta_j^{i} \ge 0$$

where C is the tuning parameter and ζ_j^i is the slack variable. If y_j belongs to the ith class, ŷ_j = 1; otherwise ŷ_j = −1. Finally, the class to which an unknown instance x belongs is determined by the largest value of f_i(x) [4]:

$$\mathrm{class}(x) = \arg\max_{i = 1, \ldots, M} f_i(x)$$
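The one-versus-all decision rule can be sketched in a few lines: each class has its own decision function f_i(x), and the predicted class is the one with the largest value. The sketch assumes a linear kernel (ϕ is the identity) and uses hypothetical, hand-picked weights rather than trained ones; `relabel` shows how the ±1 targets ŷ_j for the ith binary problem are derived from the class labels.

```python
def decision_value(weights, bias, features):
    """f_i(x) = w_i . x + b_i for a linear kernel (phi is the identity, an assumption)."""
    return sum(w * x for w, x in zip(weights, features)) + bias

def one_versus_all_predict(classifiers, features):
    """Return the index of the class with the largest decision value, plus all scores."""
    scores = [decision_value(w, b, features) for w, b in classifiers]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

def relabel(labels, positive_class):
    """Targets for one binary problem: +1 for the chosen class, -1 for the rest."""
    return [1 if y == positive_class else -1 for y in labels]

# Hypothetical (weights, bias) pairs for three classes; not fitted to any data.
toy_classifiers = [([1.0, 0.0], 0.0), ([0.0, 1.0], 0.0), ([-1.0, -1.0], 0.5)]
predicted, toy_scores = one_versus_all_predict(toy_classifiers, [2.0, 0.5])
binary_targets = relabel([0, 1, 2, 1], positive_class=1)
```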

CNNs architecture
A typical CNN for classification usually consists of an input layer, convolutional layers, rectified linear unit (ReLU) layers, pooling layers, fully connected layers and dropout layers [10,35]. The overall deep CNN architecture for detecting the four species of white flowers, Fragaria × ananassa, Androsace umbellata (Lour.) Merr., Bidens pilosa L. and Trifolium repens L., is illustrated in Fig. 1. The network takes a fixed-size 227 × 227 pixel true color image as its input. The following convolutional operation computes the output of neurons that are connected to local regions in the input layer. The input image is convolved with 96 filters of receptive field size 11 × 11 × 3 at stride 4. Sliding the filter over the input at a stride of 4 gives 55 locations along both width and height, so the receptive fields can be arranged as a matrix of size (11 × 11 × 3) × (55 × 55). The convolution is then equivalent to one large matrix multiplication, which evaluates the dot product between every filter and every receptive field location. The output of this operation is of size 96 × 55 × 55, giving the dot product of each filter at each location. The next ReLU layer applies an elementwise activation function that thresholds at zero. The ReLU replaces the earlier standard Sigmoid units in neural network architectures, because the classical Sigmoid function sometimes produces vanishing gradients when the derivative is calculated in the saturating region. The ReLU function avoids this issue and learns much faster than the Sigmoid function, so it was placed after every convolutional and fully connected layer. The following pooling layer performs downsampling along the width and height spatial dimensions. The subsequent fully connected layer produces a category score corresponding to the input attributes. In this layer each neuron is linked to all the activations of the previous layer. A dropout layer appears after every fully connected layer. It applies a probability at every neuron of the response map and randomly switches off the activation with that probability to diminish over-fitting. This deep CNN structure is applied to the automatic discrimination between Fragaria × ananassa flowers and other similar white wild flowers in fields.
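The layer arithmetic quoted above can be checked with a short sketch. It computes the number of filter positions for a 227-pixel input, an 11-pixel filter and stride 4, the shape of the receptive-field matrix used in the matrix-multiplication view of convolution, and compares the ReLU and Sigmoid derivatives in the saturating region. Zero padding and one bias per filter are assumptions of this sketch; the text does not state them.

```python
import math

def conv_output_size(input_size, kernel_size, stride, padding=0):
    """Number of filter positions along one spatial dimension."""
    span = input_size + 2 * padding - kernel_size
    if span < 0 or span % stride != 0:
        raise ValueError("filter does not tile the input evenly")
    return span // stride + 1

def relu(x):
    """Elementwise maximum with zero."""
    return max(0.0, x)

def relu_derivative(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

def sigmoid_derivative(x):
    """Derivative of the logistic Sigmoid, which shrinks towards 0 when |x| is large."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

side = conv_output_size(227, 11, 4)             # positions along width and height
patch_length = 11 * 11 * 3                      # values in one receptive field
locations = side * side                         # receptive-field locations
receptive_field_matrix_shape = (patch_length, locations)
output_shape = (96, side, side)                 # one map per filter
first_layer_parameters = 96 * patch_length + 96 # weights plus one assumed bias per filter
```

With these numbers, `side` is 55, the receptive-field matrix is 363 × 3025 and the first layer holds 34,944 learnable parameters. At an input of 10 the Sigmoid derivative is below 10⁻⁴ while the ReLU derivative is still 1, which illustrates the vanishing-gradient argument.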

Feature extraction
The first step in the pipeline of a standard CNN architecture is feature extraction. The CNN takes an input white flower image and uses a convolutional feature map φ of size H × W × D to generate features at different levels for the final classifiers, where H, W and D are the height, the width and the number of filters. In order to quickly learn effective features in a new classification task using a relatively small number of training images, we use transfer learning to fine-tune a pre-trained network. This training method is usually much faster and easier than training a network from scratch with randomly initialized weights. Most pre-trained networks have been trained on the ImageNet dataset, which has 1000 object categories and 1.2 million training images. A similar approach has been used previously in discriminative tasks, achieving high recognition performance with CNN-based detectors. The network structure originally trained on ImageNet for image classification is therefore used for the feature extraction [20]. The layer properties of the CNN architecture are listed in Table 2. The network consists of twenty-five layers, which are summarized into 8 layers according to their function in processing the features. These eight layers with learnable weights comprise five convolutional layers and three fully connected layers.

Stochastic gradient descent method
The gradient descent algorithm [28] is used to optimize the network parameters in order to minimize the back-propagation error on the training dataset. The gradient descent algorithm updates the parameter vector so as to minimize the loss function by taking small steps in the direction of the negative gradient of the loss function:

$$\chi_{i+1} = \chi_i - \eta\, \nabla \psi(\chi_i)$$

where η is the learning rate, χ is the parameter vector, ψ(χ) is the loss function and i denotes the iteration number. The standard gradient descent algorithm sometimes oscillates along the steepest descent route while searching for the optimum. In order to reduce the oscillation, a momentum term is added to the above gradient descent update:

$$\chi_{i+1} = \chi_i - \eta\, \nabla \psi(\chi_i) + \tau\, (\chi_i - \chi_{i-1})$$

where τ ∈ [0, 1] is the momentum coefficient. The normal gradient descent algorithm estimates the gradient of the loss function ψ(χ) using the entire dataset at once. The stochastic gradient descent algorithm estimates the gradient of the loss function ψ(χ) and updates the parameters using a stochastic subset of the dataset. In this paper, the size of the stochastic subset (mini-batch) used to train the CNN model is set to 10.
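A minimal sketch of stochastic gradient descent with momentum on a toy problem may make the update concrete. It fits a single weight w so that w·x matches y = 2x under a squared-error loss, drawing random subsets of 10 samples per step. The toy data, the learning rate of 0.1, the number of epochs and the random seed are illustrative assumptions; only the subset size of 10 and the momentum of 0.5 echo values mentioned in the paper.

```python
import random

def sgd_momentum(xs, ys, learning_rate=0.1, momentum=0.5,
                 batch_size=10, epochs=200, seed=0):
    """Fit w in y ~ w * x by mini-batch gradient descent with a momentum term.

    Update: w_next = w - learning_rate * grad + momentum * (w - w_previous),
    where grad is the gradient of the mean of 0.5 * (w * x - y) ** 2 over the batch.
    """
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    weight, previous = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(indices)  # a new stochastic partition every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad = sum((weight * xs[k] - ys[k]) * xs[k] for k in batch) / len(batch)
            updated = weight - learning_rate * grad + momentum * (weight - previous)
            previous, weight = weight, updated
    return weight

# Toy, noise-free data generated from y = 2 * x (hypothetical, for illustration only).
toy_xs = [0.1 * k for k in range(1, 21)]
toy_ys = [2.0 * x for x in toy_xs]
fitted_weight = sgd_momentum(toy_xs, toy_ys)
```

Setting `momentum=0.0` reduces the update to plain stochastic gradient descent, which makes it easy to compare convergence for different momentum coefficients on the same data.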

Training networks
The CNN uses a receptive field-like layout in which each neuron receives connections only from a subset of neurons in the lower layer. The receptive field of a neuron in one of the lower layers encompasses only a small region of the image. The convolutional layer is sensitive to the size of the receptive field. When the original image sizes are between about 200 × 200 and 700 × 700, the receptive field can be set between 7 × 7 and 15 × 15 [27]. Structural properties tend to be captured by large convolutional kernels, while texture properties are captured by small convolutional kernels. Generally, a moderately sized kernel can balance the two tendencies. Figure 2 illustrates the 96 channels of rich structure and texture feature information captured from the Fragaria × ananassa flower image in the first convolutional layer using 11 × 11 convolutional kernels. These feature maps contain a variety of frequency-, orientation- and color-selective features. The second to fifth convolutional layers captured 256, 384, 384 and 256 channels of richer structure and texture feature information, respectively. The layers in the network can produce more complex structure and texture features of the flower image for the subsequent neurons. These features contribute to the superior performance in the task of identifying the white flower images.

Results and discussion
Momentum parameter determination for the CNNs.
The momentum term is the contribution of the previous gradient change. It can be seen that the contribution of the gradient change from the previous iteration to the current iteration on the training set greatly affects the convergence of the loss function. As the momentum coefficient increases through τ = 0.1, 0.3 and 0.5, the convergence performance gradually improves. As it increases further to τ = 0.7 and 0.9, the convergence performance becomes worse. This indicates that the added momentum term is able to reduce the oscillation when the algorithm searches for the optimum along the convex route. Although the curve with τ = 0.7 converges faster than the one with τ = 0.5 at the beginning, the curve with τ = 0.7 oscillates severely between iterations 50 and 90. The momentum parameter τ = 0.5 in the stochastic gradient descent function is therefore chosen for training the CNN model.

Accuracy by CNNs
The bottom layer of the CNN framework was used as filters for capturing blob and edge features. These primary features were then processed by the deeper layers of the network, which combined the early features to form higher-level semantic features. These higher-level semantic features were better suited for the subsequent recognition tasks [7]. In this paper, we used a multi-class SVM classifier at the top of the CNN-based classification architecture for training on the high-level CNN image features. The stochastic gradient descent algorithm was used to speed up the training on the high-dimensional CNN feature vectors. We first present the accuracy achieved by this CNN architecture. The CNN was trained offline on 240 white flower images, i.e., before it was employed for classification. The identification process itself performed species identification on 160 white flower images. The confusion matrix [9] is employed to summarize and visualize the classification performance of the CNN algorithm on the white flowers. As shown on Fig. 4

Comparing performance of algorithms
The precision-recall metric [29] is used to estimate the quality of the algorithms in detecting the flower species. The precision-recall curve shows the tradeoff between precision and recall for different thresholds. High precision relates to a low false positive rate, and high recall relates to a low false negative rate. High scores for both indicate that the classification model is returning accurate results as well as returning a majority of all positive results. We compared our method with category discovery methods that combine SVM with the traditional hand-engineered features of SIFT and PHOG. As shown in Fig. 5, as the recall rate increases, the corresponding precision of the CNN is much higher than that of the other two algorithms, SIFT + SVM and PHOG + SVM. The overall performance of the algorithms is measured with the mean average precision (mAP) score [12], which is the average precision at the ranks where recall changes. The geometric interpretation of the mAP score is the area below the curve. A large area under the precision-recall curve denotes superior overall performance of the algorithm, with a high mAP score. The CNN-based model achieves the highest mAP scores of 0.983 and 0.974 on the training and test flower image datasets, respectively (see Table 3). The comparison shows that the improvement of the proposed model for classification of white flower images with complex backgrounds is substantial on both the training and test datasets. It appears that more detailed features are abstracted effectively from the original images of white flowers by the deep learning method than by the other two algorithms.
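The definition of average precision as the mean of the precision values at the ranks where recall changes can be sketched as follows. The function names and the toy scores are hypothetical; the sketch assumes binary relevance labels per class (1 for the target species, 0 otherwise) and averages the per-class values to obtain mAP.

```python
def precision_recall_points(scores, labels):
    """(recall, precision) after each rank, with items sorted by decreasing score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_positive = sum(labels)
    points, hits = [], 0
    for rank, (_, label) in enumerate(ranked, start=1):
        hits += label
        points.append((hits / total_positive, hits / rank))
    return points

def average_precision(scores, labels):
    """Mean of the precision values at the ranks where a positive item is retrieved."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, precisions = 0, []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precisions.append(hits / rank)  # recall changes at this rank
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(per_class):
    """Average of the per-class average precision; per_class holds (scores, labels) pairs."""
    return sum(average_precision(s, l) for s, l in per_class) / len(per_class)

# Toy ranking: positives are retrieved at ranks 1 and 3.
toy_ap = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
perfect_ap = average_precision([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 0])
toy_map = mean_average_precision([
    ([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]),
    ([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 0]),
])
```

In the toy ranking the precision is 1/1 at the first positive and 2/3 at the second, so the average precision is 5/6, whereas a perfect ranking gives 1.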
For the flower recognition task to be implemented, the algorithm must cope well with the variability of flower appearance. The SIFT feature descriptor is invariant to uniform scaling, orientation [...] Table 3. The SIFT + SVM algorithm attains accuracies of 82.9 and 55.6% on the training and test sets, respectively. The PHOG + SVM algorithm achieves detection accuracies of 78.3 and 63.1% on the training and test sets, respectively. The identification accuracy of the CNN is 99.2 and 95.0% in the training and test procedures, respectively, which is much higher than that of the above two methods. The SIFT and HOG features are low-level features that do not make use of hierarchical layer-wise representation learning, whereas the CNN is a hierarchical deep learning model that is able to learn low-level features similar to SIFT and HOG from training examples alone and to build increasingly abstract representations on top of them. The multi-level deep convolutional structure can extract more detailed features from images and improve the accuracy of the results. The proposed method provides a superior alternative for the precise classification of the white flowers of Fragaria × ananassa from the three other wild species, Androsace umbellata (Lour.) Merr., Bidens pilosa L. and Trifolium repens L., in fields.
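How an overall accuracy figure follows from a confusion matrix can be sketched with a small example. The 3 × 3 matrix below is a toy with made-up counts and is not taken from the paper's results; rows are assumed to be true classes and columns predicted classes.

```python
def overall_accuracy(matrix):
    """Fraction of instances on the diagonal (rows: true class, columns: predicted)."""
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

def per_class_recall(matrix):
    """Correct predictions of a class divided by the number of true instances of it."""
    return [row[k] / sum(row) for k, row in enumerate(matrix)]

def per_class_precision(matrix):
    """Correct predictions of a class divided by the number of times it was predicted."""
    return [matrix[k][k] / sum(row[k] for row in matrix) for k in range(len(matrix))]

# Toy confusion matrix with made-up counts (10 instances per true class).
toy_matrix = [
    [9, 1, 0],
    [2, 7, 1],
    [0, 0, 10],
]
toy_accuracy = overall_accuracy(toy_matrix)
toy_recall = per_class_recall(toy_matrix)
toy_precision = per_class_precision(toy_matrix)
```

By the same arithmetic, 152 correct predictions out of 160 test images would correspond to an overall accuracy of 95.0%.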

Conclusions
In this investigation, we have presented a deep CNN architecture for classifying four species of white flowers: Androsace umbellata (Lour.) Merr., Bidens pilosa L., Trifolium repens L. and Fragaria × ananassa. The CNN-based algorithm achieved 99.2% training and 95.0% test accuracy in identifying the white flower images. The proposed model turns out to be much more accurate than the traditional models of SIFT + SVM and PHOG + SVM. The proposed CNN method demonstrated an artificial intelligence capable of precise classification of white flower images, with a level of competence exceeding that of the traditional algorithms compared here. Our team plans to enlarge the current flower dataset so that it consists of more wild flower species and more images. Further research is also necessary to evaluate performance in a real-time detection setting, in order to validate this technique across the full distribution and spectrum of Fragaria × ananassa flower fields encountered in typical practice. The technology can potentially be used to quickly and accurately count the strawberry flowers in fields from images captured by an unmanned ground vehicle.