Wheat ear counting using K-means clustering segmentation and convolutional neural network

Background Wheat yield is influenced by the number of ears per unit area, and manual counting has traditionally been used to estimate wheat yield. To realize rapid and accurate wheat ear counting, K-means clustering was used for the automatic segmentation of wheat ear images captured by hand-held devices. The segmented data set was constructed by creating four categories of image labels: non-wheat ear, one wheat ear, two wheat ears, and three wheat ears. It was then sent to the convolutional neural network (CNN) model for training and testing, which reduced the complexity of the model. Results The recognition accuracies for non-wheat ear, one wheat ear, two wheat ears, and three wheat ears were 99.8, 97.5, 98.07, and 98.5%, respectively. The model R² reached 0.96, the root mean square error (RMSE) was 10.84 ears, the macro F1-score and micro F1-score both reached 98.47%, and the best performance was observed during the late grain-filling stage (R² = 0.99, RMSE = 3.24 ears). The model could also be applied to the UAV platform (R² = 0.97, RMSE = 9.47 ears). Conclusions Classifying segmented images, as opposed to performing target recognition, not only reduces the workload of manual annotation but also significantly improves the efficiency and accuracy of wheat ear counting, thus meeting the requirements of wheat yield estimation in the field environment.


Background
Wheat is one of the most important food crops that play a significant role in national food security. The wheat grain-filling period is the key growth period that determines yield formation, and the number of ears per unit area is an important factor of yield [1][2][3]. Thus, it is of great significance to estimate wheat yield by rapidly determining the ear number. During production, the manual counting method is often used to estimate production, which is time-consuming and labor-intensive.
Conversely, machine vision, machine learning, and image processing technologies can be used to rapidly and accurately count the wheat ears per unit area. This is of great significance to wheat yield estimation and provides technical support and a foundation for the acquisition of wheat plant phenotypic information.
The development of high spatial resolution computer vision-based phenotype identification [4][5][6] has produced high-throughput phenotyping platforms [7]. Image processing technology has been used to identify the number of ears of wheat [8,9], but the methods focus on texture features, color segmentation, morphological extraction, and other feature extraction methods. Cointault et al. [10] used a color texture image analysis method based on mixed space to realize the recognition and counting of wheat ears. Fernandez-Gallego et al. [11] used local maximum peak values to count ears based on RGB color images in field conditions [12]. The current recognition methods based on image processing technology require extensive manual image feature extraction, which places high demands on the environment and technology.

Open Access. Plant Methods. *Correspondence: wheatdoctor@163.com. 1 Henan Agricultural University, Zhengzhou 450002, China. Full list of author information is available at the end of the article.

In recent years, machine learning has been shown to have a significant advantage in the field of machine vision, such as in image segmentation and object recognition [13][14][15]. Zhu et al. [16] used a support vector machine (SVM) segmentation model to realize wheat ear counting, and Li et al. [17] used a neural network based on texture features to detect ears, the accuracy of which exceeded 80%. Hasan et al. [18] used a deep learning method to detect and count wheat ears, achieving a highest accuracy of 94%. Madec et al. [19] used a CNN to identify wheat ears from low-spatial-resolution RGB images. Machine learning methods provide automatic feature extraction and excellent parameter adjustment, which greatly reduce manual feature extraction and interpretation. However, the use of machine learning to identify grains requires manual extraction of image features to build the data set. Thus, these methods are prone to some human error and also suffer from identification inaccuracy caused by the adhesion of multiple wheat ears. At the same time, a simple and rapid counting system for wheat ears is lacking, and the development of such a system would have a significant impact on wheat production.
Image processing methods are influenced by the extraction of image features, lighting conditions, shadows, and complex backgrounds [20], and the requirements of the environment and technology are limited by the data set itself [21]. Although wheat ear recognition methods based on CNN are advantageous, image features (wheat ears) need to be manually extracted in order to construct the dataset [22]. To overcome the above issues, we propose the use of image processing technology to extract wheat ear features rapidly, combining this with CNN to reduce the workload of manual labeling and improve the recognition accuracy.
In this paper, we use mobile devices to rapidly acquire wheat ear images in the field environment and extract the contour features of the wheat ears automatically based on the K-means clustering algorithm, thus reducing the workload of the manual extraction of wheat ear features. On this basis, we constructed an image classification dataset with four types of labels: non-wheat ear, one wheat ear, two wheat ears, and three wheat ears. Ultimately, a CNN model was constructed to realize the rapid and accurate identification of wheat ears in the complex field environment as well as to provide technical support for the accurate yield estimation of wheat.

Field experiments
Experiment 1 was conducted in 2018 and 2019 at the experimental farm on the campus of Henan Agricultural University in Xuchang, China (34°08′N, 113°48′E). The Xuchang site is in the center of China, with a typical temperate and monsoonal climate. The previous crop was soybean. The tested wheat varieties were AK58, XN509, YM49, and ZM27. Each experimental plot was 10 m long and 2 m wide, with a row spacing of 20 cm. A split-plot design with three replicates was adopted. To facilitate sampling and field operation, a 1 m wide channel was set up between plots. Nitrogen was applied as ammonium nitrate in the winter at a rate of 120 kg ha⁻¹ each year, and the plots were watered once during the overwintering period and once during the jointing period. Experiment 2 was conducted in 2018 at the Yuanyang Science and Education Park of Henan Agricultural University in Yuanyang, Xinxiang, China (35°6′N, 113°56′E). The Yuanyang site is in the center of China, with a warm temperate continental monsoon climate. The previous crop was maize. Ten wheat varieties were tested: SM159, XN20, XN511, YM11, ZM119, ZM136, ZM158, ZM318, ZM32, and ZM36. Each plot measured 25 × 5 m, and the row spacing was 20 cm. Nitrogen was applied as ammonium nitrate in the winter at a rate of 127.5 kg ha⁻¹, and the plots were watered once during the overwintering period and once during the jointing period.

Image acquisition
Wheat ear image data were captured during the flowering and grain-filling periods (Table 1). Images were acquired with a Redmi Note 7 mobile phone (Xiaomi, Beijing, China), a HUAWEI nova 3i mobile phone (HUAWEI, Shenzhen, China), and a DJI Phantom 3 Pro (DJI, Shenzhen, China). The Redmi Note 7 has 48-megapixel + 5-megapixel rear cameras, the HUAWEI nova 3i has 24-megapixel + 2-megapixel rear cameras, and the Phantom 3 Pro has a battery life of 23 min per flight and can hover while taking pictures. All three devices capture high-quality, full-color images. Image acquisition was carried out on both sunny and cloudy days. The image acquisition mode was vertical shooting. The ground resolution was 0.18-1.0. One flight at an altitude of 3 m was completed in Xuchang (Table 1) to verify the portability of the research method to the unmanned aerial vehicle (UAV) platform. One data collection was carried out in Yuanyang to verify the applicability of the method to different wheat varieties. On May 13, 2020 in Xuchang, a 12 × 30 cm white board was placed as the ground standard when each image was taken, and a one square meter area in the center of the shooting area was selected for manual wheat ear measurement and counting, to verify the applicability of the proposed method in field conditions.

Image processing
Wheat ear images were processed with image processing technology and were clustered and segmented, following which they were sent to the CNN model for learning and recognition. The algorithm flow chart is shown in Fig. 1.
To accelerate image processing, a 1400 × 1400 region was cropped from the center of the acquired image and scaled to 700 × 700. Following enhancement by histogram equalization, the contours of the wheat ears were extracted by K-means clustering segmentation. The segmented images were divided into four categories: non-wheat ear, one wheat ear, two wheat ears, and three wheat ears. The image processing algorithm was developed in Python (3.7, Python Software Foundation) using the OpenCV library (4.2) [23].
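The cropping and scaling step can be sketched as follows. This is a minimal illustration and not the authors' code: it uses NumPy only, the function names are hypothetical, and the 2 × 2 block averaging is an assumed stand-in for whatever interpolation the original OpenCV resize call used.

```python
import numpy as np

def center_crop(image: np.ndarray, size: int = 1400) -> np.ndarray:
    """Return the central size x size window of an H x W x C image."""
    h, w = image.shape[:2]
    if h < size or w < size:
        raise ValueError("image is smaller than the crop window")
    top, left = (h - size) // 2, (w - size) // 2
    return image[top:top + size, left:left + size]

def downscale_by_two(image: np.ndarray) -> np.ndarray:
    """Halve each side by averaging 2 x 2 pixel blocks (assumed interpolation)."""
    h, w, c = image.shape
    blocks = image[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2, c)
    return blocks.mean(axis=(1, 3)).astype(image.dtype)

# Synthetic stand-in for a phone photo; real images would be loaded from disk.
photo = np.zeros((2000, 3000, 3), dtype=np.uint8)
small = downscale_by_two(center_crop(photo))
print(small.shape)  # (700, 700, 3)
```

Working on the 700 × 700 image means every later step processes a quarter of the cropped pixels, which is where the speed-up comes from.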

Image denoising and enhancement
Due to the reflection of the wheat leaves under sunlight, the instability of the camera during shooting, and the influence of the natural environment, some noise will appear in the images. In addition, the image may be interfered with by random signals during the transmission process. It was thus necessary to enhance and denoise the wheat ear images.
The image was transformed into the CIELAB color space [24], and contrast-limited adaptive histogram equalization was applied to the L channel with a clip limit of 2.0 to enhance the image (using Python with the OpenCV library, the createCLAHE function with parameters clipLimit = 2.0 and tileGridSize = (8, 8)). A median filter with a kernel size of 3 was then applied to remove noise (using Python with the OpenCV library, the medianBlur function with parameter ksize = 3). Figure 2 shows the original wheat ear image and the enhanced wheat ear image. The wheat ear image is mainly composed of the ear, leaf, stem, and soil, and the ear color characteristics are more obvious when the wheat is in the filling stage. During the filling stage, the wheat ear gradually turns yellow, showing obvious color differences from the leaf and stem, as well as from the ground, but the difference between the wheat leaf color and stem color is small (Fig. 2). Enhancing the image increases the brightness of the wheat ears, which makes the contrast between the wheat ear and the background of stems and leaves more obvious and benefits the extraction of wheat ear features.

Table 1 Summary of the main image acquisition characteristics of the two experimental sites
The images collected by the mobile phones were taken by holding mobile phones or holding selfie sticks at an altitude of 1.5-2.2 m. The UAV images were taken at an altitude of 3 m

Image segmentation and wheat ear contour extraction
The K-means algorithm is a clustering algorithm based on iterative solution [25][26][27]. It uses distance as the index of similarity, meaning that the closer two data points are, the greater their similarity. The traditional method of extracting features by hand is time-consuming and labor-intensive and can easily produce errors in images of dense wheat ears. In this study, a K-means-based image segmentation algorithm was used for wheat ear segmentation to replace the traditional manual extraction of wheat ear color features and thus reduce the error of manual extraction. It was implemented with the KMeans function of the Python Scikit-learn library [28]. After image enhancement, there were obvious differences between the color of the wheat ear and the background colors of the stem, leaf, and transition areas. Clustering these directly into two groups leads to segmentation errors in the color transition area. Therefore, three clustering centers were selected, and K-means clustering was used to quantize the color of the wheat ear image. After clustering, the wheat ear image contains only the specified number of color categories. The process is as follows: the wheat ear image is clustered with three clustering centers, the clustered image is converted into a gray image, and the color of the wheat ear is assigned to black. A flow chart of this process is shown in Fig. 3.
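The three-center color quantization can be illustrated with the sketch below. It is not the authors' implementation: a plain NumPy version of Lloyd's iterations stands in for the Scikit-learn KMeans function, the deterministic brightness-rank initialization is an assumption, and treating the brightest cluster as the wheat ear is an assumption based on the statement that enhancement brightens the ears.

```python
import numpy as np

def kmeans_quantize(image, k=3, iters=20):
    """Cluster pixel colors with Lloyd iterations; returns (label map, centers)."""
    pixels = image.reshape(-1, image.shape[-1]).astype(np.float64)
    # Assumed initialization: pixels at evenly spaced brightness ranks.
    order = np.argsort(pixels.sum(axis=1))
    centers = pixels[order[np.linspace(0, len(pixels) - 1, k).astype(int)]].copy()
    for _ in range(iters):
        # Squared Euclidean distance from every pixel to every center.
        dists = ((pixels[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = pixels[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return labels.reshape(image.shape[:2]), centers

def ear_to_black(labels, centers):
    """Gray image with the assumed ear cluster (brightest center) set to black."""
    ear_cluster = int(np.argmax(centers.sum(axis=1)))
    return np.where(labels == ear_cluster, 0, 255).astype(np.uint8)

# Synthetic image with three color bands: soil, stem and leaf, wheat ear.
demo = np.zeros((30, 30, 3), dtype=np.uint8)
demo[:10] = (40, 30, 20)
demo[10:20] = (60, 140, 60)
demo[20:] = (200, 190, 90)
labels, centers = kmeans_quantize(demo)
gray = ear_to_black(labels, centers)
```

With only two centers, the intermediate stem-and-leaf band would have to join either the soil or the ear cluster, which is the transition-area error the third center is meant to avoid.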
According to the color characteristics of the wheat ear after clustering, the clustered wheat ear image is binarized (black for wheat ears, white for the background area, gray for the stalk and leaf). Because there is noise in the ear image after segmentation, some of the ears stick to each other. For the binary image, a morphological opening with a 6 × 6 kernel was used to remove background noise and the burrs around the wheat ears, and then a morphological closing with a 3 × 3 kernel was used to fill in the holes in the wheat ears, as indicated in Fig. 4a, b. The black area is the contour of the wheat ear after morphological processing.
By comparing the binary image with the original image, the wheat ear image was obtained using the contour feature of the wheat ear and the information of the center of mass, area, perimeter, and boundary frame of each black connected area, which were obtained with the OpenCV library using the findContours function with parameters contours = 1 and hierarchy = 5. After obtaining the boundary frame of each black connected region, the wheat ears were marked on the original image by a mask, and then the marked wheat ears were divided into small images and saved. A border was added to the original image to prevent the ears near the border from becoming indivisible. A complete wheat ear segmentation map was obtained as shown in Fig. 4c.

Fig. 3 Wheat ear image segmentation algorithm flow. Three segmentation categories are beneficial for the segmentation accuracy: soil background, stem and leaf, and wheat ear, which were realized in Python Scikit-learn library using the KMeans function

Data set construction
Seventy of the 490 images taken in Xuchang on May 20, 2019 and 50 of the 324 images taken in Yuanyang on May 15, 2019 were reserved for testing. The remaining 694 images were segmented into 160,784 small images to form the training and verification sets. The other collected images were used to measure the generalization ability of the method. After batch segmentation, two observations were made. First, under strong light, some wheat leaves reflected strongly and were mistaken for wheat ears. Second, a segmented image generally contained one, two, or three wheat ears, and more than three wheat ears in one image was rare. Therefore, to reduce complexity in the establishment of the CNN model, the recognition output was limited to four categories: non-wheat ear, one wheat ear, two wheat ears, and three wheat ears. After segmentation, non-wheat ear and one wheat ear images were the most numerous, whereas images with three or more wheat ears were scarce. Therefore, to keep the data set balanced, images of all four categories were selected from the segmented images and labeled. The numbers of non-wheat ear, one wheat ear, two wheat ears, and three wheat ears images were 1483, 4246, 1173, and 893, respectively. Some of these results are provided in Fig. 5.
To provide sufficient data for model training, 12,000 augmented images were produced for each of the four classes (non-wheat ear, one wheat ear, two wheat ears, and three wheat ears) by randomly cropping, flipping, rotating, and adjusting the brightness of the original images [29][30][31]. The expanded data set was divided into a training set and a test set, with 11,000 training images and 1000 test images per class.
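The augmentation operations listed above can be sketched as below. The specific ranges (a 90% crop window, rotation in 90° steps, a brightness gain between 0.8 and 1.2) are illustrative assumptions, since the text does not state the values used.

```python
import numpy as np

def augment(image: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Random crop, horizontal flip, 90-degree rotation, and brightness change."""
    h, w = image.shape[:2]
    crop_h, crop_w = int(h * 0.9), int(w * 0.9)   # assumed crop fraction
    top = rng.integers(0, h - crop_h + 1)
    left = rng.integers(0, w - crop_w + 1)
    out = image[top:top + crop_h, left:left + crop_w]
    if rng.random() < 0.5:
        out = out[:, ::-1]                        # horizontal flip
    out = np.rot90(out, k=int(rng.integers(0, 4)))
    gain = rng.uniform(0.8, 1.2)                  # assumed brightness range
    return np.clip(out.astype(np.float32) * gain, 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
patch = np.full((100, 100, 3), 128, dtype=np.uint8)  # stand-in for a segmented ear image
augmented = augment(patch, rng)
```

Calling `augment` repeatedly on the labeled images of one class until 12,000 samples exist is one way the smaller classes could be brought up to the same size as the larger ones.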

CNN model construction and recognition
Deep learning allows the neural network to learn data features by itself, providing a more abstract high-level representation by combining low-level features to describe the high-level attribute categories or features of the identified objects [32][33][34][35]. A large amount of data was available following clustering segmentation, and the segmented images fell into four types: non-wheat ear, one wheat ear, two wheat ears, and three wheat ears. The CNN model was established to train on and recognize the four categories of segmented images. Because clustering segmentation yields a large number of wheat ear images, the data set could be scaled effectively without feature engineering. Furthermore, the algorithm exhibited strong adaptability and was easily transferable.
The CNN model was composed of five convolution layers, five pooling layers, and two fully connected layers; the convolution layers used 3 × 3 kernels to extract features. The structure is shown in Fig. 6. The activation function was the Rectified Linear Unit (ReLU), and the softmax cross-entropy loss function was used to quantify the accuracy of the CNN method. Following model training, the images of the test set were segmented after image enhancement, color reversal, and clustering. The trained CNN model was loaded, and the segmented images were provided to the model for recognition and classification. The number of images assigned to each class was then recorded, and the quantities were added together to obtain the number of ears.
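The final counting step can be illustrated with a short sketch. It assumes that each classified patch contributes the number of ears named by its class (0, 1, 2, or 3), which is one reading of the summation described above; the class names are hypothetical labels.

```python
from collections import Counter

# Assumed mapping from predicted class to the number of ears it represents.
EARS_PER_CLASS = {
    "non_wheat_ear": 0,
    "one_wheat_ear": 1,
    "two_wheat_ears": 2,
    "three_wheat_ears": 3,
}

def count_ears(predicted_classes):
    """Return (total ear count, number of patches per class) for one image."""
    per_class = Counter(predicted_classes)
    total = sum(EARS_PER_CLASS[name] * n for name, n in per_class.items())
    return total, per_class

patches = (["one_wheat_ear"] * 5 + ["two_wheat_ears"] * 2
           + ["three_wheat_ears"] + ["non_wheat_ear"] * 4)
total, per_class = count_ears(patches)
print(total)  # 12
```

Because adhering ears are counted through the two-ear and three-ear classes rather than being split into individual objects, the total depends only on the classification of each patch.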

Statistical analysis
The Xuchang site test data set was divided into three parts (random test, different shooting times, and UAV shooting) using SPSS software (25.0, SPSS, IBM, Chicago, USA) (Table 2), and 120 images were used to evaluate the performance. Fifty images of 10 different cultivars in the Yuanyang site data set were used to evaluate the repeatability.
To evaluate the classification performance of the CNN model, the precision (P), recall (R), macro F1-score (F1,ma), and micro F1-score (F1,mi) were calculated, as is common for multi-class classification models [36,37]. In these metrics, TPi is the true positive count for class i, which denotes the number of images correctly classified as class i; FPi is the false positive count, which denotes the number of images incorrectly classified as class i; and FNi is the false negative count, which denotes the number of images of class i that are incorrectly classified as other classes. Pi and Ri are the precision and recall for class i, respectively, n is the number of classes (in this study, n = 4), and Pmi and Rmi are the precision and recall used for the micro F1-score, respectively.
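For reference, the standard forms of these metrics, written with the symbols defined above, are given below. They are stated as an assumption about the definitions used; in particular, the macro F1-score is taken here as the mean of the per-class F1-scores.

```latex
\begin{align}
P_i &= \frac{TP_i}{TP_i + FP_i}, &
R_i &= \frac{TP_i}{TP_i + FN_i}, \\
F_{1,ma} &= \frac{1}{n} \sum_{i=1}^{n} \frac{2 P_i R_i}{P_i + R_i}, \\
P_{mi} &= \frac{\sum_{i=1}^{n} TP_i}{\sum_{i=1}^{n} \left( TP_i + FP_i \right)}, &
R_{mi} &= \frac{\sum_{i=1}^{n} TP_i}{\sum_{i=1}^{n} \left( TP_i + FN_i \right)}, \\
F_{1,mi} &= \frac{2 P_{mi} R_{mi}}{P_{mi} + R_{mi}}.
\end{align}
```

When every image receives exactly one predicted class, the total number of false positives equals the total number of false negatives, so the micro precision, recall, and F1-score all coincide with the overall accuracy.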
In addition, R², RMSE, the relative root mean square error (RRMSE) [38], and bias were used to quantify the counting performance of the model.
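A sketch of how these counting metrics can be computed is shown below. Several conventions are assumptions: R² is taken as the squared Pearson correlation between manual and model counts, RRMSE is the RMSE divided by the mean manual count and expressed in percent, and bias is the mean of manual minus model counts, so that a negative bias corresponds to overestimation by the model.

```python
import math

def counting_metrics(manual, predicted):
    """Return R2, RMSE, RRMSE (%), and bias for paired ear counts."""
    n = len(manual)
    if n == 0 or n != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_m = sum(manual) / n
    mean_p = sum(predicted) / n
    rmse = math.sqrt(sum((p - m) ** 2 for m, p in zip(manual, predicted)) / n)
    rrmse = 100.0 * rmse / mean_m                 # relative to the mean manual count
    bias = sum(m - p for m, p in zip(manual, predicted)) / n
    cov = sum((m - mean_m) * (p - mean_p) for m, p in zip(manual, predicted))
    var_m = sum((m - mean_m) ** 2 for m in manual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    r2 = cov ** 2 / (var_m * var_p)               # squared Pearson correlation
    return {"r2": r2, "rmse": rmse, "rrmse": rrmse, "bias": bias}

# Illustrative counts only, not data from the study.
manual_counts = [100, 120, 140, 160]
model_counts = [102, 118, 143, 161]
metrics = counting_metrics(manual_counts, model_counts)
```

In this example the model counts one ear too many on average, which the assumed sign convention reports as a bias of −1.0.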

Results
The CNN framework was trained and tested in PyCharm (2019.3, JetBrains, Prague, Czech Republic) using the TensorFlow framework (TensorFlow 1.15, Google, California, USA) on a Windows 10 PC with an Intel Core i7 processor (3.6 GHz) and 16 GB RAM. In this paper, a 1400 × 1400 image was cropped from the center of the original image and then scaled to 700 × 700. After segmentation, the images of the four categories were uniformly scaled to 100 × 100. On this basis, the performance of the CNN machine learning method could be compared with the manual annotation and counting of the images.

Model accuracy evaluation
To assess the classification results, after 8000 epochs of training, we adopted indices of macro F1-score and micro F1-score calculated on a multiclass confusion matrix. The classification results obtained by the methods are shown in Fig. 7 and Table 3. Figure 7 lists the confusion matrix in detail, which calculates the statistics of the classified

Evaluation of performance of wheat ear images
Test images were preprocessed and clustered, and then each image was segmented, saved, and sent to the CNN model for recognition and counting to test the generalization ability of the model. The comparison between the ears detected in the wheat images and the manual counting results is shown in Fig. 8. The performances evaluated over the test data sets showed only a slight degradation in comparison with the test and different datasets, providing some confidence in the robustness of the K-means-CNN method (Fig. 8 and Table 4). The model-based identification of the wheat ears was in good agreement with the manual identification (Fig. 8a and Table 4). The K-means-CNN counts were highly correlated with the manual counts (R² = 0.96) and showed low data dispersion (Table 4).
However, identification performance degraded on some dates of the grain-filling stage (Table 4). The bias between the identified and the manual ear values ranged from 0.1 ears (May 20) to 11.60 ears (May 14) for Xuchang (Table 4). The poorer performance observed on May 6 (R² = 0.82, RMSE = 22.54) may be attributed to the early stage, when the wheat ear is not yet mature. Under these conditions, the contrast between the wheat ears and the stems and leaves is poor, whereas the characteristics are more obvious and easily identifiable in the later stages; thus, the best performance was observed on May 20 (R² = 0.99, RMSE = 3.24, Table 4). The results suggested that the images should be taken at the later grain-filling stage, around May 20. Our results are in good agreement with those of earlier studies [11].
To further evaluate the robustness of the proposed method, 20 UAV images not involved in training were used for verification. The relationship between the K-means-CNN model counts and the manual ear counts was positive and strong, with an R² of 0.97 and an RMSE of 9.47 ears. This result showed that the images collected by both the UAV and the hand-held devices achieved high recognition accuracy using the proposed method (Fig. 9). In addition, the bias for the UAV data set was −5.00, indicating a slight overestimation of the number of ears.

Repeatability across different cultivars
Fifty subsamples covering 10 different cultivars were selected at the Yuanyang site to evaluate the repeatability of the estimation when the images were taken under slightly different cultivation conditions. High consistency between the 10 cultivars was observed (Fig. 10), with the residuals showing a standard deviation of about 12.43 ears.
The performance of the algorithm was further tested using the 50 images. Manual counting was used as the validation data, as before. Table 5 provides the statistical summary results obtained for the Yuanyang plots. The results showed a decrease of up to 0.04 in R² while maintaining a similar correlation, and the bias between the identified and the manual ear values ranged from −15.40 ears (ZM32) to 18.80 ears (ZM158) for Yuanyang (Table 5). The R² value remained close to the Xuchang values for all but the ZM119 and ZM136 images, where the correlation values shifted slightly from the original values. The best performance was observed for XN511 (R² = 0.99, RMSE = 8.97, Table 5), and the lowest for ZM119 (R² = 0.81, RMSE = 10.14, Table 5). This suggested that the genotypes of the different cultivars slightly affect the identification results, and that images of more genotypes are needed in model training to achieve higher accuracy.

Evaluation of performances in field condition
Forty-eight subsamples of wheat ear images with a ground standard were selected at the Xuchang site to evaluate the accuracy and practicality of the proposed method. The model counts for the 48 samples were highly correlated with the measured counts in field conditions (Fig. 11).
The performance of the method was further tested using the 48 subsamples. A one square meter area was selected in the center of the image area for manual counting in field conditions, which was used as the test data. Figure 11 shows the results obtained. The residuals showed a standard deviation of about 23.96 ears/m², and the relationship between the method and the measured ear counts was positive and strong, with an R² of 0.91 and an RRMSE of 4.04%. The results showed a decrease in R², indicating a slight reduction in identification accuracy in field conditions. The reason may be related to the small number of wheat ears hidden under the stems and leaves during field counting.

Table 3 Quantitative comparison of the classification accuracy for different classes using the test data
The recognition accuracy of non-wheat ear, one wheat ear, two wheat ears, and three wheat ears all have higher precision
Class Precision (%) Recall (%) F1-score (%) Macro F1-score (%) Micro F1-score (%)

Discussion
The results showed that the number of wheat ears identified by K-means and CNN was consistent with the manual ear counting results (Fig. 8 and Table 4). The difference between the two methods (Fig. 8) indicated that the accuracy is poor in the earlier grain-filling stage.
The results of Alkhudaydi et al. [39] also suggested that this model performed well during the grain-filling stage. These results confirm that better-quality images can be obtained from the later grain-filling stage.
Our method is based on target localization. Adding a later stage would probably have led to a marginal improvement, as the ears in the grain-filling stage are a relatively homogeneous yellow color. Furthermore, the images were clustered into three groups in the K-means segmentation to avoid discarding regions where the contrast between the ear and the background is not great enough. The identification error caused by the adhesion of the wheat ear and background, as reported by Fernandez-Gallego et al. [11], was effectively reduced. In addition, the wheat ear images were divided into non-wheat ear, one wheat ear, two wheat ears, and three wheat ears, which could effectively reduce the identification inaccuracy caused by the adhesion of multiple wheat ears, a significant issue in traditional image processing methods [10]. Overall, the proposed K-means and CNN algorithm showed suitable performance in identifying wheat ears at early or later growth stages in all datasets (R² = 0.96, RMSE = 10.84 ears, Table 4), and similar outcomes were presented by Zhou [40]. Using K-means to segment the wheat ear features accurately before training the machine learning model improved both the model training efficiency and the recognition accuracy. Classifying the wheat ear images, instead of using target recognition, reduced the complexity of the algorithm and, together with the CNN model, made it possible to identify and count the wheat ears effectively and accurately.
Our work is useful for the development of a low-cost, rapid, and easy-to-implement method to identify wheat ears. We used images collected by the UAV platform to verify the training model of the mobile phone photo collection, which also achieved good results. However, determining the actual area represented in the photos still needs to be resolved. Current research mainly uses measures such as placing a reference substance as a ground standard [22] or fixing the shooting height [18], which reduces the practicability of the method. In the future, augmented reality (AR) technology could be used to solve this problem, which is one of our research aims.
It should be noted that different cultivars had a slight influence on the identification results. Although the training data of the CNN model were constructed only from images taken on May 14, 2019 and May 20, 2019, and thus the sample size of the training data set was not large, the model still achieved good recognition of the images collected on the other dates. In our opinion, the best shooting date is at the late grain-filling stage, when the wheat ears turn yellow and the stems and leaves are still green. In addition, we believe that using mobile devices to shoot images at a height of 1.5-2.2 m on sunny or cloudy days is a better way of shooting, as it matches the height of a person and facilitates the practical application of this method.

Conclusion
In this study, wheat ear images were collected using hand-held equipment, which is fast and convenient. Through K-means clustering segmentation, complete wheat ear images were automatically segmented, and automatic feature extraction of the wheat ear images was realized. The code can be found at https://github.com/xuxin468/earcouting.
The segmented images were divided into four types, and the CNN model was established to realize the recognition and counting of the wheat ear images. The R² was 0.96. The recognition accuracies of the non-wheat ear, one wheat ear, two wheat ears, and three wheat ears were 99.8, 97.5, 98.07, and 98.5%, respectively. The results showed that the recognition accuracy of the CNN model could be improved by using image processing technology to accurately locate and segment wheat ears before training and recognition, thus meeting the requirements of field-based wheat ear counting.
The present study has several improvements over previous studies: (1) K-means clustering was used to automatically and accurately segment the wheat ear, thus reducing the traditional workload of manual labeling and the associated human errors. (2) The wheat ear adhesion problem was resolved by creating four types of labeled datasets, including the non-wheat ear, single wheat ear, two wheat ears, and three wheat ears, which transforms the task of wheat ear recognition into the task of wheat ear image classification. (3) K-means was used to segment the wheat ear features accurately, and as a result, the efficiency and accuracy of the machine learning model was significantly improved.
The wheat ear recognition model based on CNN demonstrates strong generalization ability and robustness and can be applied to UAV platform as well. This paper combined automatic image processing and CNN methods, which is of great technical value for the recognition and counting of wheat ears in the field.
Our aim was to help reduce the cost of image acquisition and improve the application scope of this method. This method can be used to estimate wheat ear numbers and improve the efficiency of wheat yield estimation. At the same time, it can also provide breeders with a fast and automated high-throughput wheat ear counting system to improve breeding efficiency. Although this method is applied to the segmentation and counting of wheat ears, it can also be applied to the segmentation and counting of other plants. In future work, our aim is to use AR measurement technology, which can provide a ground standard for the images.

Fig. 11 Comparison between the wheat ears identified using the model with the corresponding values by manual counting in field condition. 48 samples are highly correlated with measurement counts in the field condition