Detection and analysis of wheat spikes using Convolutional Neural Networks

Background: Field phenotyping by remote sensing has received increased interest in recent years with the possibility of achieving high-throughput analysis of crop fields. Along with the various technological developments, the application of machine learning methods for image analysis has enhanced the potential for quantitative assessment of a multitude of crop traits. For wheat breeding purposes, assessing the production of wheat spikes, as the grain-bearing organ, is a useful proxy measure of grain production. Thus, being able to detect and characterize spikes from images of wheat fields is an essential component in a wheat breeding pipeline for the selection of high-yielding varieties.

Results: We have applied a deep learning approach to accurately detect, count and analyze wheat spikes for yield estimation. We have tested the approach on a set of images of a wheat field trial comprising 10 varieties subjected to three fertilizer treatments. The images were captured over one season, using high-definition RGB cameras mounted on a land-based imaging platform, and viewing the wheat plots from an oblique angle. A subset of in-field images has been accurately labeled by manually annotating all the spike regions. This annotated dataset, called SPIKE, is then used to train four region-based Convolutional Neural Networks (R-CNN) which take, as input, images of wheat plots, and accurately detect and count spike regions in each plot. The CNNs also output the spike density and a classification probability for each plot. Using the same R-CNN architecture, four different models were generated based on four different datasets of training and testing images captured at various growth stages. Despite the challenging field imaging conditions, e.g., variable illumination conditions, high spike occlusion, and complex background, the four R-CNN models achieve an average detection accuracy ranging from 88 to 94% across different sets of test images. The most robust R-CNN model, which achieved the highest accuracy, is then selected to study the variation in spike production over 10 wheat varieties and three treatments. The SPIKE dataset and the trained CNN are the main contributions of this paper.

Conclusion: With the availability of good training datasets such as the SPIKE dataset proposed in this article, deep learning techniques can achieve high accuracy in detecting and counting spikes from complex wheat field images. The proposed robust R-CNN model, which has been trained on spike images captured during different growth stages, is optimized for application to a wider variety of field scenarios. It accurately quantifies the differences in yield produced by the 10 varieties we have studied, and their respective responses to fertilizer treatment. We have also observed that the other R-CNN models exhibit more specialized performances. The dataset and the R-CNN model, which we make publicly available, have the potential to greatly benefit plant breeders by facilitating the high-throughput selection of high-yielding varieties.

Electronic supplementary material: The online version of this article (10.1186/s13007-018-0366-8) contains supplementary material, which is available to authorized users.


Faster R-CNN Architecture
Faster R-CNN consists of three main networks:
1. Convolutional Neural Network (CNN)
2. Region Proposal Network (RPN)
3. Classification Network
Figure S2.1 shows a schematic diagram of the Faster R-CNN architecture, composed of these three networks, used for spike detection.
The first few layers, initialized from a pre-trained network, constitute the CNN "head" network. The convolutional feature maps produced by this network are then passed to the RPN, which uses a series of convolutional and fully connected layers to produce promising regions that are likely to contain spikes, shown as bounding boxes (key factor 1 mentioned above). These identified regions are then used to crop out (via ROI pooling) the corresponding regions from the feature maps produced by the CNN, which are forwarded to a classification network to classify the spike region contained in each bounding box. Spikes with a high classification probability are considered detected, and the total number is counted for further processing. Below, we describe its main components in detail.

Pre-trained Network
The spike detection problem was evaluated with several pre-trained networks, listed below, to determine the best possible CNN.
• AlexNet [1], which has 5 shareable convolutional layers,
• VGG-16 [2], which has 13 shareable convolutional layers, and
• VGG-19 [2], which has 16 shareable convolutional layers.
Among them, VGG-16 is selected for our spike detection task since it achieves high object detection and recognition performance on the most commonly used open-access datasets [3,4]. VGG-16 is also known to perform well on tasks such as feature detection in complex scenarios, even when using small training sets [5]. The network takes pre-computed proposals from images, classifies them into object categories, and regresses a box around each of them. A minimal sketch of how the VGG-16 convolutional layers can serve as the shared head is shown below.
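The following sketch illustrates how the 13 convolutional layers of a pre-trained VGG-16 can be used as the shared head network. The use of torchvision and the exact weight set are assumptions made for illustration; the authors' implementation may load the pre-trained weights differently.

```python
# Sketch: the convolutional part of a pre-trained VGG-16 as the shared "head".
# torchvision is assumed here purely for illustration.
import torch
import torchvision

vgg16 = torchvision.models.vgg16(weights="IMAGENET1K_V1")

# Keep only the convolutional layers (conv1_1 ... conv5_3 with ReLU/pooling),
# dropping the final max-pool so the feature map keeps a stride of 16.
backbone = torch.nn.Sequential(*list(vgg16.features.children())[:-1])

with torch.no_grad():
    image = torch.randn(1, 3, 800, 600)   # dummy RGB plot image
    feature_map = backbone(image)          # shape: (1, 512, 50, 37)
print(feature_map.shape)
```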

Anchor Generation Layer
Anchors are reference bounding boxes that play an important role in the RPN. An anchor has a center, which is also the center of the sliding window, a scale, and an aspect ratio. The anchor generation layer produces a set of bounding boxes of varying sizes and aspect ratios spread over the entire input image. In our experiments, we define a set of 9 anchors at each position (x_i, y_i) of an image by setting three scales and three aspect ratios, as shown in Figure S2.3. A short sketch of this construction follows.
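The sketch below generates the 9 anchors (3 scales × 3 aspect ratios) centred at one position. The scales and ratios follow the values used in this work (see the region proposal layer section); the base size of 16, corresponding to the feature stride of VGG-16, is an assumption for illustration.

```python
# Sketch: 9 anchors (3 scales x 3 aspect ratios) centred at one position.
import numpy as np

def anchors_at(cx, cy, base_size=16, scales=(2, 4, 8), ratios=(0.5, 1.0, 2.0)):
    """Return a (9, 4) array of [x1, y1, x2, y2] anchors centred at (cx, cy)."""
    boxes = []
    for scale in scales:
        for ratio in ratios:
            area = (base_size * scale) ** 2
            w = np.sqrt(area / ratio)      # ratio = h / w, so ratio < 1 gives a wide box
            h = w * ratio
            boxes.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(boxes)

print(anchors_at(100, 100).shape)          # (9, 4)
```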

Region Proposal Layer
A Region Proposal Network takes an input image, which can be of various sizes, and returns for various regions of the image a score corresponding to the probability that the region is a spike (foreground) or not a spike, e.g., canopy or soil (background). The region proposal layer transforms the anchors according to the bounding box regression coefficients to generate transformed anchors, which are then filtered by applying a non-maximum suppression threshold to the probability of an anchor being a foreground region. The anchors generated earlier are forwarded to the region proposal layer as shown in Figure S2.2. In this experiment, considering the small spike regions in an image, the anchor proposal scales are set to 2, 4 and 8, and the aspect ratios to 0.5, 1 and 2. A sketch of the anchor transformation is given below.
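The following numpy sketch shows how anchors could be transformed by the predicted regression coefficients (t_x, t_y, t_w, t_h); it inverts the parameterization given later in Equation (5) and is for illustration only.

```python
# Sketch: transforming anchors with predicted regression coefficients,
# as done in the region proposal layer.
import numpy as np

def apply_deltas(anchors, deltas):
    """anchors, deltas: (N, 4) arrays; anchors are [x1, y1, x2, y2]."""
    w  = anchors[:, 2] - anchors[:, 0]
    h  = anchors[:, 3] - anchors[:, 1]
    cx = anchors[:, 0] + 0.5 * w
    cy = anchors[:, 1] + 0.5 * h

    # invert the (t_x, t_y, t_w, t_h) parameterization
    pred_cx = deltas[:, 0] * w + cx
    pred_cy = deltas[:, 1] * h + cy
    pred_w  = np.exp(deltas[:, 2]) * w
    pred_h  = np.exp(deltas[:, 3]) * h

    return np.stack([pred_cx - 0.5 * pred_w, pred_cy - 0.5 * pred_h,
                     pred_cx + 0.5 * pred_w, pred_cy + 0.5 * pred_h], axis=1)

anchors = np.array([[100., 100., 164., 132.]])
deltas  = np.array([[0.1, -0.2, 0.05, 0.0]])
print(apply_deltas(anchors, deltas))
```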
The RPN is constructed with an n × n convolutional layer followed by two 1 × 1 convolutional layers (for reg and cls). This is explained in detail in the Faster R-CNN training section. The goal of the anchor target layer is to produce a set of promising anchors with the corresponding foreground/background labels and regression coefficients to train the RPN. Promising foreground anchors are those whose overlap with any ground-truth bounding box is higher than a threshold. In our experiment, for each anchor the label p*_i is computed from its overlap with the ground-truth bounding boxes as follows: p*_i = 1 if the anchor's IoU with a ground-truth box exceeds 0.7 (foreground), p*_i = 0 if its IoU with every ground-truth box is below 0.3 (background), and the anchor is ignored otherwise (the thresholds are those given in the training section). Here, IoU is Intersection over Union, a commonly used overlap measure [6] based on pixel regions, defined for two regions A and B as IoU(A, B) = |A ∩ B| / |A ∪ B|. The goal of the region proposal layer in the RPN is to generate good bounding boxes. To do so from a set of anchor boxes, the RPN must learn to classify an anchor box as background or foreground and to calculate the regression coefficients that modify the position, width and height of a foreground anchor box to make it a "better" foreground box, i.e., one that fits a foreground object more closely. The anchor target layer also generates a set of bounding box regression targets as output; each target measures how far an anchor is from the closest foreground ground-truth bounding box. The process of calculating the bounding box regression loss and classification loss that form the RPN loss is described later in the loss function subsection. A small sketch of the overlap measure and labelling rule is given below.
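The sketch below implements the IoU overlap measure and the resulting anchor labelling rule. The 0.7 and 0.3 thresholds are taken from the training settings described below; the function and variable names are illustrative.

```python
# Sketch: IoU overlap measure and the anchor labelling rule it drives.
import numpy as np

def iou(box_a, box_b):
    """Intersection over Union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def label_anchor(anchor, gt_boxes, fg_thresh=0.7, bg_thresh=0.3):
    """Return 1 (foreground), 0 (background) or -1 (ignored) for one anchor."""
    best = max(iou(anchor, gt) for gt in gt_boxes)
    if best >= fg_thresh:
        return 1
    if best < bg_thresh:
        return 0
    return -1

print(label_anchor([10, 10, 50, 50], [[12, 12, 52, 48], [200, 200, 240, 240]]))
```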
To generate region proposals of spikes, we slide a window over the convolutional feature map generated by the last shared (fc6) layer of the CNN, followed by a ReLU. This layer takes as input an n × n window of the convolutional feature map, which is run through a 1 × 1 kernel convolutional layer and mapped to a lower-dimensional feature. This feature is fed into two sibling layers: a bounding-box regression layer and a classification layer. Two 1 × 1 kernels per anchor are applied to estimate the class probabilities of being a spike or not, and four kernels per anchor are applied to generate the bounding-box regression outputs.
For each sliding-window location, we predict multiple spike region proposals for the proposal target layer, with the maximum number of proposals per location denoted as k. The regression layer therefore has 4k outputs encoding the coordinates of the k boxes, and the classification layer, which is a two-class softmax layer, outputs 2k scores estimating the probability of spike or non-spike for each proposal. The scores are parameterized relative to k reference boxes, known as anchors. We fine-tuned a pre-trained Faster R-CNN with an RPN foreground fraction of 0.5, to favor the maximum number of foreground examples, along with a batch size of 256. The batch size defines the number of samples that are propagated through the network. To reduce the number of overlapping bounding boxes, we applied a non-maximum suppression threshold of 0.7 to the RPN proposals and kept the top 2000 boxes as RPN proposals; these values were determined after several trials to obtain better accuracy. A sketch of the RPN head is given below.
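The following PyTorch sketch shows the RPN head described above: an n × n convolution followed by two 1 × 1 convolutions producing 2k classification scores and 4k regression outputs per position (k = 9 anchors here). The module and parameter names are illustrative, not those of the authors' implementation.

```python
# Sketch of the RPN head: n x n conv, then 1 x 1 convs for cls (2k) and reg (4k).
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    def __init__(self, in_channels=512, k=9, n=3):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 512, kernel_size=n, padding=n // 2)
        self.cls  = nn.Conv2d(512, 2 * k, kernel_size=1)   # spike / non-spike scores
        self.reg  = nn.Conv2d(512, 4 * k, kernel_size=1)   # box regression coefficients

    def forward(self, feature_map):
        x = torch.relu(self.conv(feature_map))
        return self.cls(x), self.reg(x)

scores, deltas = RPNHead()(torch.randn(1, 512, 125, 95))
print(scores.shape, deltas.shape)   # (1, 18, 125, 95) (1, 36, 125, 95)
```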

ROI Pooling Layer
The Fast R-CNN detector places an ROI pooling layer after the final convolution. It takes a mini-batch of region proposals as input. After the RPN, the proposed regions correspond to feature maps of different sizes. Region-of-interest pooling standardizes these by converting each region into a feature map of the same fixed size. The pooling layer takes the ROI boxes output by the proposal target layer and the convolutional feature maps output by the CNN, and outputs square feature maps. These feature maps are then fed into the fully connected layer (a 7 × 7 pooled region feeding fc6 in VGG-16). The result is a one-dimensional feature vector for each ROI, as illustrated below.
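A minimal sketch of this step, assuming torchvision's roi_pool operator purely for illustration: each proposal is pooled to a fixed 7 × 7 grid on the stride-16 feature map and flattened to one feature vector per ROI.

```python
# Sketch: ROI pooling of variable-sized proposals to fixed 7 x 7 feature maps.
import torch
from torchvision.ops import roi_pool

feature_map = torch.randn(1, 512, 125, 95)                   # output of the CNN head
proposals = torch.tensor([[0, 160.0,  80.0, 320.0, 240.0],   # [batch_idx, x1, y1, x2, y2]
                          [0, 400.0, 300.0, 520.0, 460.0]])  # in image coordinates

pooled = roi_pool(feature_map, proposals, output_size=(7, 7), spatial_scale=1 / 16)
flat = pooled.flatten(start_dim=1)            # one feature vector per ROI
print(pooled.shape, flat.shape)               # (2, 512, 7, 7) (2, 25088)
```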

Classification Layer
The classification layer takes the output feature maps produced by the ROI pooling layer and passes them through a series of convolutional layers. The output is fed through two fully connected layers: the first produces the class probability distribution for each region proposal, and the second produces a set of class-specific bounding box regressors. For example, an input image of size 2000 × 1500 shrinks by a factor of 16 to a 125 × 95 feature map after passing through the CNN. Every position on this map carries 9 anchors, each classified as either spike or non-spike. The feature maps and classification labels are contained in a feature vector called the logit, which is passed to a softmax layer and a regression layer to classify the regions and predict the corresponding bounding boxes, respectively. Similar to the RPN loss, the classification loss is the metric minimized during optimization to train the classification network. During backpropagation, the error gradients also flow to the RPN, so training the classification layer modifies the weights of the RPN network as well. The process of calculating the bounding box regression loss and classification loss in this layer is described later in the loss function subsection. A sketch of this classification head is given below.
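The sketch below shows a possible layout of the classification network: the pooled 7 × 7 × 512 features pass through two fully connected layers (fc6 and fc7 in VGG-16), then branch into a two-way classifier (spike vs. background) and class-specific box regressors. Layer sizes follow VGG-16; the class and module names are illustrative.

```python
# Sketch of the classification head on top of the pooled ROI features.
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    def __init__(self, num_classes=2):                      # spike + background
        super().__init__()
        self.fc6 = nn.Linear(512 * 7 * 7, 4096)
        self.fc7 = nn.Linear(4096, 4096)
        self.cls_score = nn.Linear(4096, num_classes)        # fed to a softmax
        self.bbox_pred = nn.Linear(4096, 4 * num_classes)    # per-class box regressors

    def forward(self, pooled_rois):                          # (R, 512, 7, 7)
        x = torch.relu(self.fc6(pooled_rois.flatten(1)))
        x = torch.relu(self.fc7(x))
        return self.cls_score(x), self.bbox_pred(x)

logits, boxes = DetectionHead()(torch.randn(4, 512, 7, 7))
print(logits.shape, boxes.shape)    # (4, 2) (4, 8)
```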

Training Faster R-CNN
The RPN training minibatch consists of 256 proposals from one image with a 1 : 1 ratio of positives (IoU with a ground-truth box larger than 0.7) to negatives (IoU smaller than 0.3). The Fast R-CNN training minibatch has 128 proposals with a 5 : 1 ratio of foreground (IoU with a ground-truth box of the same class larger than 0.2) to background (IoU with any foreground ground-truth box smaller than 0.2); these thresholds were chosen after several trials to achieve better detection accuracy and mAP. During the implementation, we considered three anchor ratios, namely 1 : 1, 1 : 2 and 2 : 1. Around 12,000 anchors are generated during training, but by fixing the NMS threshold to 0.7 and keeping the top 2000 proposals we reduce these to approximately 2000 proposals used to train the R-CNN network. For testing, the 600 top-scoring boxes are kept after applying an NMS threshold of 0.8, chosen after different trials to suit our approach; these boxes are then used to draw the bounding boxes around the detected objects. Training the overall network requires estimating the weights that minimize the loss function [6]. Faster R-CNN minimizes a weighted sum of four loss functions: (1) a softmax loss for the presence/absence of a foreground object in the RPN, (2) a smooth L1 loss for anchor coordinate regression in the RPN, (3) a softmax loss for the classification of region proposals into object classes, and (4) a bounding-box regression loss on the coordinates to refine the final bounding boxes of the detected objects. A small sketch of this weighted sum is shown below.
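As a minimal sketch, the four losses can be combined as below; the gamma weights (3.0 for the RPN, 1.0 for the detector) come from the loss-function subsection that follows, and the individual loss terms are assumed to be computed elsewhere.

```python
# Sketch: weighted sum of the four Faster R-CNN losses listed above.
def faster_rcnn_loss(rpn_cls_loss, rpn_reg_loss, det_cls_loss, det_reg_loss,
                     gamma_rpn=3.0, gamma_det=1.0):
    rpn_loss = rpn_cls_loss + gamma_rpn * rpn_reg_loss   # RPN: cls + weighted reg
    det_loss = det_cls_loss + gamma_det * det_reg_loss   # detector: cls + weighted reg
    return rpn_loss + det_loss
```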

Loss Function for Classification and B-Box Regression
The spatial features extracted after the convolution layers are fed to a network with two tasks, classification (cls) and regression (reg), by calculating a loss function in each epoch which is back-propagated to adjust the final weights of the R-CNN output layer until a reasonably low loss is reached. The loss function measures the compatibility between the predicted and ground-truth labels and boxes, and takes the form

L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p*_i) + γ (1/N_reg) Σ_i p*_i L_reg(t_i, t*_i).

Here, i is an anchor's index, p_i is the probability of an anchor being an object, and p*_i is the ground-truth label of the anchor, set to 1 for a positive anchor and 0 for a negative anchor. Also, t_i represents the predicted bounding box coordinates, t*_i is the ground-truth box associated with a positive anchor, and L_cls is the classification log loss over two classes (object or non-object). The two terms are normalized by N_cls and N_reg and weighted by a balancing parameter γ, set to 3.0 for the RPN and 1.0 for the detector, with the smooth L1 loss specified in [7]. Here, L_cls(p_i, p*_i) = −log(p_i) is the log loss for the true class p*_i. The regression loss is applied as defined in [7]:

L_reg(t_i, t*_i) = R(t_i − t*_i),

where R is the robust loss function (smooth L1), and p_i and t_i are, respectively, the outputs of the cls and reg layers. The output of the regressor determines a refined bounding box as the final output. The output of the classification sub-network is a probability p indicating whether the predicted box contains an object (1 for a foreground object) or belongs to the background (0 for no object). For bounding box regression in the RPN, we adopt the parameterization of the four coordinates of the ground-truth box t*_i = [t*_x, t*_y, t*_w, t*_h] following the approach of [8]. Figure S2.4 shows the ground-truth box and anchor properties. Their mathematical formulation is given by Equation (5) below:

t_x = (x_i − x_a)/w_a,   t_y = (y_i − y_a)/h_a,   t_w = log(w_i/w_a),   t_h = log(h_i/h_a),
t*_x = (x_bi − x_a)/w_a,   t*_y = (y_bi − y_a)/h_a,   t*_w = log(w_bi/w_a),   t*_h = log(h_bi/h_a),   (5)

where (x_a, y_a, w_a, h_a) denote the anchor's center coordinates, width and height, (x_i, y_i, w_i, h_i) the predicted box, and (x_bi, y_bi, w_bi, h_bi) the ground-truth box.

We use the publicly available Python implementation of Faster R-CNN, which implements joint approximate training as shown in Figure S2.1. During training, images are horizontally flipped with probability 0.5 to augment the training data; no other data augmentation is used. We used an initial learning rate of 10^−3, which is then reduced by a factor of 10 every 200 epochs, and training was performed for 400 epochs. The initial weights of the CNN are adjusted with 5.0 × 10^−4, and learning uses Stochastic Gradient Descent with momentum, with the momentum value set to 0.9. This allows Faster R-CNN to be trained without running into problems such as symmetry during backpropagation. A learning rate multiplier of 0.2 is used for all biases in the network; this essentially trains the biases with twice the current learning rate.
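The sketch below writes out the smooth L1 loss R and the box parameterization of Equation (5) in numpy, for illustration. Here (x_a, y_a, w_a, h_a) is an anchor and (x_b, y_b, w_b, h_b) its matched ground-truth box, both in centre/size form; the example values are hypothetical.

```python
# Sketch: smooth L1 regression loss and the box target encoding of Equation (5).
import numpy as np

def smooth_l1(x):
    """Robust loss R used for bounding-box regression."""
    x = np.abs(x)
    return np.where(x < 1.0, 0.5 * x ** 2, x - 0.5)

def encode_targets(anchor, gt_box):
    """Return (t*_x, t*_y, t*_w, t*_h) for an anchor and its ground-truth box."""
    x_a, y_a, w_a, h_a = anchor
    x_b, y_b, w_b, h_b = gt_box
    return np.array([(x_b - x_a) / w_a,
                     (y_b - y_a) / h_a,
                     np.log(w_b / w_a),
                     np.log(h_b / h_a)])

# Regression loss for one positive anchor: sum of smooth L1 over the 4 coordinates.
t_pred = np.array([0.1, -0.2, 0.05, 0.0])                 # output of the reg layer
t_star = encode_targets((100, 100, 64, 32), (110, 96, 70, 30))
print(smooth_l1(t_pred - t_star).sum())
```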