
DeepCob: precise and high-throughput analysis of maize cob geometry using deep learning with an application in genebank phenomics



Maize cobs are an important component of crop yield that exhibit a high diversity in size, shape and color in native landraces and modern varieties. Various phenotyping approaches were developed to measure maize cob parameters in a high throughput fashion. More recently, deep learning methods like convolutional neural networks (CNNs) became available and were shown to be highly useful for high-throughput plant phenotyping. We aimed at comparing classical image segmentation with deep learning methods for maize cob image segmentation and phenotyping using a large image dataset of native maize landrace diversity from Peru.


Comparison of three image analysis methods showed that a Mask R-CNN trained on a diverse set of maize cob images was highly superior to classical image analysis using the Felzenszwalb-Huttenlocher algorithm and a Window-based CNN due to its robustness to image quality and object segmentation accuracy (\(r=0.99\)). We integrated Mask R-CNN into a high-throughput pipeline to segment both maize cobs and rulers in images and perform an automated quantitative analysis of eight phenotypic traits, including diameter, length, ellipticity, asymmetry, aspect ratio and average values of red, green and blue color channels for cob color. Statistical analysis identified key training parameters for efficient iterative model updating. We also show that a small number of 10–20 images is sufficient to update the initial Mask R-CNN model to process new types of cob images. To demonstrate an application of the pipeline we analyzed phenotypic variation in 19,867 maize cobs extracted from 3449 images of 2484 accessions from the maize genebank of Peru to identify phenotypically homogeneous and heterogeneous genebank accessions using multivariate clustering.


A single Mask R-CNN model and the associated analysis pipeline are widely applicable tools for maize cob phenotyping in contexts such as genebank phenomics or plant breeding.


High-throughput precision phenotyping of plant traits is rapidly becoming an integral part of plant research, plant breeding, and crop production [4]. This development complements the rapid advances in genomic methods that, when combined with phenotyping, enable rapid, accurate, and efficient analysis of plant traits and the interaction of plants with their environment [65]. However, for many traits of interest, plant phenotyping is still labor intensive or technically challenging. Such a bottleneck in phenotyping [17] limits progress in understanding the relationship between genotype and phenotype, which is a problem for plant breeding [24]. The phenotyping bottleneck is being addressed by phenomics platforms that integrate high-throughput automated phenotyping with analysis software to obtain accurate measurements of phenotypic traits [28, 46]. Existing phenomics platforms cover multiple spatial and temporal scales and incorporate technologies such as RGB image analysis, near-infrared spectroscopy (NIRS), or NMR spectroscopy [31, 32, 60]. The rapid and large-scale generation of diverse phenotypic data requires automated analysis to convert the output of phenotyping platforms into meaningful information such as measures of biological quantities [11, 22]. Thus, high-throughput pipelines with accurate computational analysis will realize the potential of plant phenomics by overcoming the phenotyping bottleneck.

A widely used method for plant phenotyping is image segmentation and shape analysis using geometric morphometrics [70]. Images are captured in standardized environments and then analyzed either manually or automatically using image annotation methods to segment images and label objects. The key challenge in automated image analysis is the detection and segmentation of relevant objects. Traditionally, object detection in computer vision (CV) has been performed using multivariate algorithms that, for example, detect edges. Most existing pipelines using classical image analysis in plant phenotyping are species-dependent and assume homogeneous plant material and standardized images [40, 45, 68]. Another disadvantage of classical image analysis methods is low accuracy and specificity when image quality is low or background noise is present. Therefore, the optimal parameters for image segmentation often need to be fine-tuned manually through experimentation. In recent years, machine learning approaches have revolutionized many areas of CV such as object recognition [37] and are superior to classical CV methods in many applications [48]. The success of machine learning in image analysis can be attributed to the evolution of neural networks from simple architectures to advanced feature-extracting convolutional neural networks (CNNs) [64]. The complexity of CNNs could be exploited because deep learning algorithms offered new and improved training approaches for these more complex networks. Another advantage of machine learning methods is their robustness to variable image backgrounds and image qualities when model training is based on a sufficiently diverse set of training images. Through their capability to learn from small training datasets, these deep learning techniques have a huge potential to carry out few-shot learning in agriculture, thereby saving work effort and costs in generating large real-world training datasets [5, 67]. Although CNNs have been very successful in general image classification and segmentation, their application in plant phenotyping is still limited to a few species and features. Current applications include plant pathogen detection, organ and feature quantification, and phenological analysis [16, 31, 62].

Maize cobs can be described with a few geometric shape and color parameters. Since the size and shape of maize cobs are important yield components with a high heritability and are correlated with total yield [43, 53], they are potentially useful traits for selection in breeding programs. High throughput phenotyping approaches are also useful for characterizing native diversity of crop plants to facilitate their conservation or utilize them as genetic resources [41, 47]. Maize is an excellent example to demonstrate the usefulness of high throughput phenotyping because of its high genetic and phenotypic diversity, which has arisen since its domestication in South-Central Mexico about 9,000 years ago [27, 34, 42]. A high environmental variation within its cultivation range in combination with artificial selection by humans resulted in many phenotypically divergent landraces [8, 69]. Since maize is one of the most important crops worldwide, large collections of its native diversity were established in ex situ genebanks, whose genetic and phenotypic diversity are now being characterized [56]. This unique pool of genetic and phenotypic variation is threatened by genetic erosion [23, 49, 50, 51], and understanding its role in environmental and agronomic adaptation is essential to identify valuable genetic resources and develop targeted conservation strategies.

In the context of native maize diversity we demonstrate the usefulness of a CNN-based deep learning model implemented in a robust and widely applicable analysis pipeline for recognition, semantic labeling and automated measurement of maize cobs in RGB images for large scale plant phenotyping. Highly variable traits like cob length, kernel color and kernel number were used for classification of the native maize diversity of Peru [52] and are useful for the characterization of maize genetic resources because cobs are easily stored and field collections can be analyzed at a later time point. We demonstrate the application of image segmentation to photographs of native maize diversity in Peru. So far, cob traits have been studied only for small sets of Peruvian landraces, such as cob diameter in 96 accessions of 12 Peruvian maize landraces [2], or cob diameter in 59 accessions of 9 highland landraces [49, 50]. Here we use automated image analysis to obtain cob parameters from 2,484 accessions of the Peruvian maize genebank hosted at Universidad Nacional Agraria La Molina (UNALM). We also show that the DeepCob image analysis pipeline can be easily expanded to different image types of maize cobs such as segregating populations resulting from genetic crosses.


Comparison of image segmentation methods

To address large-scale segmentation of maize cobs, we compared three different image analysis methods for their specificity and accuracy in detecting and segmenting both maize cobs and measurement rulers in RGB images. Correlations between true and derived values for cob length and diameter show that Mask R-CNN far outperformed the classical Felzenszwalb-Huttenlocher image segmentation algorithm and a window-based CNN (Window-CNN) (Fig. 1). For two sets of old (ImgOld) and new (ImgNew) maize cob images (see Materials and Methods), Mask R-CNN achieved correlations of 0.99 and 1.00, respectively, while correlation coefficients ranged from 0.14 to 0.93 with Felzenszwalb-Huttenlocher segmentation and from 0.03 to 0.42 with Window-CNN. Since Mask R-CNN was strongly superior in accuracy to the other two segmentation methods, we restricted all further analyses to this method.
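The accuracy comparison above rests on Pearson correlations between manually measured ("true") and image-derived values. The following is a minimal sketch of that computation in plain Python. The function name `pearson_r` and the example measurements are hypothetical and serve only as an illustration; they are not data from the study.

```python
import math


def pearson_r(true_values, estimated_values):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(true_values) != len(estimated_values) or len(true_values) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(true_values)
    mean_t = sum(true_values) / n
    mean_e = sum(estimated_values) / n
    # Sum of cross-products and sums of squares around the means.
    cov = sum((t - mean_t) * (e - mean_e)
              for t, e in zip(true_values, estimated_values))
    ss_t = sum((t - mean_t) ** 2 for t in true_values)
    ss_e = sum((e - mean_e) ** 2 for e in estimated_values)
    if ss_t == 0 or ss_e == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(ss_t * ss_e)


# Hypothetical per-image mean cob lengths in cm (illustrative values only).
true_length = [12.1, 15.4, 9.8, 18.0, 13.3]
estimated_length = [12.0, 15.6, 9.9, 17.8, 13.5]
r = pearson_r(true_length, estimated_length)
print(f"r = {r:.3f}")
```

A value of r close to 1 indicates that the estimated values track the true values almost linearly. It does not by itself rule out a constant offset or scaling error, so a scatter plot such as Fig. 1 remains a useful complement.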

Fig. 1

Pearson correlation between true and estimated cob length and diameter for three image segmentation methods (Felzenszwalb-Huttenlocher segmentation, Window-CNN, Mask R-CNN). True (x-axis) and estimated (y-axis) mean cob length (a, c) and diameter (b, d) per image are shown for each approach, split by dataset (ImgOld and ImgNew). In all cases, Mask R-CNN achieves the highest correlation of at least 0.99 with the true values

Parameter optimization of Mask R-CNN

We first describe parameter optimizations during training of the Mask R-CNN model based on the old (ImgOld) and new (ImgNew) maize cob image data from the Peruvian maize genebank. A total of 90 models were trained, differing by the parameters learning rate, total epochs, epochs.m, mask loss weight, monitor, and minimask (see Materials and Methods), using a small (200) and a large (1,000) set of randomly selected images as training data. The accuracy of Mask R-CNN detection depends strongly on model parameters, as \(AP@\)[0.5:0.95] values ranged from 5.57 to 86.74 for models trained on 200 images and from 10.49 to 84.31 for models trained on 1,000 images (Additional file 1: Table S1). Among all 90 models, M104 was the best model for maize cob and ruler segmentation with a score of 86.74, followed by models M101, M107, and M124 with scores of 86.56. All four models were trained with the small image dataset.
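To make the \(AP@\)[0.5:0.95] score more concrete, the sketch below assumes the common COCO convention, in which average precision is computed at ten intersection-over-union (IoU) thresholds from 0.50 to 0.95 in steps of 0.05 and then averaged. Whether the study's evaluation follows this convention in every detail is an assumption. The helper names and the toy masks are hypothetical.

```python
def mask_iou(mask_a, mask_b):
    """IoU of two binary masks given as equally sized nested lists of 0/1."""
    intersection = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            intersection += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    return intersection / union if union else 0.0


# The ten IoU thresholds 0.50, 0.55, ..., 0.95 (COCO convention, assumed here).
IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]


def mean_ap_over_thresholds(ap_per_threshold):
    """Average per-threshold AP values, one for each entry of IOU_THRESHOLDS."""
    if len(ap_per_threshold) != len(IOU_THRESHOLDS):
        raise ValueError("expected one AP value per IoU threshold")
    return sum(ap_per_threshold) / len(ap_per_threshold)


# Toy example: the predicted mask misses the bottom row of the true cob mask.
predicted = [[0, 1, 1, 0],
             [0, 1, 1, 0],
             [0, 1, 1, 0],
             [0, 0, 0, 0]]
truth = [[0, 1, 1, 0],
         [0, 1, 1, 0],
         [0, 1, 1, 0],
         [0, 1, 1, 0]]
iou = mask_iou(predicted, truth)  # 6 shared pixels / 8 pixels in the union
passed = [t for t in IOU_THRESHOLDS if iou >= t]
print(iou, passed)
```

In this toy case the prediction counts as a correct detection at the six thresholds up to 0.75 and as a miss at the four stricter ones. This is why small mask inaccuracies lower the averaged score even when every object is found.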

Given the high variation of the scores, we evaluated the contribution of each training parameter to this variation with an ANOVA (Table 1). There is an interaction effect between the size of the training set and the total number of epochs trained, as well as an effect of a minimask, which is often used as a resizing step of the object mask before fitting it to the deep learning model. The other training parameters learning rate, monitoring, epochs.m (mode to train only heads or all layers), and mask loss weight had no effect on the \(AP@\)[0.5:0.95] value. The lsmeans show that training without minimask leads to higher scores and more accurate object detection. Regarding the interaction between training set size and total number of epochs (Table 1), model training with 200 images over 200 epochs was not significantly different from training over 50 epochs or from model training with 1,000 images over 200 epochs at \(p<0.05\). With the same number of training epochs, we did not observe an advantage of 1,000 over 200 training images. In contrast, model training over only 15 epochs resulted in lower \(AP@\)[0.5:0.95] values.

Table 1 Lsmeans of AP@[.5:.95] in the ANOVA analysis for Mask R-CNN model parameters minimask and the interaction of training set size x total number of epochs

Loss behavior of Mask R-CNN during model training

Monitoring loss functions of model components (classes, masks, boxes) during model training identifies components that need further adjustments to achieve full optimization. Compared to the other components, mask loss contributed the highest proportion to all losses (Fig. 2), which indicates that the most challenging process in model training and optimization is segmentation by creating masks for cobs and rulers. The training run with the parameter combination generating the best model M104 in epoch 95 shows a decreasing training and validation loss during the first 100 epochs and a tendency for overfitting in additional epochs (Fig. 2a, b). This suggests that model training on the Peruvian maize images over 100 epochs is sufficient. Other parameter combinations like M109 (Fig. 2c) exhibit overfitting with a tenfold higher validation loss than M104. Instead of learning patterns, the model memorizes training data, which increases the validation loss and results in weak predictions for object detection and image segmentation.
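The overfitting pattern described above, with training loss still falling while validation loss rises, can be checked programmatically from logged loss curves. The sketch below is a simplified illustration. It assumes that a checkpoint is chosen by the lowest validation loss, which may differ from the selection criterion used in the study, where the "monitor" training parameter may select differently. The loss values are hypothetical.

```python
def best_epoch(validation_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    if not validation_losses:
        raise ValueError("no validation losses recorded")
    best_index = min(range(len(validation_losses)),
                     key=validation_losses.__getitem__)
    return best_index + 1


def generalization_gap(training_losses, validation_losses):
    """Per-epoch difference (validation - training); a growing positive gap
    after the best epoch is a typical sign of overfitting."""
    return [v - t for t, v in zip(training_losses, validation_losses)]


# Hypothetical loss curves for six epochs (illustrative values only).
train_loss = [1.20, 0.80, 0.55, 0.40, 0.30, 0.22]
val_loss = [1.25, 0.90, 0.70, 0.65, 0.72, 0.85]

epoch = best_epoch(val_loss)
gap = generalization_gap(train_loss, val_loss)
print(epoch, [round(g, 2) for g in gap])
```

Here the validation loss reaches its minimum in epoch 4 and the gap widens afterwards, so additional epochs would mainly fit noise in the training data.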

Fig. 2

Mask R-CNN training and validation losses during training for 200 epochs on ImgOld and ImgNew maize cob images from the Peruvian genebank. a Loss curves leading to the best model M104 in epoch 95. c With the same scaling on the y-axis, parameter combination M109 shows substantial overfitting as indicated by much higher validation losses resulting in an inferior model based on \(AP@\)[.5:.95]. b Loss curves of M104 with a zoomed scale on the y-axis, highlighting the mask loss as highest contributor to overall loss, indicating that masks are most difficult to optimize. Other losses, like class loss or bounding box loss, are of minor importance

Visualization of feature maps generated by Mask R-CNN

Although neural networks are considered a "black box" method, a feature map visualization of selected layers shows interpretable features of trained networks. In a feature map, high activations correspond to high feature recognition activity in that area, as shown in Fig. 3A for the best model M104. Over several successive CNN layers, the cob shape is increasingly well detected until, in the last layer (res4a), the feature map indicates a robust distinction between the foreground, containing the cob and ruler objects, and the background. High activations occur at the top of the cobs (Fig. 3A, res4g layer), which may contribute to localization. Because the cobs were oriented according to their lower (apical) end in the images, it may be more difficult for the model to detect the upper edges, which are variable in height. Overall, the feature maps show that the network learned specific features of the maize cob and the image background.

Fig. 3

Feature map visualizations and improved segmentation throughout learning. A Examples of feature map visualizations on ResNet-101 (for an explanation, see Materials and Methods). a An early layer shows activations around the cob shape and the ruler on the right. b The next layer shows clearer cob shapes with activations mainly at the top and bottom of cobs. c A later layer shows different activations inside the cob. d In the last layer, the background is well separated from cobs and rulers. B Visualization of the main detection procedure of Mask R-CNN. a The top 50 anchors obtained from the region proposal network (RPN), after non-max suppression. b, c, d show further bounding box refinement and e shows the output of the detection network: mask prediction, bounding box prediction and class label. All images are square with black padding because images are internally resized to a square format for more efficient matrix multiplication operations

The Mask R-CNN detection process can be visualized by its main steps, which we demonstrate using the best model (Fig. 3B). The top 50 anchors are output by the Region Proposal Network (RPN) and the anchored boxes are then further refined. In the early stages of refinement, all boxes already contain a cob or ruler, but boxes containing the same image element have different lengths and widths. In later stages, the boxes are further reduced in size and refined around the cobs and rulers until, in the final stage, mask recognition provides accurate-fitting masks, bounding boxes, and class labels around each recognized cob and ruler.
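The refinement described above relies on non-max suppression to discard redundant boxes that cover the same object (see the caption of Fig. 3B). The following is a minimal sketch of greedy non-max suppression in plain Python. It illustrates the general technique and is not the implementation used inside Mask R-CNN. The box coordinates, the scores and the 0.5 threshold are hypothetical.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (y1, x1, y2, x2)."""
    y1 = max(box_a[0], box_b[0])
    x1 = max(box_a[1], box_b[1])
    y2 = min(box_a[2], box_b[2])
    x2 = min(box_a[3], box_b[3])
    inter = max(0.0, y2 - y1) * max(0.0, x2 - x1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes by descending score and keep a box only if it
    does not overlap an already kept box by more than iou_threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(box_iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two overlapping proposals for the same cob and one proposal for a ruler.
boxes = [(10, 10, 110, 40), (12, 11, 108, 42), (10, 60, 120, 95)]
scores = [0.95, 0.80, 0.90]
kept = non_max_suppression(boxes, scores)
print(kept)
```

The second proposal overlaps the first by an IoU of about 0.87 and is therefore dropped, while the non-overlapping ruler proposal is kept.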

The best Mask R-CNN model for detection and segmentation of both maize cobs and rulers is very robust to image quality and variation. This robustness is evident from a representative subset of ImgOld and ImgNew images that were not used for training and that show a high variation in image quality, backgrounds and diversity of maize cobs (Fig. 4). Both the identification of bounding boxes and object segmentation are highly accurate regardless of image variability. The only inaccuracies in the location of bounding boxes or masks occur at the bottom edge of cobs.

Fig. 4

Examples of detection and segmentation performance on a representative example of diverse images from the Peruvian maize landrace ImgOld (a) and ImgNew (b) image sets including different cob and background colors

Maize model updating on additional image datasets

To extend the use of our model for images of maize cobs taken under different circumstances and in different environments (e.g., in the field), we investigated whether updating our maize model for new image types with additional image data included in the ImgCross and ImgDiv data sufficiently improves the segmentation accuracy of cob and ruler elements compared to a full training process starting again with the standard COCO model. We used the best maize model trained on ImgOld and ImgNew data (model M104, hereafter maize model), which is pre-trained only on the cob and ruler classes. In addition to updating our maize model, we updated the COCO model with the same images. In this context, the COCO model serves as a validation, as it is a standard Mask R-CNN model trained on the COCO image data [38], which contains 80 annotated object classes in 330K images.

Overall, model updating using training images significantly improved the \(AP@\)[0.5:0.95] scores of the additional image datasets (Fig. 5), with scores differing between image sets, initial models, and training set sizes. With standard COCO model weights (Fig. 5a, c), \(AP@\)[0.5:0.95] scores were initially low, down to a value of 0, in which case neither cobs nor rulers were detected. However, scores increased rapidly to up to 0.7 during the first 30 epochs. In contrast, with the pre-trained weights of the maize model (Fig. 5b, d), \(AP@\)[0.5:0.95] scores were already high during the first epochs and then rapidly improved to higher values than with the COCO model. Therefore, object segmentation using additional maize cob image data was significantly better with the pre-trained maize model from the beginning and throughout the model update.

Fig. 5

Improvement of \(AP@\)[.5:.95] scores during 50 epochs of model updating to different maize cob image datasets (a, b: ImgCross; c, d: ImgDiv). Updating on the COCO initial weights/COCO model (a, c) in comparison to updating on the pre-trained maize model (b, d) depends on different amounts of training images, namely 10, 20, 30, 40 or 50 images

Fig. 6

Detection of cob and ruler after updating the pre-trained maize model with different image datasets. a Updating with 10 training images from ImgCross. The original maize model detected only one cob (epoch 0). After one epoch of model updating both cobs were accurately segmented and after epoch 12 the different ruler element was detected. Photo credit: K. Schmid, University of Hohenheim. b Segmentation of various genebank images after updating for 25 epochs with 20 training images from ImgDiv. Photo credits: (Left) (Center) Right: CIMMYT, All photos are available under a Creative Commons License. c Segmentation of cobs and rulers in post-harvest images of the Swiss Rheintaler Ribelmais landrace with the best model from ImgCross without updating on these images. Photo credit: Benedikt Kogler, Verein Rheintaler Ribelmais e.V., Switzerland

Given the high variation in these scores, we determined the contribution of the three factors starting model, training set size and training data set to the observed variation in \(AP@\)[0.5:0.95] scores with an ANOVA. In this analysis, the interactions between dataset and starting model were significant. Based on the lsmeans of these significant interactions (Table 2), updating the pre-trained maize model gave better results than updating the COCO model in both data sets. With respect to training set sizes, \(AP@\)[0.5:0.95] scores of the maize model were essentially the same for different sizes and were always higher than those of the COCO model. In summary, there is a clear advantage in updating a pre-trained maize model over the COCO model for cob segmentation with diverse maize cob image sets.

Table 2 Lsmeans of AP@[.5:.95] score of the significant interactions for model updating, dataset x starting model and starting model x training set size

Descriptive data obtained from cob image segmentation

To demonstrate that the Mask R-CNN model is suitable for large-scale and accurate image analysis, we present the results of a descriptive analysis of 19,867 maize cobs that were identified and extracted from the complete set of images from the Peruvian maize genebank, i.e., the ImgOld and ImgNew data. Here, we focus on the question whether image analysis identifies genebank accessions which are highly heterogeneous with respect to cob traits by using measures of trait variation and multivariate clustering algorithms.

Our goal was to identify heterogeneous genebank accessions that either harbor a high level of genetic variation or are admixed because of co-cultivation of different landraces on farmers' fields or mix-ups during genebank storage. We therefore analysed variation of cob parameters within images to identify genebank accessions with a high phenotypic diversity of cobs, using two different multivariate analysis methods to test the robustness of the classification.

The first approach consisted of calculating a \(Z\)-score of each cob in an image as a measure of deviation from the mean of the image (within-image \(Z\)-scores), applying a PCA to these scores, followed by clustering with CLARA and determining the optimal number of clusters with the average silhouette method. The second approach consisted of calculating a centered and scaled standard deviation of cob parameters for each image, applying a PCA to the values of all images, clustering with \(k\)-means and determining the optimal cluster number with the gap statistic. With both approaches, the best-fitting number of clusters was \(k=2\) with a clear separation between clusters and little overlap along the first principal component (Fig. 7). The distribution of trait values between the two groups shows that they differ mainly by the three RGB colors and cob length (in the \(Z\)-score analysis only), suggesting that cob color tends to be more variable than most morphological traits within genebank accessions. Additional file 1: Figure S1 shows images of genebank accessions classified as homogeneous and variable, respectively.
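The first step of the \(Z\)-score approach can be sketched as follows. This is a minimal illustration for a single trait within one image. It assumes the sample standard deviation as the scaling factor, which is not specified in the text above, and it leaves out the subsequent PCA and clustering steps. The function name and the cob lengths are hypothetical.

```python
import statistics


def within_image_z_scores(values):
    """Z-score of each cob's trait value relative to all cobs in the same image."""
    if len(values) < 2:
        raise ValueError("need at least two cobs per image")
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (assumption)
    if sd == 0:
        # All cobs are identical for this trait: no deviation from the mean.
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]


# Hypothetical cob lengths (cm) of three cobs in one image.
cob_lengths_cm = [10.0, 12.0, 14.0]
z = within_image_z_scores(cob_lengths_cm)
print(z)
```

Because the scores are computed per image, they express how far each cob deviates from the other cobs of the same accession rather than from the whole collection, which is the property needed to detect heterogeneous accessions.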

Fig. 7

Clustering of individual images by their heterogeneity of maize cob traits within images. Clustering approaches with the extracted cob traits. A First two principal components showing the average color of individual cobs (\(n=\mathrm{19,867}\) cobs) (left) and average cob color per analyzed image (\(n=\mathrm{3,302}\) images) (right). The colors of each dot reflect the average RGB values (i.e., the color) of each cob, or image, respectively. B PCA plots showing clusters identified with the multivariate clustering methods CLARA (left) and \(k\)-means clustering (right). C Distribution of cob traits within each method and cluster


Our comparison of three image segmentation methods showed Mask R-CNN to be superior to the classic image analysis method Felzenszwalb-Huttenlocher segmentation and Window-CNN for maize cob detection and segmentation. Given the recent success of Mask R-CNN for image segmentation in medicine or robotics, its application for plant phenotyping is highly promising as demonstrated in strawberry fruit detection for harvesting robots [72], orange fruit detection [18], pomegranate tree detection [73], disease monitoring in wheat [59], and seed analysis in rice and soybean [30, 71]. Here we present another application of Mask R-CNN for maize cob instance segmentation and quantitative phenotyping in the context of genebank phenomics. In contrast to previous studies we performed a statistical analysis on the relative contribution of Mask R-CNN training parameters, and our application is based on more diverse and larger training image sets of 200 and 1,000 images. Finally, we propose a simple and rapid model updating scheme for applying the method on different maize cob image sets to make this method widely useful for cob phenotyping. The provided manuals offer a simple application and update of the deep learning model on custom maize cob datasets.

Advantages and limitations of the method for few-shot learning in agriculture

After optimizing various model parameters, the final Mask R-CNN model detected and segmented cobs and rulers very reliably with a very high \(AP@\left[.5:.95\right]\) score of 87.7, enabling accurate and fast extraction of cob features. Since such scores, which are mainly used in deep learning, have not been reported for existing pipelines for maize cob annotation, we compared our score to other contexts of image analysis and plant phenotyping where these parameters are available. Our score is higher than the original Mask R-CNN implementation on COCO with Cityscapes images [55], possibly due to a much smaller number of classes (2 versus 80) in our dataset. Depending on the backend network, the score of the original implementation ranged between 26.6 and 37.1. The maize cob score is also greater than the score of 57.5 in the test set for pomegranate tree detection [73] and comparable to a score of 89.85 for strawberry fruit detection [72]. Compared to such Mask R-CNN implementations on other crops, our method reached similar or even higher accuracy while requiring substantially fewer images. Only a small dataset of 200 images was required for the initial training, and only a few images (10–20) are needed for model updating on a custom image set. Thereby, this method has the potential to contribute to few-shot learning in agriculture if applied to other crops or plant phenotypes. By releasing relevant Mask R-CNN parameters for fine-tuning the model to a specific application such as maize cobs, this work facilitates the development of standard Mask R-CNN models for different crops or plant phenotypes. A unique Mask R-CNN model covering many crops and plant phenotypes is unrealistic in the short term due to the very different plant features and the unavailability of large annotated image data sets. However, such a goal could be achieved in an open source project with a large and diverse set of annotated crop images and extensive model training, similar to the Model Zoo project.

Although both maize cob and ruler detection and segmentation performed well, we observed minor inaccuracies in some masks. A larger training set did not improve precision or eliminate these inaccuracies, as the resolution of the mask branch in the Mask R-CNN framework may be too low. This could be improved by adding a convolutional layer of, for example, 56 \(\times\) 56 pixels instead of the usual 28 \(\times\) 28 pixels at the cost of longer computing time.

Mask R-CNN achieved higher correlation coefficients between true and predicted cob measurements than existing image analysis methods, which reported coefficients of \(r=0.99\) for cob length, \(r=0.97\) for cob diameter [40] and \(r=0.93\) for cob diameter [45]. Our Mask R-CNN achieved coefficients of \(r=0.99\) for cob diameter and \(r=1\) for cob length. Such correlations are a remarkable improvement considering that they were obtained with the highly diverse and inhomogeneous ImgOld and ImgNew image data (Fig. 8 and Additional file 1: Table S4), whereas previous studies used more homogeneous images with respect to color and shape of elite maize hybrid breeding material taken with uniform backgrounds. The high accuracy of Mask R-CNN indicates the advantage of learning specific cob and ruler patterns with deep learning.

Fig. 8

Variability of image properties among the complete dataset (containing ImgOld and ImgNew)

Another feature of our automated pipeline is the simultaneous segmentation of cob and ruler, which allows pixel measurements to be instantly converted to centimeters and morphological measurements to be returned. Such an approach was also used by Makanza et al. [40], but no details on ruler measurements or accuracy of ruler detection were provided. The ability to detect rulers and cobs simultaneously is advantageous in a context where professional imaging equipment is not available, such as agricultural fields.
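The conversion from pixels to centimeters via the detected ruler can be illustrated with a small sketch. It assumes that the ruler's length in pixels and its physical length are both known, that cobs are oriented vertically, and that length and diameter are taken from the bounding box of the binary cob mask. These definitions, the function names and the toy values are hypothetical simplifications and may differ from the trait definitions used in the pipeline.

```python
def pixels_per_cm(ruler_length_px, ruler_length_cm):
    """Scale factor derived from a ruler of known physical length."""
    if ruler_length_px <= 0 or ruler_length_cm <= 0:
        raise ValueError("ruler lengths must be positive")
    return ruler_length_px / ruler_length_cm


def cob_traits(mask, px_per_cm):
    """Length, diameter and aspect ratio from the bounding box of a binary mask.

    Assumes a vertically oriented cob; aspect ratio is defined here as
    length / diameter (an assumption for illustration).
    """
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, v in enumerate(row) if v]
    if not rows:
        raise ValueError("empty mask")
    height_px = max(rows) - min(rows) + 1
    width_px = max(cols) - min(cols) + 1
    length_cm = height_px / px_per_cm
    diameter_cm = width_px / px_per_cm
    return {
        "length_cm": length_cm,
        "diameter_cm": diameter_cm,
        "aspect_ratio": length_cm / diameter_cm,
    }


# Toy example: a ruler spanning 20 px that represents 10 cm, and a 4 x 2 px cob.
scale = pixels_per_cm(ruler_length_px=20, ruler_length_cm=10)
mask = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0]]
traits = cob_traits(mask, scale)
print(traits)
```

Because the scale factor is derived per image, images taken at different distances or resolutions yield comparable measurements as long as a ruler is visible in each of them.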

Selection of training parameters to reduce annotation and training workload

Our Mask R-CNN workflow consists of annotating the data, training or updating the model, and running the pipeline to automatically extract features from the maize cobs. The most time-consuming and resource-intensive step was the manual annotation of cob images to provide labeled images for training, which took several minutes per image, but can be accelerated by supporting software [12]. In the model training step, model weights are learned automatically from the annotated images, which is a major advantage over existing maize cob detection pipelines that require manual fine-tuning of parameters for different image datasets using operations such as thresholding, filtering, water-shedding, edge detection, corner detection, blurring and binarization [40, 45, 68].

Statistical analysis of the Mask R-CNN training parameters helps to reduce the amount of annotation and fine-tuning required (Tables 1 and 2). For example, there was no significant improvement with a large training set of 1,000 compared to 200 images, as learning and segmenting only two object classes seems to be a simple task for Mask R-CNN. Therefore, we do not expect further model improvement on a set of more than 1,000 images, and the significant amount of work involved in manual image annotation can be reduced if no more than 200 images need to be annotated. Since many training parameters did not have a strong impact on the final model result, such parameters do not need to be fine-tuned. For example, using all layers instead of only the network heads (only the last part of the network involving the fully-connected layers) did not significantly improve the final detection result. Training image datasets with only a few object classes on network heads greatly reduces the runtime for model training.

Technical equipment and computational resources for deep learning

The robustness of the Mask R-CNN approach imposes only simple requirements for creating images for both training and application purposes. RGB images taken with a standard camera are sufficient. In contrast, neural network training requires significant computational resources and is best performed on a high performance computing cluster or on GPUs with significant amounts of RAM. Training of the 90 different models (Additional file 1: Table S6) was executed over 3 days, using 4 parallel GPUs on a dedicated GPU cluster. However, once the maize model is trained, model updating with only a few annotated images from new maize image data does not require a high performance computing infrastructure anymore, as in our case updating with 20 images was achieved in less than an hour on a normal workstation with 16 CPU threads and 64 GB RAM.

Model updating with the pre-trained maize model on two different image datasets, ImgCross and ImgDiv, significantly improved the \(AP@\left[.5:.95\right]\) score for cob and ruler segmentation on the new images. The improvement was achieved despite additional features in the new image data that were absent from the training data. New features include rotated images, cobs in a different orientation (horizontal instead of vertical) and different backgrounds (Fig. 6). The advantage of the pre-trained maize model over the standard COCO model was independent of the image dataset, and the pre-trained model achieved higher \(AP@\left[.5:.95\right]\) scores with a small number of epochs (Fig. 5). It therefore saves training time for new image types, is widely applicable, and can be easily transferred to new applications for maize cob phenotyping. Importantly, the initial training set is not required for model updating. Our analyses indicate that only 10–20 annotated new images are required and the update can be limited to 50 epochs. The updated model can then be tested on the new image dataset, either by visual inspection of the detection or by annotating some validation images to obtain a rough estimate of the \(AP@\left[.5:.95\right]\) score. The phenotypic traits can then be extracted by the included post-processing workflow, which itself only needs to be modified if additional parameters are to be implemented.

The runtime of the pipeline after model training is short. Image segmentation with the trained Mask R-CNN model and parameter estimation of eight cob traits took an average of 3.6 s per image containing an average of six cobs. This time is shorter than that of previously published pipelines (e.g., 13 s per image in [45]), although it should be noted that such comparisons are not based on the same hardware and the same set of traits. For example, the pipeline for three-dimensional cob phenotyping performs a flat projection of the surface of the entire cob, but is additionally capable of annotating individual cob kernels, and the total time for analyzing a single cob is 5–10 min [68]. The ear digital imaging (EDI) pipeline of Makanza et al. [40] processes more than 30 unthreshed ears at the same time and requires more time per image at 10 s, but also extracts more traits. However, this pipeline was developed on uniform and standardized images and does not involve a deep learning approach to make it generally applicable.

Application of the Mask R-CNN pipeline for genebank phenomics

To demonstrate the utility of our pipeline, we applied it to original images of maize cobs collected from farmers’ fields during the establishment of the official maize genebank in Peru in the 1960s and 1970s (ImgOld) and to more recent photographs taken during the regeneration of existing maize material in 2015 (ImgNew). The native maize diversity of Peru was divided into individual landraces based mainly on cob traits. Our interest was to identify genebank accessions with high or low diversity of cob traits within accessions to classify accessions as ’pure’ representatives of a landrace or as accessions with high levels of native genetic diversity, evidence of recent gene flow, or random admixture of different landraces. We used two different approaches to characterize the amount of variation for each trait within the accessions based on the eight traits measured by our pipeline. Unsupervised clustering of variance measures identified two groups of accessions that differed in their overall level of variation. The distribution of normalized variance parameters (Z-scores and standard deviations) within both groups indicates that variation in cob color has the strongest effect on variation within genebank accessions, suggesting that cob color is more variable than morphometric characters like cob length or cob diameter. This information is useful for subsequent studies of the relationship between genetic and phenotypic variation in native maize diversity, the geographic patterns of phenotypic variation within landraces, or the effect of seed regeneration during ex situ conservation on phenotypic diversity, which we are currently investigating in a separate study.


We present the successful application of deep learning by Mask R-CNN to maize cob segmentation in the context of genebank phenomics by developing a pipeline written in Python for large-scale image analysis of highly diverse maize cobs. We also developed a post-processing workflow to automatically extract measurements of eight phenotypic cob traits from cob and ruler masks obtained with Mask R-CNN. In this way, cob parameters were extracted from 19,867 individual cobs with a fast automated pipeline suitable for high-throughput phenotyping. Although the Mask R-CNN model was developed based on the native maize diversity of Peru, the model can be easily used and updated for additional image types in contexts like the genetic mapping of cob traits or in breeding programs. It is therefore of general applicability in maize breeding and research, and for this purpose we provide simple manuals for maize cob detection, parameter extraction and deep learning model updating. Future developments of the pipeline may include linking it to mobile phenotyping devices for real-time measurements in the field and using the large number of segmented images to develop refined models for deep learning, for example, to estimate additional parameters such as row numbers or characteristics of individual cob kernels.

Materials and methods

Plant material

The plant material used in this study is based on 2,484 genebank accessions of 24 Peruvian maize landraces collected from farmers’ fields in the 1960s and 1970s, which are stored in the Peruvian maize genebank hosted at the Universidad Agraria La Molina (UNALM), Peru. These accessions originate from the three different ecogeographical environments (coast, highland and rainforest) present in Peru and therefore represent a broad sample of Peruvian maize diversity.

Image data of maize cobs

All accessions were photographed during their genebank registration. An image was taken with a set of 1–12 maize cobs per accession laid out side by side with a ruler and accession information. Because the accessions were collected over several years, the images were not taken under the same standardized conditions of background, rulers and image quality. Prints of these photographs were stored in light-protected cupboards of the genebank, digitized with a flatbed scanner in 2015 and stored as PNG files without further image processing. In addition, all genebank accessions were regenerated in 2015 at three different locations reflecting their ecogeographic origin, and the cobs were photographed again with modern digital equipment under standardized conditions and also stored as PNG images. The image data thus consist of 1,830 original (ImgOld) and 1,619 new (ImgNew) images for a total of 3,449 images. Overall, the images show a high level of variation due to technical and genetic reasons, which are outlined in Fig. 8. These datasets were used for training and evaluation of the image segmentation methods. Passport information available for each accession and their assignment to the different landraces is provided in Additional file 1: Table S5. All images were re-scaled to a size of 1000 × 666 pixels with OpenCV, version 3.4.2 [7].

We used two different datasets for updating the image segmentation models and evaluating their robustness. The ImgCross image dataset contains images of maize cobs and spindles derived from a cross of Peruvian landraces with a synthetic population generated from European elite breeding material and therefore reflects genetic segregation in the F2 generation. The images were taken with a digital camera at the University of Hohenheim under standardized conditions and differ from the other datasets by a uniform green background, a higher resolution of 3888 × 2592 pixels (no re-sizing), a variable orientation of the cobs, orange labels and differently colored squares instead of a ruler.

A fourth set of images (ImgDiv) was obtained mainly from publicly available South American maize genebank catalogs and from special collections available as downloadable figures on the internet. The ImgDiv data vary widely in terms of number and color of maize cobs, image dimensions and resolution, number, position and orientation of cobs. Some images also contain rulers as in ImgOld and ImgNew.

Software and methods for image analysis

Image analysis was mainly performed on a workstation running Ubuntu 18.04 LTS and the analysis code was written in Python (version 3.7; [63]) for all image operations. OpenCV (version 3.4.2 [7]) was used to perform basic image operations like resizing and contour finding.

For Window-CNN and Mask R-CNN, deep learning was performed with the TensorFlow (version 1.5.0; [1]) and Keras (version 2.2.4; [10]) libraries. For Mask R-CNN, the framework [25] from the matterport implementation (Mask_RCNN) was used and adapted to the requirements of the maize cob image datasets. Statistical analyses for evaluating the contribution of different parameters in Mask R-CNN and for the clustering of the obtained cob traits were carried out with R version 3.6.3 [54].

Due to the lack of previous studies on cob image analysis in maize genetic resources, we tested three very different approaches (Felzenszwalb-Huttenlocher segmentation, Window-CNN and Mask R-CNN) for cob and ruler detection and image segmentation. Details on their implementation and comparison can be found in Additional file 2: Text, but our approach is briefly described below. For image analysis using traditional approaches, we first applied various tools such as filtering, water-shedding, edge detection and corner detection to representative subsets of ImgOld and ImgNew. These algorithms can be tested quickly and easily on image subsets; however, they are usually not robust to changes in image properties (i.e. color, brightness, contrast, object size) and require manual fine-tuning of parameters. With our image dataset, the best segmentation results were obtained with the graph-based Felzenszwalb-Huttenlocher image segmentation algorithm [15] implemented in the Python scikit-image library version 0.16.2 [66] and the best ruler detection with the naive Bayes classifier implemented in the PlantCV library [19]. The parameters had to be manually fine-tuned for each of the two image datasets.

To evaluate deep learning, we used a window-based (Window-CNN) and a Mask R convolutional neural network (Mask R-CNN), both of which require training on annotated and labeled image data. Convolutional Neural Networks [36] (CNNs) are among the most powerful feature extractors, and their popularity for image classification dates back to the ImageNet classification challenge, which was won by the architecture AlexNet [35]. Generally, a CNN consists of three different layer types that are connected in sequence: convolutional layers, pooling layers and fully-connected (FC) layers. In a CNN for cob detection, the classes ‘cob’ and ‘ruler’ can be learned as features using deep learning, which provides maize cob feature extraction independent of the challenges in diverse images like scale, cob color, cob shape, background color and contrast.

Since our goal was to localize and segment the cobs within the image, we first used a sliding window CNN (Window-CNN), which passes one part of an image at a time to a CNN and returns the probability that it contains a particular object class. Sliding windows have been used in plant phenotyping to detect plant segments [3, 9]. The main advantage of this method is the ability to customize the CNN structure to optimize automatic learning of object features. Our implementation of Window-CNN is described in detail in Additional file 2: Text.
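
The sliding-window idea can be sketched as follows. This is only a minimal illustration and not the Window-CNN implementation described in Additional file 2; the function name `sliding_windows`, the window size and the stride are assumptions chosen for the example, and the CNN that would score each crop is omitted.

```python
from typing import Iterator, Tuple


def sliding_windows(width: int, height: int, win: int, stride: int) -> Iterator[Tuple[int, int, int, int]]:
    """Yield (x, y, w, h) boxes that sweep over an image of the given size.

    Each box would be cropped and passed to a classifier CNN that returns
    the probability of an object class for that crop (not shown here).
    """
    for y in range(0, height - win + 1, stride):
        for x in range(0, width - win + 1, stride):
            yield (x, y, win, win)


# Hypothetical settings: a 1000 x 666 image, 200-pixel windows, 100-pixel stride.
boxes = list(sliding_windows(1000, 666, win=200, stride=100))
```

Because every window requires its own CNN evaluation, the number of windows grows quickly as the stride shrinks, which illustrates why window-based detection tends to be slow.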

Since sliding window CNNs have low accuracy and very long runtime, feature maps are used to filter out putative regions of interest on which boxes are refined around objects. Mask R-CNN [25] is the most recent addition to the family of R-CNNs [21] and includes a Region Proposal Network (RPN) to reduce the number of bounding boxes by passing only \(N\) region proposals that are likely to contain some object to a detection network block. The detection network generates the final object localizations along with the appropriate classes from the RPN proposals and the appropriate features from the feature CNN. Mask R-CNN extends a Fast R-CNN [20] with a mask branch of two additional convolutional layers that perform additional instance segmentation and return a pixel-wise mask for each detected object containing a bounding box, a segmentation mask and a class label. We tested Mask R-CNN on our maize cob image set to investigate the performance of a state-of-the-art deep learning object detection, classification and segmentation framework. The method requires time-consuming image annotation and expensive computational resources (high memory and GPUs).

Implementation of Mask R-CNN to detect maize cobs and rulers

The training image data (200 or 1,000 images) were randomly selected from the two datasets ImgOld and ImgNew to achieve maximum diversity in terms of image properties (Additional file 1: Tables S1, S8). Both subsets were each randomly divided into a training set (75%) and a validation set (25%). Both image subsets were annotated using the VGG Image Annotator (VIA; version 2.0.8 [13]). A pixel-precise mask was drawn by hand around each maize cob (Additional file 1: Figure S2). The ruler was labeled with two masks, one for the horizontal part and one for the vertical part, which facilitates later prediction of the bounding boxes of the ruler compared to annotating the entire ruler element as one mask. Each mask was labeled as "cob" or "ruler", and the annotations for training and validation sets were exported separately as JSON files.
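
A random 75%/25% division of an annotated subset could look like the sketch below. The function name `split_train_val`, the fixed seed and the file names are hypothetical and only serve the illustration; the text does not state how the random split was implemented.

```python
import random


def split_train_val(image_names, train_fraction=0.75, seed=0):
    """Shuffle image names and split them into training and validation lists."""
    names = list(image_names)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(names)
    n_train = round(len(names) * train_fraction)
    return names[:n_train], names[n_train:]


# Hypothetical file names standing in for a 200-image annotated subset.
images = [f"img_{i:04d}.png" for i in range(200)]
train, val = split_train_val(images)
```

With 200 images this yields 150 training and 50 validation images, and no image appears in both lists.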

The third step consisted of model training on multiple GPUs using a standard TensorFlow implementation of Mask R-CNN for maize cob and ruler detection. We used the pre-trained weights of the COCO model, which is the standard model [25] derived from training on the MS COCO dataset [38], with a ResNet-101 backbone (transfer learning). The original Mask R-CNN implementation was modified by adding two classes for cob and ruler in addition to the background class. Instead of saving a model after each training epoch, only the best model, i.e. the one with the lowest validation loss, was saved to save memory. For training the Mask R-CNN models, we used Tesla K80 GPUs with 12 GB RAM each on the BinAC GPU cluster at the University of Tübingen.

We trained 90 different models with different parameter settings (Additional file 1: Tables S1, S6) on both image datasets. The learning rate parameter learningrate was set to vary from \({10}^{-3}\), as in the standard implementation, to \({10}^{-5}\), since models with smaller datasets often suffer from overfitting, which may require smaller steps in learning the model parameters. Training was performed over 15, 50, or 200 epochs (epochsoverall) to capture potential overfitting issues. The parameter epochs.m distinguishes between training only the heads, or training the heads first, followed by training on the complete layers of resnet101. The latter requires more computation time, but offers the possibility to fine tune not only the heads, but all the layers to obtain a more accurate detection. The mask loss weight (masklossweight) was given the value of 1, as in the default implementation, or 10, which means a higher focus on reducing mask loss. The monitor metric (monitor) for the best model checkpoint was set to vary between the default validation loss and the mask validation loss. The latter option was tested to optimize preferentially for mask creation, which is usually more challenging than determining object class, bounding box loss, etc. The use of the minimask (minimask) affects the accuracy of mask creation and in the default implementation consists of a resizing step before the masks are forwarded by the CNN during the training process.
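
To give a sense of the size of the parameter space, the sketch below enumerates every combination of the parameter levels named in this section and in the model selection described further down, together with the two training set sizes. It is an illustration only: the dictionary `levels` and the variable `full_grid` are hypothetical names, and the 90 settings actually trained are those listed in Additional file 1: Table S6, not the full grid built here.

```python
from itertools import product

# Parameter levels as named in the text; keys follow the paper's labels.
levels = {
    "learningrate": [1e-3, 1e-4, 1e-5],
    "epochs.m": [
        "heads only",
        "20 epochs heads, then all layers",
        "10 epochs heads, then all layers",
    ],
    "epochsoverall": [15, 50, 200],
    "masklossweight": [1, 10],
    "monitor": ["val loss", "mask val loss"],
    "minimask": [True, False],
    "training set size": [200, 1000],
}

# Full factorial grid: one dict per combination of levels.
full_grid = [dict(zip(levels, combo)) for combo in product(*levels.values())]
```

The full grid contains 432 combinations, some of which may not be meaningful (for example, 20 epochs of head training within 15 epochs overall), so training a subset of 90 settings is a considerable saving in computation.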

The performance of these models for cob and ruler detection was evaluated by the IoU (Intersection over Union) score or Jaccard index [29], which is the most popular metric to evaluate the performance of object detectors. The IoU score between a predicted and a true bounding box is calculated by

$$IoU=\frac{\text{Area of Overlap}}{\text{Area of Union}}$$

The most common threshold for IoU is 50% or 0.5. With IoU values above 0.5, the predicted object is considered a true positive (TP), otherwise a false positive (FP). Precision is calculated by

$$P=\frac{TP}{TP+FP}$$

The average precision (AP) was calculated by averaging \(P\) over all ground-truth objects of all classes in comparison to their predicted boxes, as demonstrated in various challenges and improved network architectures [14, 26, 57].
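
The IoU and precision calculations can be sketched in a few lines. This is a simplified illustration under stated assumptions: boxes are given as (x1, y1, x2, y2) tuples, every prediction is assumed to have already been matched to a ground-truth object, and the function names `box_iou` and `precision` are hypothetical rather than part of the pipeline.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision(ious, threshold=0.5):
    """TP / (TP + FP): a prediction counts as TP if its IoU exceeds the threshold."""
    if not ious:
        return 0.0
    tp = sum(1 for v in ious if v > threshold)
    fp = len(ious) - tp
    return tp / (tp + fp)


# Hypothetical example: one ground-truth box and two predicted boxes.
truth = (0, 0, 10, 20)
preds = [(0, 0, 10, 16), (5, 0, 15, 20)]
ious = [box_iou(p, truth) for p in preds]
```

In this example the first prediction overlaps the ground truth well and counts as a true positive, while the second covers only half of it and counts as a false positive.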

Following the primary challenge metric of the COCO dataset [44], the goodness of our trained models was also scored by \(AP@\left[.5:.95\right]\), sometimes also just called AP, which is the average AP over different IoU thresholds from 50 to 95% in 5% steps. In contrast to usual object detection models where IoU/AP metrics are calculated for boxes, in the following IoU relates to the masks [55], because this explores the performance of instance segmentation. We performed an ANOVA on the scores of the 90 models to evaluate the individual impact of the parameters on the \(AP@\left[.5:.95\right]\) score. Logit transformation was applied to meet the assumptions of homogeneity of variance and normal distribution (Additional file 1: Figure S4). Model selection was carried out including the parameters learningrate (\({10}^{-3},{10}^{-4},{10}^{-5}\)), epochs.m (1: only heads, 2: 20 epochs heads, 3: 10 epochs heads; for the rest all model layers trained), epochsoverall (15, 50, 200), masklossweight (1, 10), monitor (val loss, mask val loss) and minimask (yes, no). All two-way interactions were also included in the model, dropping non-significant interactions first and then non-significant main effects if none of their interactions were significant.
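
The averaging over IoU thresholds and the logit transformation can be illustrated as follows. This sketch is a simplification: the per-threshold score is plain precision over already matched mask IoUs rather than the full COCO AP computation, and the names `ap_over_thresholds`, `logit` and `inv_logit` as well as the IoU values are hypothetical.

```python
import math

# IoU thresholds 0.50, 0.55, ..., 0.95 (ten values).
thresholds = [(50 + 5 * i) / 100 for i in range(10)]


def ap_over_thresholds(score_at):
    """Average a per-threshold score function over the ten IoU thresholds."""
    return sum(score_at(t) for t in thresholds) / len(thresholds)


def logit(p):
    """Map a score in (0, 1) to the real line before fitting the ANOVA."""
    return math.log(p / (1 - p))


def inv_logit(x):
    """Back-transform a value on the logit scale to a score in (0, 1)."""
    return 1 / (1 + math.exp(-x))


# Hypothetical mask IoUs of four matched predictions.
mask_ious = [0.97, 0.91, 0.72, 0.58]
score = ap_over_thresholds(lambda t: sum(v > t for v in mask_ious) / len(mask_ious))
transformed = logit(score)
```

The back-transformation with `inv_logit` is what turns estimates obtained on the logit scale into values that can be read as scores between 0 and 1.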

These results allow us to formulate the following final model to describe the contributions of the parameters to Mask R-CNN performance:

$${y}_{ijh}=\mu +{b}_{i}+{v}_{j}+{k}_{h}+{\left(bk\right)}_{ih}+{e}_{ijh}$$

where \(\mu\) is the general effect, \({b}_{i}\) the effect of the \(i\)-th minimask, \({v}_{j}\) the effect of the \(j\)-th overall number of epochs, \({k}_{h}\) the effect of the \(h\)-th training set size, the interaction effect between the number of epochs and the training set size and \({e}_{ijh}\) the random deviation associated with \({y}_{ijh}\). We calculated ANOVA tables, back-transformed lsmeans and contrasts (confidence level of 0.95) for the significant influencing variables. As last step of model training, we set up a workflow with the best model as judged by its \(AP@\left[.5:.95\right]\) score and performed random checks whether objects were detected correctly.

Workflow for model updating with new pictures

To investigate the updating ability of Mask R-CNN on different maize cob image datasets, we additionally annotated 150 images (50 training, 100 validation images) from each of the ImgCross and ImgDiv datasets. For ImgCross, the high resolution of \(3888\times 2592\) pixels was maintained, but 75% of the images were rotated (25% by 90°, 25% by 180°, and 25% by 270°) to increase diversity. The corn cob spindles on these images were also labeled as cobs and the colored squares were labeled as rulers. The ImgDiv images were left at their original resolution and annotated with the cob and ruler classes.
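
One way to assign the rotations so that roughly a quarter of the images receives each angle is sketched below. How the rotated images were actually chosen is not described in the text, so the random assignment, the function name `assign_rotations` and the file names are assumptions made for this example.

```python
import random


def assign_rotations(image_names, seed=0):
    """Assign 0, 90, 180 or 270 degrees to about a quarter of the images each."""
    names = list(image_names)
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    rng.shuffle(names)
    angles = [0, 90, 180, 270]
    # After shuffling, cycling through the angles gives each angle about 25%.
    return {name: angles[i % 4] for i, name in enumerate(names)}


# Hypothetical file names standing in for the 150 annotated ImgCross images.
images = [f"cross_{i:03d}.png" for i in range(150)]
rotation = assign_rotations(images)
```

With 150 images, 112 of them (about 75%) receive a non-zero rotation, and the remaining images keep their original orientation.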

The model weights of the best model (M104) obtained by training with ImgOld and ImgNew were used as initial weights and updated with ImgCross and ImgDiv images. Based on the statistical analysis, optimal parameter levels of the main parameters were used and only the network heads were trained with a learning rate of \({10}^{-3}\) for 50 epochs without the minimask. Training was performed with different randomly selected sets (10, 20, 30, 40, and 50 images) to evaluate the influence of the number of images on the quality of model updating. For each training run, all models with an improvement step in validation loss were saved, and the \(AP@\left[.5:.95\right]\) score was calculated for each of them. For comparison, all combinations of models were also trained with the standard COCO weights.
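
Saving only the models with an improvement step in validation loss amounts to keeping the epochs at which the loss reaches a new minimum. The sketch below shows this selection on made-up loss values; the function name `improving_epochs` is hypothetical. In Keras, comparable behavior is available through the `ModelCheckpoint` callback with `save_best_only=True`, although whether the pipeline relies on it is not stated here.

```python
def improving_epochs(val_losses):
    """Return the 1-based epochs at which validation loss reached a new minimum."""
    best = float("inf")
    kept = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            kept.append(epoch)
    return kept


# Hypothetical validation losses for six epochs.
losses = [0.90, 0.70, 0.75, 0.60, 0.62, 0.55]
kept = improving_epochs(losses)
```

Here the models from epochs 1, 2, 4 and 6 would be kept and scored, while epochs 3 and 5 would be skipped because the validation loss did not improve.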

Statistical analysis of model updating results

To evaluate the influence of the dataset, the starting model, and the size of the training set, an ANOVA was performed on the \(AP@\left[.5:.95\right]\) scores from all epochs and combinations. Logit transformation was applied to meet the assumptions of homogeneity of variance and normal distribution. Epoch was included as a covariate. Forward model selection was performed using the parameters dataset (ImgCross, ImgDiv), starting model (COCO, pre-trained maize model), and training set size (10, 20, 30, 40, 50). All two-way and three-way parameter interactions were included in the model. Because the three-way interaction was not significant, the significant two-way interactions and significant main effects were retained in the final model, which can be denoted as follows:

$${y}_{ijh}=\mu +{c}_{i}+{n}_{j}+{k}_{h}+{\left(cn\right)}_{ij}+{\left(nk\right)}_{jh}+{e}_{ijh}$$

ANOVA tables, back-transformed lsmeans and p-values (Additional file 1: Tables S7 and S8; confidence level of 0.95) for the significant influencing variables were calculated.

Post-processing of segmented images for automated measurements and phenotypic trait extraction

Mask R-CNN images are post-processed (Fig. 9) with an automated pipeline to extract phenotypic traits of interest that are relevant either for maize yield (i.e. cob measurements) or for genebank phenomics (i.e. cob shape or color descriptors to differentiate between landraces). The Mask R-CNN model returns a list of labeled masks, which are separated into cob and ruler masks for subsequent analysis. Contour detection is applied to binarized ruler masks to identify individual black or white ruler elements, whose lengths in pixels are then averaged over the elements of a ruler to obtain a pixel value per cm for each image. Length and diameter of cob masks are then converted from pixel into cm values using the average ruler lengths. The cob masks are also used to calculate the mean RGB color of each cob. In contrast to a similar approach by Miller et al. [45], who sampled pixels from the middle third of cobs for RGB color extraction, we used the complete cob mask because kernel color was variable throughout the cob in highly diverse image data. We also used the complete cob mask to extract cob shape parameters that include asymmetry and ellipticity, similar to a previous study of avian eggs [58], which characterized egg shape diversity using the morphometric equations of Baker [6]. Since our image data contained a high diversity of maize cob shapes, we reasoned that shape parameters like asymmetry and ellipticity are useful for a morphometric description of maize cob diversity. For examples of symmetrical/asymmetrical and round/elliptical cobs, please refer to Additional file 1: Figure S3. Overall, the following phenotypic traits were extracted from 19,867 cobs: diameter, length, aspect ratio (length/diameter), asymmetry, ellipticity and mean RGB color separated by red, green and blue channels. Our pipeline returned all cob masks as .jpg images for later analysis of additional parameters.
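
The conversion from pixels to centimeters and the extraction of size and color traits from a cob mask can be sketched as below. This is a dependency-free illustration on plain Python lists, not the pipeline code: the function names `pixels_per_cm` and `cob_traits` are hypothetical, each ruler element is assumed to be 1 cm long (`element_cm`), and length and diameter are taken from the axis-aligned bounding box of the mask, which assumes an upright cob. Asymmetry and ellipticity are not covered.

```python
def pixels_per_cm(element_lengths_px, element_cm=1.0):
    """Average ruler-element length in pixels divided by its physical length."""
    mean_px = sum(element_lengths_px) / len(element_lengths_px)
    return mean_px / element_cm


def cob_traits(mask, image, px_per_cm):
    """Length, diameter, aspect ratio and mean RGB of one cob.

    mask  : 2D list of 0/1 values (1 = cob pixel)
    image : 2D list of (R, G, B) tuples with the same shape as mask
    """
    pixels = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    rows = [r for r, _ in pixels]
    cols = [c for _, c in pixels]
    length_cm = (max(rows) - min(rows) + 1) / px_per_cm
    diameter_cm = (max(cols) - min(cols) + 1) / px_per_cm
    n = len(pixels)
    # Mean color over all pixels of the complete cob mask.
    mean_rgb = tuple(sum(image[r][c][k] for r, c in pixels) / n for k in range(3))
    return {
        "length": length_cm,
        "diameter": diameter_cm,
        "aspect_ratio": length_cm / diameter_cm,
        "mean_rgb": mean_rgb,
    }


# Tiny hypothetical example: a 4 x 2 pixel "cob" in a 5 x 4 image.
mask = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
image = [[(200, 150, 50)] * 4 for _ in range(5)]
traits = cob_traits(mask, image, pixels_per_cm([2.0, 2.0, 2.0]))
```

In a real image the masks would be arrays returned by the model and the ruler element lengths would come from contour detection, but the arithmetic stays the same.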

Fig. 9

Post-processing of segmented images using a Mask R-CNN workflow that analyses segments labeled as ‘cob’ and ‘ruler’ to extract the parameters cob length, diameter, mean RGB color, and the shape parameters ellipticity and asymmetry. Cob length and diameter measures in pixels are converted to cm values by measuring the contours of single ruler elements

Quantitative comparison between Felzenszwalb-Huttenlocher segmentation, Window-CNN and Mask R-CNN

For quantitative comparisons between the three image segmentation methods, a subset of 50 images from ImgOld and 50 images from ImgNew was randomly selected. None of these images was included in the training data of Window-CNN or Mask R-CNN, so the comparison is not biased by overfitting to the training data. True measurements of cob length and diameter were obtained using the annotation tool VIA [13]. Individual cob dimensions per image could not be directly compared to predicted cob dimensions because Felzenszwalb-Huttenlocher segmentation and Window-CNN often returned boxes containing multiple cobs, or certain cobs were contained in multiple boxes. Therefore, the mean of the predicted cob width and length per image was calculated for each approach, which penalizes incorrectly predicted boxes. Pearson correlation was calculated between the true and predicted mean diameter and length of the cob per image separately for the ImgOld and ImgNew sets.
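
The per-image comparison can be sketched as follows: cob lengths are averaged per image for the annotated and the predicted measurements, and the two series of image means are then correlated. The function names `pearson` and `image_means` and all values are hypothetical and only illustrate the calculation.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def image_means(values_per_image):
    """Mean measurement per image from a list of per-image value lists."""
    return [sum(v) / len(v) for v in values_per_image]


# Hypothetical cob lengths (cm) for three images: annotated vs. predicted.
true_lengths = [[10.0, 12.0], [8.0, 9.0, 10.0], [15.0]]
pred_lengths = [[10.5, 11.5], [8.0, 9.5, 9.5], [14.0, 15.0]]
r = pearson(image_means(true_lengths), image_means(pred_lengths))
```

The third image has one annotated cob but two predicted boxes, which is the situation that makes a cob-by-cob comparison impossible and motivates the use of image means.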

Unsupervised clustering to detect images with high cob diversity

To identify genebank accessions with high phenotypic diversity in ImgOld and ImgNew images, we used two different unsupervised clustering methods. In the first approach, individual cob features (width, length, asymmetry, ellipticity, and mean RGB values) were scaled after their extraction from the images. The Z-score of each cob was calculated as \(Z_{ij} = \frac{{x_{ij} - \dot{X}_{j} }}{{S_{j} }}\), where \({Z}_{ij}\) is the Z-score of the \(i\)-th cob in the \(j\)-th image, \({x}_{ij}\) is a measurement of the \(i\)-th cob of the \(j\)-th image, and \(\dot{X}_{j}\) and \({S}_{j}\) are the mean and the standard deviation of the \(j\)-th image, respectively. The scaled dataset was analyzed using CLARA (Clustering LARge Applications), which is a multivariate clustering method suitable for large datasets, using the cluster R package [39]. The optimal cluster number was determined by the average silhouette method implemented in the R package factoextra [33].
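
The within-image scaling can be written in a few lines. This sketch assumes the sample standard deviation with the n - 1 denominator, which the text does not specify, and the function name `z_scores` and the values are hypothetical. Images with a single cob have no standard deviation and would need separate handling.

```python
from statistics import mean, stdev


def z_scores(values):
    """Z-scores of cob measurements within one image: (x - mean) / sd."""
    m, s = mean(values), stdev(values)  # stdev uses the n - 1 denominator
    return [(x - m) / s for x in values]


# Hypothetical cob lengths (cm) of one image.
lengths = [10.0, 12.0, 14.0]
z = z_scores(lengths)
```

After this scaling, each cob is described relative to the other cobs in the same image, so differences in overall cob size between images no longer dominate the clustering.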

In the second approach, we used the standard deviations of individual measurements within each image (\({S}_{j}\)) as input for clustering. The standard deviations of each image were centered and standardized so that the values obtained for all images were on the same scale. This dataset was then clustered with \(k\)-means and the number of clusters, \(k\), was determined using the gap statistic [61], which compares the sum of squares within clusters to the expectation under a null reference distribution.
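
The input for the second approach can be sketched for a single trait as follows: the standard deviation is computed within each image and the resulting values are then centered and scaled across images. The function name `standardized_image_sds` and the values are hypothetical, the n - 1 denominator is an assumption, and the subsequent \(k\)-means clustering is not shown.

```python
from statistics import mean, stdev


def standardized_image_sds(measurements_per_image):
    """Within-image SD of one trait, centered and scaled across images."""
    sds = [stdev(v) for v in measurements_per_image]  # needs >= 2 cobs per image
    m, s = mean(sds), stdev(sds)
    return [(x - m) / s for x in sds]


# Hypothetical cob lengths (cm) for three images with increasing diversity.
per_image_lengths = [[10.0, 10.0, 10.0], [9.0, 11.0], [6.0, 10.0, 14.0]]
scaled = standardized_image_sds(per_image_lengths)
```

Images with uniform cobs receive low values and images with heterogeneous cobs receive high values, which is the contrast the clustering is meant to pick up.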

Availability of data and materials

Image files and annotations: Deep learning model and manuals with codes for custom detections and model updating:


\(AP@\left[.5:.95\right]\): AP@[IoU = 0.50:0.95], sometimes also called mAP

CLARA: Clustering Large Applications

RPN: Region Proposal Network


  1. Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, Corrado GS et al. TensorFlow: large-scale machine learning on heterogeneous systems; 2015.

  2. Abu Alrob I, Christiansen JL, Madsen S, Sevilla R, Ortiz R. Assessing variation in peruvian highland maize: tassel, kernel and ear descriptors. Plant Genet Resour Newsltr. 2004;137:34–41.

  3. Alkhudaydi T, Reynolds D, Griffiths S, Zhou Ji, De La Iglesia B, et al. An exploration of deep-learning based phenotypic analysis to detect spike regions in field conditions for UK bread wheat. Plant Phenomics. 2019;2019:7368761.

  4. Araus JL, Cairns JE. Field high-throughput phenotyping: the new crop breeding frontier. Trends Plant Sci. 2014;19(1):52–61.

  5. Argüeso D, Picon A, Irusta U, Medela A, San-Emeterio MG, Bereciartua A, Alvarez-Gila A. Few-shot learning approach for plant disease classification using images taken in the field. Comput Electron Agric. 2020;175:105542.

  6. Baker DE. A geometric method for determining shape of bird eggs. Auk. 2002;119(4):1179–86.

  7. Bradski G. The OpenCV Library. Dr. Dobb’s Journal of Software Tools; 2000.

  8. Campos H, Caligari PDS. Genetic improvement of tropical crops. Berlin: Springer; 2017.

  9. Cap QH, Suwa K, Fujita E, Uga H, Kagiwada S, Iyatomi H. An End-to-end practical plant disease diagnosis system for wide-angle cucumber images. Int J Eng Technol. 2018;7(4.11):106–11.

  10. Chollet F et al. Keras; 2015.

  11. Czedik-Eysenberg A, Seitner S, Güldener U, Koemeda S, Jez J, Colombini M, Djamei A. The ‘PhenoBox’, a flexible, automated, open-source plant phenotyping solution. New Phytol. 2018;219(2):808–23.

  12. Dias PA, Shen Z, Tabb A, Medeiros H. FreeLabel: a publicly available annotation tool based on freehand traces. arXiv:1902.06806 [cs], February; 2019.

  13. Dutta A, Zisserman A. The VIA annotation software for images, audio and video. In: Proceedings of the 27th ACM international conference on multimedia. MM ’19. New York, NY, USA: ACM; 2019.

  14. Everingham M, Van Gool L, Williams CKI, Winn J, Zisserman A. The pascal visual object classes (VOC) challenge. Int J Comput Vision. 2010;88(2):303–38.

  15. Felzenszwalb PF, Huttenlocher DP. Efficient graph-based image segmentation. Int J Comput Vision. 2004;59(2):167–81.

  16. Fuentes A, Yoon S, Kim S, Park D. A robust deep-learning-based detector for real-time tomato plant diseases and pests recognition. Sensors. 2017;17(9):2022.

  17. Furbank RT, Tester M. Phenomics–technologies to relieve the phenotyping bottleneck. Trends Plant Sci. 2011;16(12):635–44.

  18. Ganesh P, Volle K, Burks TF, Mehta SS. Deep orange: Mask R-CNN based orange detection and segmentation. IFAC-PapersOnLine. 2019;52(30):70–5.

  19. Gehan MA, Fahlgren N, Abbasi A, Berry JC, Callen ST, Chavez L, Doust AN, et al. PlantCV V2: image analysis software for high-throughput plant phenotyping. PeerJ. 2017;5(December):e4088.

  20. Girshick R. Fast R-CNN. In: Proceedings of the IEEE international conference on computer vision; 2015, p. 1440–48.

  21. Girshick R, Donahue J, Darrell T, Malik J. Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2014, p. 580–87.

  22. Granier C, Vile D. Phenotyping and beyond: modelling the relationships between traits. Curr Opin Plant Biol. 2014;18:96–102.

  23. Grobman A. Races of maize in Peru: their origins, evolution and classification. Vol. 915. National Academies; 1961.

  24. Großkinsky DK, Svensgaard J, Christensen S, Roitsch T. Plant phenomics and the need for physiological phenotyping across scales to narrow the genotype-to-phenotype knowledge gap. J Exp Bot. 2015;66(18):5429–40.

  25. He K, Gkioxari G, Dollár P, Girshick R. Mask R-CNN. In: Proceedings of the IEEE international conference on computer vision; 2017, p. 2961–69.

  26. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2016, p. 770–78.

  27. Heerwaarden J van, Hufford MB, Ross-Ibarra J. Historical genomics of North American maize. In: Proceedings of the National Academy of Sciences, July; 2012, p. 201209275.

  28. Houle D, Govindaraju DR, Omholt S. Phenomics: the next challenge. Nat Rev Genet. 2010;11(12):855–66.

  29. Jaccard P. Étude Comparative de La Distribution Florale Dans Une Portion Des Alpes Et Des Jura. Bull Soc Vaudoise Sci Nat. 1901;37:547–79.

  30. Jeong YS, Lee HR, Baek JH, Kim KH, Chung YS, Lee CW. Deep learning-based rice seed segmentation for phenotyping. J Korea Ind Inform Syst Res. 2020;25(5):23–9.

  31. Jiang Y, Li C, Xu R, Sun S, Robertson JS, Paterson AH. DeepFlower: a deep learning-based approach to characterize flowering patterns of cotton plants in the field. Plant Methods. 2020;16(1):156.

  32. Jin X, Zarco-Tejada PJ, Schmidhalter U, Reynolds MP, Hawkesford MJ, Varshney RK, Yang T, et al. High-throughput estimation of crop traits: a review of ground and aerial phenotyping platforms. IEEE Geosci Remote Sens Mag. 2020;9(1):200–31.

  33. Kassambara A, Mundt F. Factoextra: extract and visualize the results of multivariate data analyses. R Package Version. 2020;1:7.

  34. Kistler L, Maezumi SY, Gregorio de Souza J, Przelomska NAS, Costa FM, Smith O, Loiselle H, et al. Multiproxy evidence highlights a complex evolutionary legacy of maize in South America. Science. 2018;362(6420):1309–13.

  35. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. Adv Neural Inf Process Syst. 2012;25:1097–105.

  36. LeCun Y, Jackel LD, Boser B, Denker JS, Graf HP, Guyon I, Henderson D, Howard RE, Hubbard W. Handwritten digit recognition: applications of neural network chips and automatic learning. IEEE Commun Mag. 1989;27(11):41–6.

  37. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521(7553):436–44.

  38. Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL. Microsoft COCO: common objects in context. In: European conference on computer vision; 2014, p. 740–55. Springer.

  39. Maechler M, Rousseeuw P, Struyf A, Hubert M, Hornik K. Cluster: cluster analysis basics and extensions; 2019.

  40. Makanza R, Zaman-Allah M, Cairns JE, Eyre J, Burgueño J, Pacheco Á, Diepenbrock C, et al. High-throughput method for ear phenotyping and kernel weight estimation in maize using ear digital imaging. Plant Methods. 2018;14(1):49.

  41. Mascher M, Schreiber M, Scholz U, Graner A, Reif JC, Stein N. Genebank genomics bridges the gap between the conservation of crop diversity and plant breeding. Nat Genet. 2019;51(7):1076–81.

  42. Matsuoka Y, Vigouroux Y, Goodman MM, Sanchez J, Buckler E, Doebley J. A single domestication for maize shown by multilocus microsatellite genotyping. Proc Natl Acad Sci. 2002;99(9):6080–4.

  43. Messmer R, Fracheboud Y, Bänziger M, Vargas M, Stamp P, Ribaut J-M. Drought stress and tropical maize: QTL-by-environment interactions and stability of QTLs across environments for yield components and secondary traits. Theor Appl Genet. 2009;119(5):913–30.

  44. Metrics of COCO Dataset. n.d.

  45. Miller ND, Haase NJ, Lee J, Kaeppler SM, de Leon N, Spalding EP. A robust, high-throughput method for computing maize ear, cob, and kernel attributes automatically from images. Plant J. 2017;89(1):169–78.

  46. Mir RR, Reynolds M, Pinto F, Khan MA, Bhat MA. High-throughput phenotyping for crop improvement in the genomics era. Plant Sci. 2019;282:60–72.

  47. Nguyen GN, Norton SL. Genebank phenomics: a strategic approach to enhance value and utilization of crop germplasm. Plants. 2020;9(7):817.

  48. O’Mahony N, Campbell S, Carvalho A, Harapanahalli S, Hernandez GV, Krpalkova L, Riordan D, Walsh J. Deep learning vs. traditional computer vision. In: Science and information conference, p. 128–44. Springer; 2019.

  49. Ortiz R, Crossa J, Franco J, Sevilla R, Burgueño J. Classification of Peruvian highland maize races using plant traits. Genet Resour Crop Evol. 2008;55(1):151–62.

  50. Ortiz R, Crossa J, Sevilla R. Minimum resources for phenotyping morphological traits of maize (Zea mays L.) genetic resources. Plant Genet Resour. 2008;6(3):195–200.

  51. Ortiz R, Taba S, Tovar VH, Mezzalama M, Xu Y, Yan J, Crouch JH. Conserving and enhancing maize genetic resources as global public goods—a perspective from CIMMYT. Crop Sci. 2010;50(1):13–28.

  52. Ortiz R, Sevilla R. Quantitative descriptors for classification and characterization of highland Peruvian maize. Plant Genet Resour Newsl. 1997;110:49–52.

  53. Peng B, Li Y, Wang Y, Liu C, Liu Z, Tan W, Zhang Y, et al. QTL analysis for yield components and kernel-related traits in maize across multi-environments. Theor Appl Genet. 2011;122(7):1305–20.

  54. R Core Team. R: a language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2020.

  55. Ren S, He K, Girshick R, Sun J. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell. 2016;39(6):1137–49.

  56. Romero Navarro JA, Willcox M, Burgueño J, Romay C, Swarts K, Trachsel S, Preciado E, et al. A study of allelic diversity underlying flowering-time adaptation in maize landraces. Nat Genet. 2017;49(3):476–80.

  57. Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, et al. ImageNet large scale visual recognition challenge. Int J Comput Vision. 2015;115(3):211–52.

  58. Stoddard MC, Yong EH, Akkaynak D, Sheard C, Tobias JA, Mahadevan L. Avian egg shape: form, function, and evolution. Science. 2017;356(6344):1249–54.

  59. Su WH, Zhang J, Yang C, Page R, Szinyei T, Hirsch CD, Steffenson BJ. Automatic evaluation of wheat resistance to Fusarium head blight using dual Mask-RCNN deep learning frameworks in computer vision. Remote Sens. 2021;13(1):26.

  60. Tardieu F, Cabrera-Bosquet L, Pridmore T, Bennett M. Plant phenomics, from sensors to knowledge. Curr Biol. 2017;27(15):R770–83.

  61. Tibshirani R, Walther G, Hastie T. Estimating the number of clusters in a data set via the gap statistic. J R Stat Soc Ser B. 2001;63(2):411–23.

  62. Ubbens J, Cieslak M, Prusinkiewicz P, Stavness I. The use of plant models in deep learning: an application to leaf counting in rosette plants. Plant Methods. 2018;14(1):6.

  63. Van Rossum G, Drake FL. Python 3 reference manual. Scotts Valley: CreateSpace; 2009.

  64. Voulodimos A, Doulamis N, Doulamis A, Protopapadakis E. Deep learning for computer vision: a brief review. Comput Intell Neurosci. 2018.

  65. Wallace JG, Rodgers-Melnick E, Buckler ES. On the road to breeding 4.0: unraveling the good, the bad, and the boring of crop quantitative genomics. Annu Rev Genet. 2018;52(1):421–44.

  66. van der Walt S, Schönberger JL, Nunez-Iglesias J, Boulogne F, Warner JD, Yager N, Gouillart E, Yu T. Scikit-image: image processing in Python. PeerJ. 2014;2(June):e453.

  67. Wang Y, Yao Q, Kwok JT, Ni LM. Generalizing from a few examples: a survey on few-shot learning. ACM Comput Surv. 2020;53(3):1–34.

  68. Warman C, Fowler JE. Custom built scanner and simple image processing pipeline enables low-cost, high-throughput phenotyping of maize ears. bioRxiv 2019;780650.

  69. Wilkes G. Corn, strange and marvelous: but is a definitive origin known. In: Smith CW, Betran J, Runge ECA, editors. Corn: origin, history, technology, and production. Hoboken: Wiley; 2004. p. 3–63.

  70. Xu H, Bassel GW. Linking genes to shape in plants using morphometrics. Annu Rev Genet. 2020;54(1):417–37.

  71. Yang S, Zheng L, He P, Wu T, Sun S, Wang M. High-throughput soybean seeds phenotyping with convolutional neural networks and transfer learning. Plant Methods. 2021;17(1):1–17.

  72. Yu Y, Zhang K, Yang L, Zhang D. Fruit detection for strawberry harvesting robot in non-structural environment based on Mask-RCNN. Comput Electron Agric. 2019;163:104846.

  73. Zhao T, Yang Y, Niu H, Wang D, Chen Y. Comparing U-Net convolutional network with Mask R-CNN in the performances of pomegranate tree canopy segmentation. In: Multispectral, hyperspectral, and ultraspectral remote sensing technology, techniques and applications VII, 10780:107801J. International Society for Optics and Photonics; 2018.


We are grateful to Gilberto Garcia for scanning and photographing the maize genebank accessions at UNALM, Emilia Koch for annotating the images, and Hans-Peter Piepho for statistical advice.


Open Access funding enabled and organized by Projekt DEAL. This work was funded by the Gips Schüle Foundation Award to K.S. and by the KWS SEED SE Capacity Development Projekt Peru grant to R.B. and K.S. We acknowledge support by the High Performance and Cloud Computing Group at the Zentrum für Datenverarbeitung of the University of Tübingen, the state of Baden-Württemberg through bwHPC, and the German Research Foundation (DFG) through Grant No INST 37/935–1 FUGG.

Author information

LK and KS designed the study. LK performed the image analysis, implemented Felzenszwalb-Huttenlocher segmentation, Window-CNN and Mask R-CNN on the datasets, developed the model updating and carried out the statistical analyses. MCA conducted the multivariate analysis of phenotypic cob data. RB coordinated and designed the acquisition of the maize photographs. LK and KS wrote the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Karl Schmid.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1.

Additional Tables and Figures.

Additional file 2.

Additional Text.

Rights and permissions

Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Kienbaum, L., Correa Abondano, M., Blas, R. et al. DeepCob: precise and high-throughput analysis of maize cob geometry using deep learning with an application in genebank phenomics. Plant Methods 17, 91 (2021).
