Deep learning based high-throughput phenotyping of chalkiness in rice exposed to high night temperature

Background
Rice is a major staple food crop for more than half the world's population. As the global population is expected to reach 9.7 billion by 2050, increasing the production of high-quality rice is needed to meet the anticipated demand. However, global environmental changes, especially increasing temperatures, can affect grain yield and quality. Heat stress is one of the major causes of an increased proportion of chalkiness in rice, which compromises quality and reduces market value. Researchers have identified 140 quantitative trait loci linked to chalkiness, mapped across the 12 chromosomes of the rice genome. However, the available genetic information acquired by employing advances in genetics has not been adequately exploited due to the lack of a reliable, rapid and high-throughput phenotyping tool to capture chalkiness. To derive extensive benefit from the genetic progress achieved, tools that facilitate high-throughput phenotyping of rice chalkiness are needed.

Results
We use a fully automated approach based on convolutional neural networks (CNNs) and Gradient-weighted Class Activation Mapping (Grad-CAM) to detect chalkiness in rice grain images. Specifically, we train a CNN model to distinguish between chalky and non-chalky grains and subsequently use Grad-CAM to identify the area of a grain that is indicative of the chalky class. The area identified by the Grad-CAM approach takes the form of a smooth heatmap that can be used to quantify the degree of chalkiness. Experimental results on both polished and unpolished rice grains, using standard instance classification and segmentation metrics, show that Grad-CAM can accurately identify chalky grains and detect the chalkiness area.

Conclusions
We have successfully demonstrated the application of a Grad-CAM based tool to accurately capture high night temperature induced chalkiness in rice. The trained models will be made publicly available. They are easy to use, scalable, and can be readily incorporated into ongoing rice breeding programs without requiring computer science or machine learning expertise from rice researchers.

Supplementary Information
The online version contains supplementary material available at 10.1186/s13007-022-00839-5.

With the global population expected to reach 9.7 billion by 2050 [3], agricultural production must be doubled to meet this demand [4]. As of 2008, rice yields have been increasing on average by 1% annually; at this rate, production will increase by only 42% by 2050, which falls well short of the desired target [5].
In addition to the required increase in production, climate variability threatens future rice grain yields and quality attributes [6,7]. Temperatures above 33 °C during anthesis can cause significant spikelet sterility [8][9][10][11]. It is predicted that approximately 16% of the global harvested area of rice will be exposed to at least 5 days of elevated temperature during the reproductive period by the 2030s [12]. In addition to yield losses, heat stress during the grain-filling period has been shown to increase grain chalkiness in rice [13][14][15]. Disaggregating the mean increase in global temperature has revealed a more rapid increase in the average minimum night temperature than in the average maximum day temperature [16]. High night temperature stress during the grain-filling period can lead to severe yield and quality penalties, primarily driven by increased night respiration [17][18][19]. An increased rate of night respiration during grain-filling ultimately impairs grain yield and quality through reductions in 1000-grain weight and grain width, reduced sink strength with lowered sucrose and starch synthase activity (resulting in reduced grain starch content), and an increase in rice chalkiness [13][19][20][21].
Chalkiness is the opaque part of the milled rice grain and is one of the key factors that determine rice grain quality. More specifically, chalkiness is the visual appearance of loosely packed starch granules [13,22]. The poor packing of starch granules leads to an increased number of oversized air pockets within the grain. These air pockets prevent the reflection of light, giving the chalky portions of the grains an opaque appearance [23]. Chalkiness is an undesirable trait, and an increased proportion of chalk leads to a linear decrease in the market value of rice [15]. In addition, high levels of chalk lead to increased breakage during milling, degrade cooking properties, and lower palatability [14,15,22,24].
Three different processes have been considered to explain the cause of increased chalkiness under heat stress: (1) a reduction in carbon capture, photosynthetic efficiency, or the duration of the grain-filling period inhibits the plant's ability to provide a sufficient amount of assimilates to the seed; (2) reduced activity of the starch metabolism enzymes that convert sugars to starch; and (3) a hormonal imbalance between ABA and ethylene, as a high ABA-to-ethylene ratio is vital during grain-filling [25]. Physiologically, the level of chalkiness depends on source-sink relationships, with the primary tillers in rice having a greater advantage in accessing the carbon pool compared to later-formed tillers. We tested the hypothesis that, under higher night temperatures, increased carbon loss due to higher respiration would lead to different levels of grain chalkiness among the tillers, with the least chalkiness in panicles from primary tillers and the highest chalkiness in the later-formed tillers. Regardless of the cause or the differential chalkiness among tillers, the ability to quickly and accurately identify and quantify chalkiness in rice is extremely important, not only to help understand the cause of chalkiness, but also to breed for heat-tolerant, nutritious rice varieties [19][26][27][28].
Traditional grain phenotyping has been performed by manual inspection [29]. As such, it is subjective, inefficient, tedious, and error-prone despite the fact that it is performed by a highly skilled workforce [30]. Over the past decade, interest has grown in applying image-based phenotyping to provide quantitative measurements of plant-environment interactions with a higher accuracy and lower labor-cost than previously possible [31].
In particular, several automated approaches for rice grain chalkiness classification, segmentation and/or quantification have been developed. For example, the K-means clustering approach performs instance segmentation (i.e., identifies the pixels that belong to each instance of an object of interest, in our case "chalkiness") by grouping pixels based on their values [32]. One advantage of the K-means clustering approach is that it works in an unsupervised manner and does not require manually labeled ground truth [33]. However, one disadvantage is that it involves extensive parameter tuning to identify good clusters corresponding to objects of interest in an image. Furthermore, the final clusters depend on the initial centroids and the algorithm needs to be run several times with different initial centroids to achieve good results [34].
In addition to the K-means clustering approach, threshold-based approaches have been used for chalkiness identification and quantification. For example, a multi-threshold approach based on maximum entropy was used for chalky area calculation [35], and another threshold-based approach was used to detect broken, chalky and spotted rice grains [36]. However, such approaches need extensive fine-tuning to identify the right thresholds and are not easily transferable to seeds of different types or to images taken under different conditions. Support vector machine (SVM) approaches have been used to classify grains according to the location of the chalkiness [37], and to estimate rice quality by detecting broken, chalky, damaged and spotted grains in red rice based on infrared images [38]. Similar to the threshold-based approaches, the SVM classifiers are not easily transferable to images containing different types of seeds or taken under different illumination conditions. Furthermore, they require informative image features to be identified and provided as inputs to produce accurate results. Rice chalkiness has also been addressed using specially designed imaging instruments. For example, Armstrong et al. used a single-kernel near-infrared (SKNIR) tube instrument and a silicon-based light-emitting diode (SiLED) high-speed sorter to classify single rice grains based on the percentage of chalkiness [39]. Unfortunately, the single-kernel approach is limited in scope and cannot be used to develop a high-throughput phenotyping method. More recently, volume-based quantification technologies, such as X-ray microcomputed tomography, have been used to quantify rice chalkiness [27]. However, such technologies are extremely expensive and, thus, beyond the reach of routine crop improvement programs, as well as of traders and millers who regularly estimate chalkiness and establish a fair market price.
In recent years, the use of deep learning approaches for image classification and segmentation tasks in crop science has led to state-of-the-art high-throughput tools that outperform traditional machine learning and image analysis techniques [40,41], enabling researchers to capture a wide range of genetic diversity [42]. To the best of our knowledge, deep learning approaches have not been used to detect chalkiness, despite being used to address other challenging problems in crop science. To address this gap, we investigated modern deep learning techniques to create a tool that facilitates high-throughput phenotyping of rice chalkiness, to support genetic mapping studies and enable the development of rice varieties with minimal chalkiness under current and future warming scenarios. One possible solution for rapidly and accurately phenotyping chalkiness is provided by Mask R-CNN [43], a widely used instance detection and segmentation approach that employs a convolutional neural network (CNN) as its backbone architecture. One limitation of the Mask R-CNN approach is that it requires pixel-level ground truth with respect to the concept of interest, in our case chalkiness. Acquiring pixel-level ground truth is laborious and expensive [44]. Furthermore, the Mask R-CNN segmentation approach labels the pixels of a rice grain as chalky or non-chalky, while sometimes it may be preferable to characterize the pixels based on chalkiness intensity, i.e., on a continuous scale as opposed to a binary scale.
To address the limitations of the Mask R-CNN approach, we framed the problem of detecting chalkiness as a binary classification problem (i.e., a grain is chalky or non-chalky) and used CNNs combined with class activation mapping, specifically Grad-CAM [45], to identify the chalkiness area in an image. Grad-CAM works on top of a CNN model for image classification. It makes use of the gradients of a target category to produce a heatmap that identifies the discriminative regions for the target category (i.e., regions that explain the CNN model prediction) and implicitly localizes the category in the input image. By framing the problem as an image classification task, Grad-CAM can help reduce the laborious pixel-level labeling task to a relatively simpler image labeling task, i.e., an image is labeled as chalky or non-chalky. Furthermore, the heatmaps produced by Grad-CAM have soft boundaries showing different degrees of chalkiness intensity. The values of the pixels in a heatmap can be used to calculate a chalkiness intensity score corresponding to an image. This weakly supervised approach to segmentation was originally proposed by Oquab et al. [46] and has been used in other application domains [47][48][49][50][51], including in the agricultural domain for segmentation of citrus pests [52] and for remote sensing imagery [53], among others. Such approaches are generally called weakly supervised semantic segmentation approaches, given that they only require image-level labels as opposed to pixel-level labels.
The Grad-CAM based approach to rice chalkiness detection has the potential to help rice phenomics catch up with the developments in rice genomics [54], as well as to help implement new advances toward achieving the target of nutritious food production by 2050 [55]. To summarize, the contributions of this research are:

• We proposed to use a weakly supervised approach, Grad-CAM, to classify rice grains as chalky or non-chalky and subsequently detect the chalkiness area in chalky grains.
• We experimented with the Grad-CAM approach (with a variety of CNN networks as backbones) on polished rice seeds and evaluated the performance using both instance classification and segmentation metrics, as well as time and memory requirements.
• We compared the weakly supervised Grad-CAM approach with the Mask R-CNN segmentation approach on polished seeds and studied its transferability to unpolished rice seeds (i.e., rice seeds that have not been polished after the removal of the husk).
• We tested the applicability of the tool in determining the level of chalkiness in rice plants exposed to high night temperature (HNT) and quantified the differential level of chalkiness among tillers within a plant exposed to HNT stress.

Deep learning methods for rice chalkiness segmentation
We address the rice chalkiness segmentation problem using a weakly supervised Grad-CAM approach, which requires binary (chalky or non-chalky) image-level labels as opposed to more expensive pixel-level labels.

Overview of the approach
The Grad-CAM approach includes two main components: (i) a deep CNN network (e.g., VGG or ResNet) that is trained to classify seed images into two classes, chalky or non-chalky; and (ii) a class activation mapping component, which generates a rice chalkiness heatmap as a weighted average of the feature maps corresponding to a specific layer in the CNN network. The chalkiness heatmap can be further used to calculate a chalkiness score, which quantifies the degree of chalkiness in each individual grain, and to estimate the chalkiness area for each grain. An overview of the approach is shown in Fig. 1. Details for the components of the model are provided below.

CNNs
Models based on CNNs have been successfully used for many image classification and segmentation tasks [56][57][58]. A CNN consists of convolutional layers (which apply filters to produce feature maps), followed by nonlinear activations (such as Rectified Linear Unit, or ReLU), pooling layers (used to reduce the dimensionality), and fully connected layers (that capture non-linear dependencies between features). The last fully connected layer in a classification network generally uses a softmax activation function and has as many output neurons as the number of target classes (in our case, two classes: chalky and non-chalky). The ImageNet competition (where a dataset with 1.2 million images in 1000 categories was provided to participants) has led to many popular architectures, including highly competitive architectures in terms of performance as well as cost-effective architectures designed to run efficiently on low-cost platforms generally present in embedded systems [59]. We anticipate that our rice chalkiness detection models could be useful both in environments with rich computational resources and in environments with more limited resources. Thus, given the trade-off between model performance (i.e., accuracy) and model complexity (e.g., number of parameters, memory and time requirements), we consider a variety of networks (and variants) published between 2012 and 2019, including AlexNet [60], Very Deep Convolutional Networks (VGG) [61], Deep Residual Networks (ResNet) [62], SqueezeNet [63], Densely Connected Convolutional Networks (DenseNet) [64], and EfficientNet [65].

(Figure caption: a ResNet-101 has four main groups of convolution layers, shown as Layer1, Layer2, Layer3, and Layer4, consisting of 3, 4, 23 and 3 bottleneck blocks, respectively. b Each bottleneck block starts and ends with a 1 × 1 convolution layer and has a 3 × 3 layer in the middle; the number of filters in each layer is shown after the kernel dimension. c Grad-CAM uses the gradients of the chalky category to compute a weight for each feature map in a convolution layer. The weighted average of the feature maps, transformed using the ReLU activation, is used as the heatmap for the current image at inference time.)

Grad-CAM approach
The Grad-CAM approach was originally proposed by Selvaraju et al. [66] in the context of understanding the predictions of a CNN model. In recent years, this approach and its variants have been frequently used for weakly supervised object localization [67]. Given a trained CNN model and an input image at inference time, the Grad-CAM approach uses the gradients of a category of interest (specifically, the corresponding logit provided as input to the softmax function) to compute a category-specific weight for each feature map in a selected convolution layer. Formally, let $f^k$ (with $k = 1, \ldots, K$) be a feature map in a particular convolutional layer, which consists of a total of K feature maps. Let $y^c$ be the logit (i.e., input to the softmax function) of the class of interest, c (e.g., chalky). Grad-CAM averages the gradients of $y^c$ with respect to all N pixels $f^k_{ij}$ of the feature map $f^k$ to calculate a weight $w^c_k$ representing the importance of the feature map $f^k$. Specifically,

$$w_k^c = \frac{1}{N} \sum_{i,j} \frac{\partial y^c}{\partial f_{ij}^k}$$

The feature maps $f^k$ of the selected convolutional layer are averaged into one final heatmap for the category of interest, c, according to the importance weights $w_k^c$, i.e., $H^c = F\left(\sum_k w_k^c f^k\right)$, where F is a non-linear activation function. Traditionally, ReLU has been used as the activation function to cancel the effect of the negative values while emphasizing areas that contribute positively to the category c. The heatmap, $H^c$, is resized to the original input size using linear interpolation. The resized heatmap, $H^c_{final}$, can be used to identify the discriminative regions for the category of interest, c, and implicitly to localize the category in the input image. More specifically, the localization is obtained by binarizing the heatmap using a threshold T on its pixel values, as first proposed by Zhou et al. [68].
The value of the threshold T depends on the data and the task at hand, but it can be found through a trial-and-error process, as shown in related prior works [69][70][71][72]. Regarding the convolutional layer used to produce the heatmap, the last layer was originally used by Selvaraju et al. [66], under the assumption that the last layer captures the best trade-off between high-level semantic features and spatial information. However, in this study, we experimented with a variety of convolutional layers, from lower-level convolutional layers (more general) to higher-level convolutional layers (more specific), to identify the sets of feature maps that best capture chalkiness.
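The weight and heatmap computations above can be sketched in NumPy, assuming the feature maps and the gradients of the chalky logit have already been extracted from the trained network (e.g., via framework hooks); the function and variable names are illustrative, not part of the authors' tool.

```python
import numpy as np

def grad_cam_heatmap(feature_maps, gradients):
    """feature_maps, gradients: arrays of shape (K, H, W) holding the K
    feature maps of the chosen layer and the gradients of the chalky
    logit y^c with respect to each of their N = H*W pixels."""
    # w_k^c: average the gradients over the N pixels of each feature map.
    weights = gradients.mean(axis=(1, 2))                        # shape (K,)
    # Weighted average of the feature maps, then ReLU to keep only the
    # regions that contribute positively to the chalky class.
    heatmap = np.maximum((weights[:, None, None] * feature_maps).sum(axis=0), 0.0)
    # Normalize to [0, 1] so a single threshold T can binarize it.
    if heatmap.max() > 0:
        heatmap = heatmap / heatmap.max()
    return heatmap

def binarize(heatmap, T=0.5):
    """Localization mask obtained by thresholding the heatmap at T."""
    return heatmap >= T
```

In a full pipeline, the returned heatmap would then be resized to the input image size with linear interpolation, as described above.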

Variants of the Grad-CAM approach
Grad-CAM is a strong candidate for use as an explainability/localization approach in the weakly supervised framework, as evidenced by many recent studies that have employed Grad-CAM [48,52,[73][74][75][76]. However, several extensions and variants of the Grad-CAM approach have been proposed, e.g., Grad-CAM++ [77] and Score-CAM [78]. Grad-CAM++ is a direct extension of Grad-CAM designed to address two limitations: (1) Grad-CAM does not properly identify/localize all occurrences of a class object; and (2) Grad-CAM may not always localize the whole object instance, but only parts of it. Score-CAM is an alternative class activation mapping approach that aims to avoid the use of gradients altogether, as gradients can be noisy or may vanish when dealing with deep neural networks. Instead of using gradients to weight the activation maps, Score-CAM uses weights that correspond to the forward-pass scores, on the target class, of the original images perturbed with upsampled activation maps. We compared Grad-CAM with the Grad-CAM++ and Score-CAM variants to understand whether a newer approach can further improve the results obtained using Grad-CAM.

Application of Grad-CAM to rice chalkiness detection
We used the Grad-CAM approach to generate chalkiness heatmaps for rice grain images. The heatmaps show the concept of chalkiness using soft boundaries through a color gradient. This representation is very appropriate for localizing the concept of chalkiness, which exhibits different levels of intensity and, thus, has inherently soft boundaries that separate the chalky area from the non-chalky area. The heatmap, $H^{chalky}_{final}$, corresponding to a particular convolutional layer (determined using validation data) is the final rice chalkiness heatmap and can be used to visualize the area of a seed that is discriminative with respect to chalkiness. This heatmap can further be converted into a chalkiness score corresponding to a rice grain as follows:

$$\text{ChalkinessScore} = \frac{1}{Z} \sum_{(i,j) \in \text{GrainArea}} H^{chalky}_{final}(i,j)$$

where Z represents the total number of pixels in the GrainArea in the original image. The resulting chalkiness score has a numerical value between 0 and 1, where 0 means that the grain shows no chalkiness and 1 means that the grain has severe chalkiness all over its surface. Finally, the heatmap is used to create a binary mask for the chalkiness area using a threshold on the intensity of the pixels (determined based on validation data). The masked area can be used to estimate the area of the chalkiness as a percentage of the total grain area. The numeric scores, including the chalkiness score and the chalkiness area, obtained from large mapping populations can be used in determining the genetic control of chalkiness in rice.
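The score and area computations described above can be sketched as follows, assuming the heatmap (with values in [0, 1]) has already been resized to the original grain image and a boolean `grain_mask` marks the GrainArea; the names are illustrative.

```python
import numpy as np

def chalkiness_score(heatmap, grain_mask):
    """Average heatmap intensity over the grain pixels.
    heatmap: float array with values in [0, 1], resized to the image.
    grain_mask: boolean array marking the Z pixels of GrainArea."""
    Z = grain_mask.sum()
    return float(heatmap[grain_mask].sum() / Z)

def chalkiness_area_percent(heatmap, grain_mask, T=0.5):
    """Percentage of the grain covered by the thresholded chalky mask."""
    chalky = (heatmap >= T) & grain_mask
    return 100.0 * chalky.sum() / grain_mask.sum()
```

A grain with a uniformly zero heatmap scores 0, and a grain whose heatmap is 1 everywhere on the grain scores 1, matching the 0-to-1 range described above.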

Baseline approach-mask R-CNN
Mask R-CNN is an object instance segmentation approach, i.e., an approach that identifies instances of given objects in an image (in our case, the chalkiness concept) and labels their pixels accordingly. Mask R-CNN extends an object detection approach, specifically Faster R-CNN [79], to perform instance segmentation. The Faster R-CNN network first identifies Regions of Interest (ROI, i.e., regions that may contain objects of interest) and their locations (represented as bounding box coordinates) using a Region Proposal Network (RPN). Subsequently, the Faster R-CNN network classifies the identified regions (corresponding to objects) into different classes (e.g., chalkiness and background) and also refines the location parameters to generate an accurate bounding box for each detected object. In addition to the object classification and the bounding box regression components of the Faster R-CNN, the Mask R-CNN network has a component for predicting instance masks for ROIs (i.e., identifying all pixels that belong to an object of interest). One advantage of the Mask R-CNN approach is that it is specifically trained to perform instance segmentation and, thus, produces a precise mask for objects of interest. The main disadvantage of the Mask R-CNN baseline, as compared to the weakly supervised Grad-CAM approach, is that it requires expensive pixel-level annotation for training. We compared the weakly supervised Grad-CAM approach to chalkiness segmentation with Mask R-CNN in terms of performance and also time and memory requirements. We have selected the Mask R-CNN approach as a strong baseline for the weakly supervised Grad-CAM approach, given that Mask R-CNN has been shown to be a very competitive approach for instance segmentation in many application domains [80][81][82][83][84][85].

High night temperature stress experiment
In this section, we describe plant materials and the biological experiment that generated the data (i.e., rice grains) used in this study.

Plant materials
Six genotypes (CO-39, IR-22, IR1561, Oryzica, WAS-174, and Kati) with contrasting chlorophyll index responses to a 14-day drought stress initiated at the agronomic panicle-initiation stage were used in this study [86]. The experiment was carried out in controlled environment chambers (Conviron Model CMP 3244, Winnipeg, MB) at the Department of Agronomy, Kansas State University, Manhattan, KS, USA.

Crop husbandry and high night temperature stress imposition
Seeds obtained from the Germplasm Resources Information Network (GRIN) database were sown at a depth of 2 cm in pots (1.6-L, 24 cm tall and 10 cm diameter at the top, MT49 Mini-Treepot) filled with farm soil. Seedlings were thinned to one per pot at the three-leaf stage. Controlled-release Osmocote (Scotts, Marysville, OH, USA) fertilizer (19% N, 6% P2O5, and 12% K2O) was applied (5 g per pot) before sowing, along with 0.5 g of Scotts Micromax micronutrient (Hummert International, Topeka, KS) at the three-leaf stage. The plants were well-watered throughout the experiment and a 1-cm water layer was maintained in the trays holding the pots. Seventy-two plants were grown, with at least 12 plants per genotype, wherein 6 plants were used for control and the remainder for HNT. Plants were grown in controlled environment chambers maintained at control temperatures of 30/21 °C (maximum day/minimum night temperatures; actual inside the chamber: 32.6 °C [SD±1.0]/21.1 °C [SD±0.3]) and relative humidity (RH) of 70% until treatment imposition. Both control and HNT chambers were maintained at a photoperiod of 11/13 h (light/dark; lights were turned on from 0700 to 1800 h, with a dark period from 1800 to 0700 h) with a light intensity of 850 µmol m−2 s−1 above the crop canopy. Temperature and RH were recorded every 15 min using HOBO UX100-011 temperature/RH data loggers (Onset Computer Corp., Bourne, Massachusetts) in all growth chambers. At the onset of the first spikelet opening, the main tiller, primary tillers and other tillers of the flowering genotype were tagged and readied for treatment imposition. The same approach was followed for all six genotypes and replicates. Tagged replicate plants were moved to HNT (30/28 °C) chambers, and equal numbers of plants were similarly tagged and maintained in control conditions. Six independent plants for each genotype were subjected to HNT stress (30/28 °C day/night temperatures; actual: 31.

Data collection
At physiological maturity, the plants were harvested from both the control and HNT treatments. The panicles were separated into main panicles (the panicle on the main tiller), two primary panicles (tillers that followed the main panicle), and other remaining panicles for each plant from each treatment and hand threshed separately. Subsequently, the grains were de-husked using the Kett, Automatic Rice Husker TR-250.
In addition to the unpolished grains, polished grains were also used in the initial model development and testing, as polished grains are easier to analyze and label with respect to chalkiness and could potentially be beneficial in terms of knowledge transfer to unpolished rice. The polished grains were obtained from the Rice Research and Extension Center in Stuttgart, Arkansas (University of Arkansas) for preliminary testing and to establish the model. The polished rice grains comprised both medium- and long-grain rice. For each grain size, there are three degrees of grain chalkiness (roughly estimated by a domain expert): low, medium, and high. Thus, based on grain size and degree of chalkiness, the grains were grouped into six categories: (1) long grain, low chalkiness; (2) long grain, medium chalkiness; (3) long grain, high chalkiness; (4) medium grain, low chalkiness; (5) medium grain, medium chalkiness; and (6) medium grain, high chalkiness.

Rice grain image acquisition and processing Image acquisition
Both polished and unpolished grain samples were arranged in transparent 90 mm Petri-plates, with three Petri-plates for each sample. A sample corresponds to a size/chalkiness combination in the case of polished rice and a genotype/tiller/condition combination in the case of unpolished rice. Three replicates (i.e., sets of grains to be used in one scan) were randomly selected (without replacement) for each sample. The grains were scanned using an Epson Perfection V800 photo scanner attached to a computer (see Additional file 1: Fig. S1). Images were scanned at a resolution of 800 dots per inch (dpi) and saved in the TIFF (.tif) file format for further image analysis. A total of 18 (i.e., 3 × 2 × 3) images were acquired for polished rice, and 108 (i.e., 3 × 6 × 3 × 2) images for unpolished rice. The scanned images included all borders of the three Petri-plates but not excessive blank area outside of the dishes, as shown in Additional file 2: Fig. S2.

Image preprocessing
Each scanned image (for both polished and unpolished rice grains) was approximately 6000 × 6000 pixels. This size is extremely large for deep learning approaches, which require GPU acceleration [87]. Furthermore, as we aim to perform chalkiness detection at the grain level using a weakly supervised approach, we need images that contain individual seeds. To reduce the size of the images and to enable grain-level labeling and analysis, we resorted to cropping individual grains from the original Petri-plate images (which contain approximately 25-30 grains per Petri-plate).
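The cropping step can be sketched with simple intensity thresholding and connected-component analysis, assuming grains appear brighter than the Petri-plate background; the threshold and minimum component size below are illustrative values that would need tuning to the actual 800-dpi scans.

```python
import numpy as np
from scipy import ndimage

def crop_grains(image, background_threshold=60, min_pixels=200):
    """Crop individual grains out of a large grayscale scanner image.
    Grains are assumed brighter than the background; components smaller
    than min_pixels are treated as noise/specks and discarded."""
    mask = image > background_threshold
    labels, num = ndimage.label(mask)               # connected components
    crops = []
    for i, box in enumerate(ndimage.find_objects(labels), start=1):
        if (labels[box] == i).sum() >= min_pixels:  # drop small specks
            crops.append(image[box])                # bounding-box crop
    return crops
```

Each returned crop is a small per-grain image that can then be labeled (chalky/non-chalky) and fed to the classification network.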

Image annotation and benchmark datasets Ground truth labeling
Two types of manual annotations were performed and used as ground truth in our study, as shown in Fig. 3. First, for the Grad-CAM weakly supervised approach to chalkiness segmentation, we labeled each rice grain image as chalky or non-chalky. The labeling was done based on visual inspection of the images by a domain expert. Second, to train Mask R-CNN models, which inherently perform instance segmentation, and to evaluate the ability of the Grad-CAM approach to accurately detect the chalkiness area in a rice grain, we manually marked the chalkiness area using polygons. The polygon annotation was performed by a domain expert using the VGG Image Annotator [88], a web-based manual annotation tool. Compared to the image-level labeling (i.e., chalky/non-chalky), the polygon annotation is significantly more expensive, as it requires 10 to 100 clicks to draw the polygons, given the irregular shape of the chalkiness area.
Out of 1645 polished grains used in our study, 660 grains were labeled as chalky and 985 grains were labeled as non-chalky. The exact numbers of chalky and non-chalky grains in each of the eighteen high-resolution images with polished rice are shown in Additional file 3: Table S1, in columns 6 (Chalky) and 7 (Non-chalky). To be able to evaluate segmentation performance and to compare the Grad-CAM approach with Mask R-CNN, we also labeled the 660 chalky grains in terms of chalkiness area (represented as a polygon).
Similarly, out of 13,101 unpolished grains, 4085 grains were labeled as chalky and 9016 grains were labeled as non-chalky. The exact numbers of chalky and non-chalky grains in each of the 108 high-resolution images of unpolished rice are shown in Additional file 4: Table S2, in columns 7 (Chalky) and 8 (Non-chalky). We note that many of the 36 possible genotype/tiller/condition combinations have a small number of chalky grains (or do not have any chalky grains at all). Specifically, 12 combinations corresponding to genotypes CO-39 and Kati contain 4085 chalky grains and 1299 non-chalky grains, while the remaining 24 combinations contain 151 chalky grains and 7717 non-chalky grains. Thus, we used only the 12 chalky-prevalent combinations for training, tuning and evaluating the models designed in this study. Twenty chalky grain images from each of these 12 combinations (for a total of 240 images) were used as the test set. To estimate the chalkiness segmentation performance on unpolished rice, the 240 test images were also labeled in terms of chalkiness area using polygons. We did not label all the chalky images in terms of chalkiness area due to the cost associated with this annotation. The number of images labeled as chalky and non-chalky, as well as the number of chalky images annotated in terms of chalkiness area, are summarized in Table 1.

(Fig. 3 caption: Manual annotations. a Image-level annotation: each seed is labeled as chalky or non-chalky (technically, the label was created by dragging each rice seed image into the chalky or non-chalky folder, respectively). b Specific chalkiness annotation: the chalkiness area is marked with polygons using the VGG Image Annotator (each red dot in the image represents a click). The opaque white region in panel a is the chalky portion, while the non-chalky region is translucent.)

Training, development and test datasets
To train, fine-tune and evaluate our models, we created training, development and test datasets for both polished and unpolished (12) grain images. In the case of polished grain images, for each grain size and chalkiness degree combination, we aimed to use one of the three replicates for training, another one for development and the last one for testing. However, to ensure a larger number of images in the training set (which is common practice in machine learning), we moved some of the instances from the development and test replicates/subsets to the training subset, so that the final distribution of the data split was approximately 2:1:1. In the case of unpolished seed images, for each genotype, tiller and condition combination, we used a similar procedure to split the three replicates into training, development and test subsets. The specific distribution of chalky/non-chalky grain images in the training/development/test subsets is shown in Table 2 for both polished and unpolished rice. It should be noted that our splitting process ensures that the training subset contains all types of grains considered and that no grain belongs to more than one subset. We used the training subsets to train the models (both the Grad-CAM networks for binary chalky/non-chalky classification and the Mask R-CNN networks for chalkiness segmentation). We used the development subsets to fine-tune the hyper-parameters of the models. Finally, the performance of the models was evaluated on the test subsets. The subsets are made publicly available to ensure reproducibility and to enable further progress in this area.
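The splitting procedure described above can be sketched as follows. This is a minimal illustration, not the authors' code: the replicate-based initial assignment and the approximate 2:1:1 target come from the text, while the function and variable names are our own.

```python
def split_grains(replicates, train_fraction=0.5):
    """Split three replicate lists into train/dev/test subsets.

    Initially replicate 1 -> train, replicate 2 -> dev, replicate 3 -> test;
    grains are then moved from dev/test into train until the split is
    approximately 2:1:1, so every grain belongs to exactly one subset.
    """
    train, dev, test = list(replicates[0]), list(replicates[1]), list(replicates[2])
    total = len(train) + len(dev) + len(test)
    while len(train) < train_fraction * total and (dev or test):
        donor = dev if len(dev) >= len(test) else test  # take from the larger subset
        train.append(donor.pop())
    return train, dev, test
```

Because grains are only ever moved (never copied), the resulting subsets are guaranteed to be disjoint, matching the property stated above.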

Experimental setup
In this subsection, we state several research questions that we aim to address and describe the experiments performed to answer these questions. We also discuss the metrics used to evaluate the models trained in our experiments and the hyper-parameters that were fine-tuned to obtain the most accurate models.

Research questions
We aim to answer the following research questions (RQ): RQ1 Among different CNN networks used as the backbone in the Grad-CAM models for polished rice, what network is the best overall in terms of chalky/non-chalky classification performance versus time and memory requirements? Also, what network is the best overall in terms of chalkiness segmentation? RQ2 How does the Grad-CAM weakly supervised approach to chalkiness segmentation compare with the Mask R-CNN segmentation approach to chalkiness detection in polished rice? RQ3 What is the performance of the Grad-CAM models for unpolished rice? What is the performance of the polished rice models when used to make predictions on unpolished rice? Does the performance improve if we fine-tune the polished rice models with unpolished rice?

Experiments
To answer RQ1, we trained Grad-CAM models with several CNN networks as backbone, including variants of AlexNet, DenseNet, ResNet, SqueezeNet, VGG and EfficientNet pre-trained on ImageNet. We compared the models in terms of classification performance, memory and time requirements, and identified the best model/network for each type of architecture. Subsequently, we studied the variation of those best models with respect to the layer used to generate the heatmaps and the threshold used to binarize the heatmaps when calculating the average Intersection-over-Union (IoU). The goal was to identify the best overall layer and threshold for each type of network. The best models (with the best layer and threshold) were used to evaluate the localization accuracy, both quantitatively and qualitatively, for chalkiness detection in polished rice. To answer RQ2, we also trained Mask R-CNN models (with the default ResNet-101 as backbone) and compared them with the best weakly supervised Grad-CAM approach. Finally, to answer RQ3, we first trained and evaluated a Grad-CAM model (with ResNet-101 as backbone) on unpolished rice. We compared the performance of the resulting model with the performance of a model trained on polished rice, and also with the performance of the polished rice model fine-tuned on unpolished rice.
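The core Grad-CAM computation underlying these experiments can be sketched in a few lines: given the feature maps of a chosen convolution layer and the gradients of the class score with respect to those maps, the heatmap is the ReLU of the gradient-weighted sum of the maps. The NumPy sketch below assumes the feature maps and gradients have already been extracted (e.g., via framework hooks); it is an illustration of the published Grad-CAM formula, not the authors' implementation.

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Compute a Grad-CAM heatmap.

    feature_maps: array of shape (K, H, W) -- activations A^k of one conv layer.
    gradients:    array of shape (K, H, W) -- d(class score)/d(A^k).
    Returns an (H, W) heatmap normalized to [0, 1].
    """
    # alpha_k: global-average-pool the gradients over the spatial dimensions
    weights = gradients.mean(axis=(1, 2))                  # shape (K,)
    # weighted combination of the feature maps, followed by ReLU
    cam = np.maximum((weights[:, None, None] * feature_maps).sum(axis=0), 0.0)
    if cam.max() > 0:
        cam = cam / cam.max()                              # scale to [0, 1]
    return cam
```

In the pipeline described here, the resulting heatmap would then be upsampled to the seed-image resolution and thresholded to obtain the binary chalkiness mask.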

Evaluation metrics
We evaluated the performance of the Grad-CAM approach along two main dimensions. First, we evaluated the ability of the approach to correctly classify seeds as chalky or non-chalky using standard classification metrics such as accuracy, precision, recall and F1 measure. Second, we evaluated the ability of the approach to perform chalkiness segmentation (i.e., the ability to identify the chalky area in the chalky seed images) using standard segmentation metrics. Specifically, we calculated average IoU [89], as well as localization accuracy and ground-truth-known (GT-known) localization accuracy [90] for the chalky class. Figure 4 illustrates the process of calculating the IoU between the ground truth mask for the chalkiness area and the predicted chalkiness mask. As opposed to classification accuracy, which considers a prediction to be correct if it exactly matches the ground truth label, the localization accuracy considers a prediction to be correct if both the image label and the location of the detected object are correct. For the location of the object to be correct, the object mask needs to have more than 0.5 overlap with the ground truth mask, where the overlap is measured as the IoU. In our case, we calculated the localization accuracy for the chalky class as the fraction of seed images for which the predicted mask for the chalky area has more than 50% IoU with the ground-truth mask.
We also calculated the GT-known localization accuracy, which eliminates the influence of the classification results, as it considers a prediction to be correct when the IoU between the ground truth mask and estimated mask (in our case, for the chalky class seed images) is 0.5 or more.
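As an illustration, the segmentation metrics above can be computed from binary masks as follows. This is our own minimal sketch (not the evaluation code used in the study); `samples` and the function names are ours.

```python
import numpy as np

def iou(pred_mask, gt_mask):
    """Intersection-over-Union between two binary masks."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return inter / union if union > 0 else 0.0

def localization_metrics(samples, threshold=0.5):
    """samples: list of (label_correct: bool, pred_mask, gt_mask) for chalky seeds.

    GT-known localization accuracy ignores the classification result; plain
    localization accuracy additionally requires a correct class prediction.
    Returns (average IoU, GT-known localization acc., localization acc.).
    """
    ious = [iou(p, g) for _, p, g in samples]
    gt_known = np.mean([v >= threshold for v in ious])
    loc_acc = np.mean([(v >= threshold) and ok for (ok, _, _), v in zip(samples, ious)])
    return float(np.mean(ious)), float(gt_known), float(loc_acc)
```

For example, a seed whose predicted mask matches the ground truth but whose class was predicted incorrectly counts toward GT-known localization accuracy but not toward localization accuracy.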

Hyper-parameter tuning
Deep learning models, in general, and the ResNet, VGG, SqueezeNet, DenseNet and EfficientNet networks, in particular, have many configurable hyper-parameters. We tuned several hyper-parameters known to affect the performance of all models. More specifically, we tuned the batch size used in gradient descent, which controls the number of rice seed images processed before updating the internal model weights, and the learning rate, which controls how much the network weights are adjusted with respect to the gradient of the loss function.
The specific values used to tune the batch size were 16, 32 and 64. The values used to tune the learning rate were 0.1, 0.01, 0.001, 0.0001 and 0.00001. For each network, the best combination of parameters was selected based on the F1 score observed on the validation subset. Each model was run for 200 epochs, and the best number of epochs for a model was also selected based on the validation subset. Overall, our hyper-parameter tuning process revealed that the performance did not change significantly with the parameters considered. All the models were trained on Amazon Web Services (AWS) p3.2xlarge instances. According to AWS, the configuration of the p3.2xlarge instance is as follows: 1 GPU, 8 vCPUs, 61 GiB of memory, and up to 10 Gbps network performance.
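The tuning procedure amounts to a small grid search over batch size and learning rate, selecting by development-set F1. The schematic sketch below uses the grid values stated above; `train_and_eval` is a hypothetical placeholder for the actual training routine, which is not shown here.

```python
from itertools import product

def grid_search(train_and_eval, batch_sizes=(16, 32, 64),
                learning_rates=(0.1, 0.01, 0.001, 0.0001, 0.00001)):
    """Return the (batch_size, learning_rate) pair with the best dev-set F1.

    train_and_eval(batch_size, lr) must train a model and return its F1
    score on the development subset.
    """
    best_score, best_params = -1.0, None
    for bs, lr in product(batch_sizes, learning_rates):
        f1 = train_and_eval(bs, lr)
        if f1 > best_score:
            best_score, best_params = f1, (bs, lr)
    return best_params, best_score
```

With 3 batch sizes and 5 learning rates, this evaluates 15 configurations per network, which is consistent with the modest tuning budget described above.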
As opposed to the models used as backbone for the Grad-CAM approach, the Mask R-CNN network with ResNet-101 as backbone could only be trained with a batch size of 8 images on AWS p3.2xlarge instances. The same learning rate values as for the CNN networks were used for tuning. However, this network was trained for a total of 600 epochs, as opposed to just 200 epochs for the other models. No other hyper-parameters specific to the Mask R-CNN network were fine-tuned.

Chalkiness classification in polished rice
Table 3 shows classification results for a variety of network architectures (and variants within one type of architecture) that were used as backbone for the Grad-CAM models. Specifically, we experimented with variants of the DenseNet, ResNet, SqueezeNet, VGG, and EfficientNet architectures. All the variants that we used have models pre-trained on ImageNet, which allowed us to perform knowledge transfer and train weakly supervised models for chalkiness detection with a relatively small number of chalky/non-chalky seed images. Only models that we could train on AWS p3.2xlarge instances were included in the table, to allow for a fair comparison in terms of training time. Each model was trained and fine-tuned on the training and development subsets consisting of polished rice seed images. Performance is reported in terms of overall accuracy, as well as precision, recall and F1 measure for both the chalky and non-chalky classes. The best results for each type of architecture are highlighted in bold font. For each model included in Table 3, Table 4 shows the training time (seconds), the number of parameters, and the size (MB) of the model versus its classification accuracy. As can be seen from Table 3, the overall classification accuracy varies from 93.67% (for EfficientNetB2 and EfficientNetB5) to 95.61% (for DenseNet-121).

Table 3 Classification results on polished rice with various networks as backbone in the weakly supervised Grad-CAM approach
The DenseNet-121 model, which has the highest classification accuracy, also has the highest F1 measure for both the chalky and non-chalky classes, although there is at least one competitive variant for each architecture type, e.g., ResNet-101 for ResNet, SqueezeNet-1.0 for SqueezeNet, VGG-16 for VGG, and EfficientNetB4 for EfficientNet. Furthermore, the DenseNet-121 model has a relatively small size (28 MB) and average training time (approximately 1500 s). Surprisingly, the SqueezeNet architecture, which is highly competitive in terms of performance, has the smallest size (3.0/2.9 MB for SqueezeNet-1.0/SqueezeNet-1.1, respectively) and smallest training time (approximately 500 s). The VGG models have the largest size (more than 500 MB) and relatively long training time (in the range of 2400 to 3000 s), and the best EfficientNet variant (EfficientNetB4) has moderate size (approximately 140 MB) but relatively long training time (approximately 3500 s). Finally, the ResNet-101 variant, which is the best in the ResNet group, has moderate size (170 MB) and training time (close to 1700 s). Based on these results, we selected one model for each type of architecture and used those selected models for further analysis.

Chalkiness detection in polished rice
To produce an accurate detection of the chalkiness area, we first studied the variation of the average IoU with respect to the layer used to generate the heatmaps and the threshold, T, used to binarize the heatmaps when calculating the IoU. The best layer/threshold combination was selected independently for each type of network using both qualitative and quantitative evaluations. Based on a preliminary visual inspection of the heatmaps, we observed that heatmaps corresponding to lower-level layers in a network result in better approximations of the chalkiness area, possibly because the progressive downsampling along the convolutional layers of the backbone CNN makes it hard to precisely recover the chalkiness information from the higher-level feature maps [91]. Therefore, for each type of network, we evaluated a lower-level layer (e.g., layer1_2_conv2 for ResNet-101), two intermediate layers (e.g., layer2_0_conv2 and layer3_1_conv2 for ResNet-101), and one high-level layer (e.g., layer4_1_conv3 for ResNet-101). The threshold, T, varied from 10% to 80% in increments of 10. We focused our analysis on ResNet-101 moving forward, as this network produced the best segmentation results overall. Table 5 shows the variation of performance (i.e., average IoU over the set of chalky seed images) with the layer and the threshold for ResNet-101. As shown in Table 5, for ResNet-101 we obtained better performance with a lower-intermediate layer (layer2_0_conv2), as opposed to a higher layer as reported in other studies [66,68], and a threshold of T = 60% of the highest pixel value, which is larger than the standard T = 15% [66] or T = 20% [68] thresholds frequently used in prior studies. A similar pattern is observed in terms of threshold for the other four networks. As for the layer, DenseNet-121 and VGG-16 give the best results using a lower layer, while the SqueezeNet-1.0 and EfficientNetB4 networks give the best results with a higher layer.
More details can be seen in Additional file 5: Table S3.
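The thresholding just described keeps only the pixels whose heatmap value is at least T percent of the image's maximum value. A small sketch of this binarization step (the function name is ours):

```python
import numpy as np

def binarize_heatmap(heatmap, t_percent=60):
    """Binarize a heatmap: keep pixels >= t_percent of the maximum pixel value."""
    peak = heatmap.max()
    if peak <= 0:
        return np.zeros_like(heatmap, dtype=bool)  # empty heatmap -> empty mask
    return heatmap >= (t_percent / 100.0) * peak
```

With T = 60%, a heatmap whose maximum is 1.0 keeps only pixels of value 0.6 or above; lowering T grows the predicted chalkiness mask, which is why T directly trades off against the IoU with the manual polygon annotations.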
To gain more insights into the heatmap layer and threshold, Fig. 5 shows qualitative and quantitative results obtained with Grad-CAM using ResNet-101 as backbone for 10 sample seed images in the test dataset, when considering three thresholds (20%, 40%, 60%) and four convolution layers. As can be seen in the figure, seeds with a larger chalky area (e.g., seeds 6 and 10) are less sensitive to the layer chosen, i.e., several layers produce heatmaps with high IoU scores. However, for seeds with a smaller or narrow chalky area, the results are more sensitive to the layer selected, and the best results are obtained with the intermediate layer, layer2_0_conv2. Another observation that can be made from Fig. 5 is that, overall, the lower layers tend to have sharper boundaries, as opposed to the higher layers, which have softer boundaries, making it harder to find a good threshold. This may be due to the fact that higher levels in the network correspond to lower-dimensional feature maps, which no longer preserve boundary details when interpolated back to higher dimensions. Additional files 6, 7, 8: Figs. S3-S5 show similar quantitative and qualitative results produced by the SqueezeNet-1.0, DenseNet-121 and VGG-16 networks, respectively, on the same 10 seeds shown in Fig. 5. Despite the good classification results obtained with these networks, the heatmaps show lighter colors and softer boundaries for the chalkiness area, and overall poorer chalkiness detection results as compared with the results of ResNet-101. A better understanding can be gained from Fig. 6, which shows a side-by-side comparison of the heatmaps produced by different networks and the corresponding binarized chalkiness masks. The masks obtained with Mask R-CNN are also shown.
The same conclusions regarding the superiority of ResNet-101 for chalkiness segmentation are supported by a quantitative evaluation of the networks in terms of localization metrics computed over the whole test set. The results of this evaluation are shown in Table 6 for the best performing models for each type of architecture considered as backbone (DenseNet-121, ResNet-101, SqueezeNet-1.0, VGG-19, and EfficientNetB4). For each network, the specific convolution layer and the threshold used to produce the results are shown in the last two columns of the table, respectively. The results obtained with the Mask R-CNN network, which has ResNet-101 as its backbone, are also shown in Table 6. As can be seen, the best results are obtained using the ResNet-101 network (for all metrics considered), while the next best results are obtained with DenseNet-121. Moreover, the weakly supervised Grad-CAM models outperform Mask R-CNN regardless of the network used as backbone. This shows that the weakly supervised approach is more effective for the chalkiness detection/segmentation problem, in addition to being less laborious in terms of data labeling, as compared to the Mask R-CNN segmentation approach.

Table 5 Variation of the average IoU (%) with the layer and threshold used for ResNet-101. The layer is used to generate the heatmaps and the threshold T is used to binarize the heatmaps (e.g., T = 20% means that only pixels with values of at least 20% of the maximum pixel value in the image are included in the binary mask). The layers were sampled to include a low-level layer (layer1_2_conv2), a high-level layer (layer4_1_conv3) and two intermediate layers (layer2_0_conv2 and layer3_1_conv2) that showed good results based on a qualitative inspection of the maps. The threshold T is varied from 20% to 80% in increments of 10. The best result and the corresponding layer and threshold are highlighted in bold font.
To understand if the results obtained with Grad-CAM and ResNet-101 can be further improved with an alternative weakly supervised approach, we also experimented with Grad-CAM++ and Score-CAM. The results are reported in Table 7 and show that Grad-CAM consistently outperforms its Grad-CAM++ variant and the Score-CAM approach. While this may seem surprising at first, we note that Grad-CAM++ has been designed to handle multiple occurrences of an object in an image. However, in our task, chalkiness is a concept with soft boundaries and does not present multiple occurrences. Thus, Grad-CAM++ may highlight additional regions that are not representative of chalkiness, as marked by the human annotators. As for Score-CAM, this approach has been shown to find firmer, less fuzzy localizations of the objects of interest. However, as chalkiness inherently has relatively fuzzy, soft boundaries, Score-CAM usually highlights a smaller area than the manual annotations, resulting in worse overall results compared with Grad-CAM, although better than Grad-CAM++.

Table 6 Chalkiness segmentation: results of the weakly supervised Grad-CAM approach with the best performing classification models as backbone. The results of Mask R-CNN with ResNet-101 as backbone are also shown. Only the 166 chalky seed images in the test set were used for chalkiness segmentation evaluation. Performance is reported using the following metrics (as applicable): Ground-Truth Localization Accuracy (GT-known Loc. Acc.), which represents the fraction of ground-truth chalky seed images with IoU ≥ 0.5; Localization Accuracy (Loc. Acc.), which represents the fraction of ground-truth chalky images, with IoU ≥ 0.5, correctly predicted by the model; and Average IoU (Avg. IoU), which represents the average IoU over the set of chalky seed images. To calculate the IoU, the mask of the predicted chalkiness is obtained using a threshold T = 60% of the maximum pixel intensity. The last two columns show the layer used for generating the heatmap and the threshold used to binarize the heatmap when calculating the IoU, respectively.
Additional file 9: Fig. S6 shows the heatmaps found by the three approaches (Grad-CAM, Grad-CAM++ and Score-CAM) by comparison with the manually annotated chalkiness area for four sample seeds. The heatmaps support our conclusions regarding Grad-CAM++ and Score-CAM results (shown in Table 7).

Chalkiness classification and detection in unpolished rice
Another objective of this study is to explore the applicability of the Grad-CAM approach to unpolished rice seeds and to study the transferability of models trained on polished rice to unpolished rice (as unpolished rice seeds can be harder to annotate manually). This is important because researchers working on large breeding populations involving hundreds of lines often obtain only small sample sizes and may not have the capacity to polish such small amounts of seed, which calls for models that can operate effectively on unpolished seeds. To address this objective, we performed experiments with three models that use ResNet-101 as their backbone: (1) a model trained on polished seed images, called the polished model; (2) a model trained on unpolished seed images, called the unpolished model; and (3) a model originally trained on polished seed images and subsequently fine-tuned on unpolished seed images, called the mixed model. All models were evaluated on the 240 seed images in the unpolished test set, which were manually annotated in terms of chalkiness area. These images belong to the 12 combinations corresponding to the Kati and CO-39 genotypes, i.e., the unpolished (12) set. The training and development sets used to train the unpolished and mixed models belong to the unpolished (12) set as well (see Table 2). Classification results for the three models are shown in Table 8, while segmentation results are shown in Table 9. As can be seen in Table 8, the mixed model performs the best overall in terms of classification metrics, although the unpolished model has similar performance for both chalky and non-chalky classes. However, as Table 9 shows, the unpolished model is by far the most accurate in terms of segmentation metrics, while the polished model is the worst.
To visually illustrate the output of each model, Fig. 7 shows the chalkiness prediction masks of the polished, unpolished and mixed models for four unpolished seeds. The polished model largely over-estimates the chalkiness area given the opaque nature of the unpolished seeds, as opposed to the translucent appearance of the polished seeds. The mixed model improves the masks, but not as much as the unpolished model, which is trained specifically on unpolished rice seeds. Together, these results suggest that not much knowledge can be transferred directly from polished images to unpolished images, as the appearance of chalkiness differs between polished and unpolished seeds. The results can be improved with the mixed model, which fine-tunes the polished models on unpolished rice, although the fine-tuned models still lag behind the models trained directly on unpolished rice.

Table 7 Comparison between the chalkiness segmentation results of the weakly supervised approaches Grad-CAM, Grad-CAM++ and Score-CAM with ResNet-101 as backbone on polished rice. Only the 166 chalky seed images in the polished test set were used for chalkiness segmentation evaluation. Performance is reported using the following metrics: Ground-Truth Localization Accuracy (GT-known Loc. Acc.), which represents the fraction of ground-truth chalky seed images with IoU ≥ 0.5; Localization Accuracy (Loc. Acc.), which represents the fraction of ground-truth chalky images, with IoU ≥ 0.5, correctly predicted by the model; and Average IoU (Avg. IoU), which represents the average IoU over the set of chalky seed images. To calculate the IoU, the mask of the predicted chalkiness is obtained using a threshold T = 60% of the maximum pixel intensity. The last two columns show the layer used for generating the heatmap and the threshold used to binarize the heatmap when calculating the IoU, respectively.
Hence, given the poor transferability across these two categories, models developed on polished or unpolished grains need to be chosen according to the intended objective.

Answers to the research questions and error analysis
As mentioned in Section "Experimental setup", we set out to answer three main research questions: [RQ1] aims to identify the best Grad-CAM models for polished rice, in terms of classification and segmentation performance; [RQ2] focuses on the segmentation performance of the weakly supervised Grad-CAM approach in comparison with Mask R-CNN; and [RQ3] focuses on the performance of models for classifying unpolished rice and the transferability of information from polished to unpolished rice.
To answer RQ1, we evaluated several CNN architectures in terms of classification accuracy, memory and time requirements, and chalkiness detection performance in polished rice. While the architectures studied have comparable classification performance, the ResNet-101 network was found to be superior with respect to chalkiness detection in polished rice, with relatively small memory and time requirements. Furthermore, to answer RQ2, we compared the best weakly supervised Grad-CAM models with the Mask R-CNN segmentation model and found that the Grad-CAM models performed better than Mask R-CNN, which requires more expensive pixel-level annotation. Overall, the chalkiness detection results obtained for polished rice are remarkably good, with an average IoU of 68.11%, a GT-known accuracy of 83.34% and a localization accuracy of 83.13%. Finally, to answer RQ3, we used Grad-CAM models trained on polished rice, on unpolished rice, and on a mix of polished and unpolished rice, and evaluated them on unpolished rice. When studying the transferability of the models trained on polished rice to unpolished rice, we found that fine-tuning on unpolished rice is necessary. In fact, models trained directly on unpolished rice performed the best in our study. More specifically, our evaluation on unpolished rice grain images showed that the best model trained directly with unpolished rice had an average IoU of 51.76%, while both the GT-known accuracy and the localization accuracy were 63.75%. It is not surprising that the models perform better on polished rice, as chalkiness is easier to detect after the interfering aleurone layer is removed through milling.

While the use of the Grad-CAM approach for rice chalkiness segmentation was highly successful, one challenge that we encountered was the tuning of the layer used to generate the heatmaps, as well as the threshold used to produce the binary masks for the chalkiness area. Our goal was to find a good overall layer and threshold for a model, to avoid the pitfall of tuning the threshold for each type of rice seed. Our analysis showed that a lower layer generally results in better chalkiness detection. One explanation for this is that higher levels undergo more extensive down-sampling (through successive applications of pooling layers), which causes a loss of information that cannot be recovered in the chalkiness heatmaps. Regarding the threshold for binarization, our results showed that a higher threshold (e.g., T = 60%) produces better overall results. One possible reason for the higher threshold is that our images have relatively low contrast between the chalky area and its neighboring area, as compared to other segmentation tasks for which weakly supervised approaches have been used. However, a threshold of 60% has also been used for binarizing gray images, e.g., fingerprint images [92] or textile pilling images [93], which are similar in nature to our chalkiness images.

Table 9 Chalkiness segmentation results of the weakly supervised Grad-CAM approach with ResNet-101 as backbone on unpolished rice. Only the 240 chalky seed images in the unpolished (12) test set were used for chalkiness segmentation evaluation. Performance is reported using the following metrics: Ground-Truth Localization Accuracy (GT-known Loc. Acc.), which represents the fraction of ground-truth chalky seed images with IoU ≥ 0.5; Localization Accuracy (Loc. Acc.), which represents the fraction of ground-truth chalky images, with IoU ≥ 0.5, correctly predicted by the model; and Average IoU (Avg. IoU), which represents the average IoU over the set of chalky seed images. To calculate the IoU, the mask of the predicted chalkiness is obtained using a threshold T = 60% of the maximum pixel intensity. The last two columns show the layer used for generating the heatmap and the threshold used to binarize the heatmap when calculating the IoU, respectively.
Error analysis of the polished models revealed several sources of error that lead to disagreement between model predictions and ground truth annotations. These sources are illustrated in Fig. 8 and include: (a) inconsistencies in the way chalkiness is manually annotated, due to the soft/fuzzy boundaries of chalkiness (as opposed to binary chalky versus non-chalky boundaries); (b) scratches or marks (referred to as noise) on the chalkiness area that are interpreted as non-chalkiness and lead to mismatches with the ground truth annotations in terms of the IoU metric; (c) irregular chalkiness shapes, which make it hard to annotate chalkiness very precisely; (d) abrasion stains that are recognized as chalkiness (white dots on the right in the figure), despite the fact that the Grad-CAM model uses deeper feature maps that presumably miss some "details"; and (e) irregular shapes and fuzzy boundaries that affect the ground truth annotations, and consequently the predictions, in unpolished rice as well. Despite such errors, we found that the best Grad-CAM model for unpolished rice, trained on the Kati and CO-39 genotypes, can generalize well to unpolished rice grains from the other genotypes included in the biological experiment. Additional file 10: Fig. S7 shows the prediction results of the unpolished model on 12 rice grains randomly selected from the genotypes not used in the training, together with their manual annotations.

Fig. 8 The sources of error can be summarized as: a inconsistencies in the way chalkiness is manually annotated, due to the white gradient nature of chalkiness; b scratches or marks (referred to as noise) on the chalkiness area, which can be interpreted as non-chalkiness; c irregular chalkiness shape, which makes it hard to annotate chalkiness very precisely; d abrasion stains, which can be recognized as chalkiness (white dots on the right in the figure); e irregular shape and fuzzy boundaries, which affect the ground truth annotations and the predictions in unpolished rice as well.
When analyzing images predicted as false positives by the model with ResNet-101 as backbone, we observed that the main reason for the model to predict non-chalky images as chalky is the presence of larger abrasion stains or damaged seed ends that are recognized as chalkiness, although not considered to be chalkiness by manual annotators. Some examples of false positive seed images, together with their corresponding chalkiness heatmaps produced by Grad-CAM are shown in Additional file 11: Fig. S8.

Tool availability and time requirements
In terms of time requirements, our experiments showed that the average time for training a ResNet-101 model on an EC2 p3.2xlarge instance available from AWS is 1668.41 s, as shown in Table 4, and no human intervention is required during that time. Once the model is trained, the average time to predict the label of a new image and create a chalkiness heatmap is less than 1 s using an EC2 p2.xlarge instance. Given these time requirements, and assuming that thousands of images need to be annotated for genetic mapping studies, our models could be extremely cost-effective and help save significant human effort and time that would otherwise be invested in manual annotation.

Development of rice with less chalk under future hotter climate
Quantifying rice chalkiness rapidly and accurately continues to be a limitation for capturing the degree of chalkiness across a wide range of genetic backgrounds, due to the lack of a high-throughput phenotyping tool. Developing such a tool is important and timely, as the proportion of chalky grains is bound to increase under warming scenarios, particularly with increasing night temperatures [19,94]. We used the tool developed based on Grad-CAM to determine the percent chalkiness area and the chalkiness score for each of the 13,101 unpolished rice grains extracted from the original scanned images. As opposed to the chalkiness area, which is obtained from a binary map, the chalkiness score considers the intensity of chalkiness for each pixel, with red indicating a greater proportion of chalk per pixel and blue indicating the least proportion of chalk per pixel (Figs. 5 and 6). Subsequently, we aggregated the percent chalkiness and the chalkiness score per sample (i.e., for each genotype/tiller/treatment combination). Using the aggregates, we analyzed differences between genotypes, tillers and treatments in terms of chalkiness in three scenarios. In scenario 1, where chalkiness was determined using the coarse chalky versus non-chalky classification of the grains, the analysis based on the number of grains with and without chalk resulted in poor analytical resolution and failed to detect any differences or significant interaction effects (Additional file 12: Table T4). In scenario 2, the analysis based on the proportion of chalkiness area determined from the Grad-CAM binarized heatmaps improved the prediction power: apart from the genotype (G) main effect, the interaction effects of HNT treatment (T) × G, G × panicle type (P), and T × G × P were significant (Table 10). This finding indicates that the approach was able to detect the differential proportion of chalkiness in different tillers across genotypes under HNT exposure during grain filling.
Using this approach, genotypic differences in the accumulation of chalkiness were observed: IR1561 and WAS-174 recorded an increase in chalkiness in grains from primary and other panicles as compared to the main tiller under HNT, while chalkiness was reduced in IR-22 and Kati and was not affected in CO-39 and Oryzica (Table 10). The percent change in the proportion of chalkiness under HNT in primary and other panicles compared to the main panicle ranged from −0.89% in IR1561 to 122% in WAS-174. In WAS-174, grains from both primary and other panicles recorded an increase in the proportion of chalkiness by 63% and 122%, respectively, compared to the main panicle under HNT (Table 10). In scenario 3, the chalkiness score was calculated using the pixel intensity in the chalkiness heatmaps produced by Grad-CAM; analysis of variance for the chalkiness score revealed significant effects of G, T × G, G × P and T × G × P, further indicating an improvement in the prediction potential for chalkiness among genotypes, treatments and tiller types (Table 10).
Similar to the proportion of chalky area, the chalkiness score increased under HNT compared to control in IR1561 and WAS-174 in primary and other panicles, while it decreased in IR-22 and Kati (Table 10). Among the genotypes, WAS-174 recorded the highest percent increase in chalkiness score under HNT in grains from primary (74%) and other panicles (59%) compared to the main panicle (Table 10). In contrast, Oryzica recorded an increase in chalkiness score under HNT in grains from primary (46%) and other panicles (99%) compared to the main panicle. Genotypes such as CO-39, IR1561 and IR-22 showed minimal changes in chalkiness score between tillers under HNT (Table 10). In summary, identifying and using germplasm with minimal chalkiness even under HNT (for example, CO-39 and Oryzica) will help develop rice varieties that can sustain quality under future warming scenarios without a negative impact on the economic revenue of rice farmers. In addition, the ability to obtain the level of chalkiness in less than 1 s per image, and in batch mode, allows these models to be used efficiently as a high-throughput phenotyping tool for capturing chalkiness in large breeding populations and for efficiently incorporating genetics leading to low grain chalkiness into ongoing global rice breeding programs.
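The two quantities analyzed above can be sketched as follows: an intensity-weighted chalkiness score per grain, and its aggregation per genotype/tiller/treatment sample. This is a hypothetical formulation consistent with the description in the text (mean heatmap activation over grain pixels); the exact scoring formula used in the study may differ.

```python
import numpy as np

def chalkiness_score(heatmap, grain_mask):
    """Intensity-weighted chalkiness score: mean Grad-CAM activation over
    grain pixels, so intense (red) pixels contribute more than faint
    (blue) ones. Illustrative formulation, not the paper's exact one."""
    return float(heatmap[grain_mask].mean())

def aggregate_per_sample(records):
    """Average a per-grain metric over each (genotype, tiller, treatment)
    sample, mirroring the per-sample aggregation used in the analysis.

    records: iterable of ((genotype, tiller, treatment), value) pairs
    """
    sums, counts = {}, {}
    for key, value in records:
        sums[key] = sums.get(key, 0.0) + value
        counts[key] = counts.get(key, 0) + 1
    return {key: sums[key] / counts[key] for key in sums}

# toy per-grain scores (values are made up for illustration)
records = [(("WAS-174", "primary", "HNT"), 0.5),
           (("WAS-174", "primary", "HNT"), 0.7),
           (("IR-22",   "main",    "control"), 0.2)]
print(aggregate_per_sample(records)[("WAS-174", "primary", "HNT")])  # → 0.6
```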

Limitations of the study
While the methodology proposed in this study provides tremendous benefits to global rice breeding programs, we would also like to point out its limitations:

• We have shown that the weakly supervised approach, Grad-CAM, can be used to detect chalkiness in white rice (both polished and unpolished grains, with higher accuracy for polished grains, as expected). However, the approach may not work as well for coloured rice such as brown or black rice, given the more opaque nature of coloured rice grains.

• Our experiments showed that models cannot effectively be transferred from polished rice to unpolished rice. Instead, models trained specifically on unpolished rice have better accuracy. This result suggests that new models may need to be trained and fine-tuned (in terms of the threshold T for binarization and the convolutional layer used for the heatmap) for other types of rice, or for images taken under different conditions.

• The sets of images used in this study contained mostly seeds that did not touch each other. The ability of the tool to determine chalkiness in samples without physical separation of grains was not tested in this study. Hence, the tool would require additional training to quantify chalkiness under different proportions of overlap between grains.

• Our approach was designed to determine the overall chalkiness of a grain; it does not distinguish specific chalkiness types such as white-belly, white-core, or white-base. However, the developed models set the stage for further refinement to determine the proportions of the different chalk types in future work.

Conclusions
In this study, we presented the application of a high-throughput deep learning tool to detect the chalky area in polished and unpolished rice grains. To avoid the need for cumbersome pixel-level annotation, we used a weakly supervised segmentation approach, Grad-CAM, which addresses the problem as a binary classification task and subsequently uses the gradients of the chalky-class score to produce a chalkiness heatmap. Experimental results showed that it is possible to use the Grad-CAM model with ResNet-101 as a backbone.

Table 10 Percentage chalkiness area and chalkiness score were obtained for individual seeds randomly selected across treatments and genotypes
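The Grad-CAM combination step underlying the heatmaps can be sketched as follows. This is a minimal NumPy illustration of the standard Grad-CAM computation (channel weights from global-average-pooled gradients, a ReLU-ed weighted sum of feature maps), not the trained ResNet-101 pipeline; the layer shapes and toy values are assumptions.

```python
import numpy as np

def grad_cam_heatmap(activations, gradients):
    """Combine convolutional feature maps A^k with the gradients of the
    target-class score to form a Grad-CAM heatmap:
        alpha_k = global-average-pooled gradient for channel k
        CAM     = ReLU(sum_k alpha_k * A^k), rescaled to [0, 1]

    activations, gradients : arrays of shape (K, H, W) taken from the
    chosen convolutional layer (illustrative shapes, not the paper's).
    """
    alphas = gradients.mean(axis=(1, 2))             # one weight per channel
    cam = np.tensordot(alphas, activations, axes=1)  # weighted sum over channels
    cam = np.maximum(cam, 0.0)                       # ReLU keeps positive evidence
    if cam.max() > 0:
        cam = cam / cam.max()                        # rescale to [0, 1]
    return cam

# toy layer with 2 channels on a 2x2 grid
A = np.array([[[1.0, 0.0], [0.0, 0.0]],
              [[0.0, 1.0], [0.0, 0.0]]])
g = np.array([[[1.0, 1.0], [1.0, 1.0]],   # alpha_0 = 1
              [[0.0, 0.0], [0.0, 0.0]]])  # alpha_1 = 0
print(grad_cam_heatmap(A, g))  # → [[1. 0.] [0. 0.]]
```

In practice the low-resolution CAM is upsampled to the input image size before thresholding, which is why the resulting heatmap is smooth rather than blocky.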