Skewed distribution of leaf color RGB model and application of skewed parameters in leaf color description model

Background: Image processing techniques have been widely used in the analysis of leaf characteristics. Earlier techniques for processing digital RGB color images of plant leaves had several drawbacks, such as inadequate de-noising and the adoption of normal-probability statistical estimation models, which have few parameters and limited applicability.
Results: We confirmed the skewed-distribution characteristics of the red, green, blue and grayscale channels of images of tobacco leaves. Twenty skewed-distribution parameters were computed, including the mean, median, mode, skewness, and kurtosis. We used the mean parameter to establish a stepwise regression model that is similar to earlier models. Other models based on the median and the skewness parameters led to accurate RGB-based description and prediction, as well as better fitting of the SPAD value. More parameters improved the accuracy of RGB model description and prediction, and extended its application range. Indeed, the skewed-distribution parameters can describe changes in leaf color depth and homogeneity.
Conclusions: The color histogram of the blade images follows a skewed distribution, whose parameters greatly enrich the RGB model and can describe changes in leaf color depth and homogeneity.


Background
In recent years, high-throughput techniques for phenotype identification in greenhouses and fields have been proposed in combination with non-invasive imaging, spectroscopy, robotics, high-performance computing and other new technologies, to achieve higher resolution, accuracy and speed [1,2]. With the increasing maturity of digital image technology and the rising popularity of high-resolution camera equipment, qualitative and quantitative descriptions of the phenotypic traits of plant appearance using digital imaging techniques are becoming more feasible [3][4][5][6]. Digital cameras can record spectral leaf information in the visible color bands at high resolution and low cost [7]. In addition, digital color images contain rich information on plant morphology, structure, and leaf color. Leaf digital images are therefore often used to identify changes in leaf color [8][9][10].
The most commonly used color representation for digital color images is the RGB color model. For an RGB color image, three color sensors per pixel can be used to capture the intensity of light in the red, green, and blue channels, respectively [11]. Existing software tools, such as MATLAB, are used to process the obtained digital pictures [12]. The study of RGB color models of plant leaves has a long history [13]. After decades of development, the RGB color information of plant leaves has been exploited for the determination of chlorophyll content and indicators of changes in this content [14]. To exploit the data further, researchers suggested a number of RGB-based color features for the determination of chlorophyll levels in potato, rice, wheat, broccoli, cabbage, barley, tomatoes, quinoa and amaranth [15][16][17][18][19][20][21][22][23]. Many formulas have also been suggested to determine leaf chlorophyll content based on RGB components, such as $(R_{Mean} - B_{Mean})/(R_{Mean} + B_{Mean})$, $G_{Mean}/(R_{Mean} + G_{Mean} + B_{Mean})$, $R_{Mean}/(R_{Mean} + G_{Mean} + B_{Mean})$ and $G_{Mean}/R_{Mean}$ [20]. However, the problem of the small amount of available information still persists. This information scarcity has become a bottleneck in the application of RGB models, greatly limiting their use.
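The mean-based indices quoted above can be illustrated with a short sketch. Python is used here only as a stand-in for the MATLAB workflow described in this paper; the function name `color_indices` and the sample channel means are hypothetical and are not taken from the study.

```python
def color_indices(r_mean, g_mean, b_mean):
    """Return the four mean-based color indices listed in the text.

    Inputs are the per-channel mean intensities of one leaf image
    (assumed here to be on the usual 0-255 scale).
    """
    total = r_mean + g_mean + b_mean
    return {
        "(R-B)/(R+B)": (r_mean - b_mean) / (r_mean + b_mean),
        "G/(R+G+B)": g_mean / total,
        "R/(R+G+B)": r_mean / total,
        "G/R": g_mean / r_mean,
    }


# Made-up channel means, for illustration only.
indices = color_indices(r_mean=90.0, g_mean=120.0, b_mean=30.0)
for name, value in indices.items():
    print(f"{name} = {value:.4f}")
```

All four indices are functions of the same three channel means, which is why they add little information beyond the means themselves.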
In the analysis of RGB data of leaf images, the cumulative frequency distributions of the $R_{Mean}$, $G_{Mean}$ and $B_{Mean}$ components have generally been assumed to follow a normal distribution. However, recent studies have reported that the cumulative frequency distributions of leaf colors follow skewed distributions. For example, Wu et al. found that the cumulative frequency of tea leaf color has a skewed distribution, and that the deviations of new and old leaves differed clearly [21]. Also, the moisture condition in maize leaves is related to the deviation of the grayscale values in the RGB blade model [22]. The asymmetry of a skewed distribution can be described by the partial frequency distributions of the skewed distribution curve. Several parameters can be derived from a skewed distribution, including the mean, median, mode, skewness, kurtosis, and others.
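For reference, the skewness and kurtosis named above can be written in their standard moment-based form. The paper does not state which estimator was used; the expressions below are the uncorrected sample versions, which we assume match the defaults of the MATLAB skewness and kurtosis functions. Here $x_1, \dots, x_n$ are the pixel intensities of one channel.

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
m_k = \frac{1}{n}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^k, \qquad
S = \frac{m_3}{m_2^{3/2}}, \qquad
K = \frac{m_4}{m_2^{2}}
```

A normal distribution has $S = 0$ and $K = 3$. A positive $S$ indicates a longer tail toward high intensities, and a negative $S$ a longer tail toward low intensities.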
The SPAD leaf chlorophyll meter is one of the most widely used hand-held meters for rapid and non-destructive assessment of the chlorophyll content in many crops [23]. In this paper, we analyzed the frequency distributions of the red, green, blue and grayscale channels in RGB leaf images and confirmed the skewed characteristics of these distributions. By extracting the relevant distribution parameters, we established models relating the color characteristic parameters to the SPAD chlorophyll concentration values. When the skewness parameter was exploited, we found that both the fitting degree and the prediction accuracy were greatly improved. The proposed spatial model could predict the SPAD values more accurately, and explain the physiological significance of the leaf color changes. We hope that this work will provide researchers with a new method for the analysis of blade color patterns in RGB digital images.

Experimental design
In this work, tobacco was planted in pots on November 25, 2017 at Shanghang County Township, Fujian, China (24°57′N, 116°30′E). The 50-day-old seedlings were transferred to the field. After 15 days, 400 new tobacco leaves that exhibited consistent normal growth and leaf color, with no signs of pests or diseases, were tagged. At each of 40, 50, 60 and 65 days of leaf age, 100 leaves were collected. For each leaf, the SPAD value was measured at 10 AM. The leaves were then picked and immediately taken to a dark room to be photographed.

Leaf image collection
On the same day of plant sampling, tobacco leaves were transferred to a platform in a dark room. The platform used for image acquisition was a rectangular desktop 300 cm long, 200 cm wide, and 80 cm high. The desktop bottom plate was a white matte scrub countertop. Images were captured using a high-resolution camera (CANON EOS-550D, Canon Company, Japan) with a resolution of 3840 × 5120 pixels. The camera was mounted on a tripod at the nadir position with a constant height of 1 m above the top of the platform. The light sources were two 20-W strip white LED lamps with a color temperature of 4000 K. To ensure light uniformity, the lamps were suspended over the platform at 1/4 and 3/4 of the 200 cm distance to the fixed digital camera.

Leaf image segmentation, denoising and color feature extraction
The commercial image-editing software Adobe Photoshop CS was used to manually cut the leaf out of each original image, save it as a PNG image with a transparent background, and adjust the image size to 1000 × 1330 pixels. The MATLAB 2016R computing environment was used for the extraction and analysis of the color image data. First, the imread and rgb2gray functions were used to read each color image and obtain its gray-level information, respectively. Then, the double function was used to convert each array into a double-precision array. Finally, the mean, median, mode, skewness and kurtosis functions were applied to the double-precision arrays of the red, green and blue channels and of the gray-level image to obtain the mean, median, mode, skewness, kurtosis, and other parameters for each color leaf image.
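The extraction step can be sketched as follows. This is a minimal Python stand-in for the MATLAB calls named above, not the code used in the study. It assumes the leaf pixels have already been read into a list of (R, G, B) tuples with the transparent background removed, uses the luminance weights documented for rgb2gray, and uses the uncorrected moment-based skewness and kurtosis. The names `describe`, `leaf_parameters` and `sample_pixels` are hypothetical.

```python
from collections import Counter
from statistics import mean, median


def describe(values):
    """Mean, median, mode, skewness and kurtosis of one channel."""
    n = len(values)
    mu = mean(values)
    # Central moments of order 2, 3 and 4 (uncorrected, divide by n).
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    m4 = sum((v - mu) ** 4 for v in values) / n
    counts = Counter(values)
    top = max(counts.values())
    return {
        "mean": mu,
        "median": median(values),
        # On ties, take the smallest of the most frequent values.
        "mode": min(v for v, c in counts.items() if c == top),
        # Both ratios are undefined for a constant channel (m2 == 0).
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2,
    }


def leaf_parameters(pixels):
    """Return the 20 parameters: 5 statistics for each of R, G, B and gray."""
    red = [p[0] for p in pixels]
    green = [p[1] for p in pixels]
    blue = [p[2] for p in pixels]
    # Gray level as a weighted sum of the channels, rounded to an integer.
    gray = [round(0.2989 * r + 0.5870 * g + 0.1140 * b) for r, g, b in pixels]
    channels = (("R", red), ("G", green), ("B", blue), ("Gray", gray))
    return {name: describe(values) for name, values in channels}


# Five made-up leaf pixels, for illustration only.
sample_pixels = [(10, 100, 20), (10, 110, 20), (10, 120, 30),
                 (40, 200, 30), (80, 120, 40)]
params = leaf_parameters(sample_pixels)
for channel, stats in params.items():
    print(channel, stats)
```

In practice each leaf image contributes hundreds of thousands of pixels rather than five, but the per-channel computation is the same.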

Color cumulative histogram construction and normality testing
The imread and rgb2gray functions were used to read each color image and obtain its gray-level counterpart. Then, using the image histogram functions, the cumulative histograms of the double-precision arrays of the red, green, blue and gray-level data were obtained. The Lilliefors and Jarque-Bera tests were used to test the normality of the distributions.
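The two steps described here can be sketched in Python as a stand-in for the MATLAB functions. The sketch builds a cumulative histogram over integer intensity levels and computes the Jarque-Bera statistic from the sample skewness and kurtosis. The p-value uses the asymptotic chi-square approximation with two degrees of freedom, which is an assumption of this sketch; statistical packages may instead use tabulated critical values for small samples. The Lilliefors test needs such tables and is not reproduced. The function names are hypothetical.

```python
import math


def cumulative_histogram(values, levels=256):
    """Cumulative relative frequency for integer intensities 0..levels-1."""
    counts = [0] * levels
    for v in values:
        counts[v] += 1
    cumulative, running = [], 0
    for c in counts:
        running += c
        cumulative.append(running / len(values))
    return cumulative


def jarque_bera(values):
    """Return (JB statistic, asymptotic p-value) for a sample."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    m4 = sum((v - mu) ** 4 for v in values) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    jb = n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)
    # Survival function of a chi-square variable with 2 degrees of freedom.
    p_value = math.exp(-jb / 2.0)
    return jb, p_value


# Made-up gray levels with a long right tail, for illustration only.
gray_levels = [20, 21, 21, 22, 22, 22, 23, 25, 40, 90]
cdf = cumulative_histogram(gray_levels)
jb, p = jarque_bera(gray_levels)
print(f"JB = {jb:.3f}, p = {p:.4f}")
```

A small p-value rejects the hypothesis that the intensities are normally distributed, which is the decision rule applied to the leaf images in this paper.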

Chlorophyll concentration measurement
To measure the chlorophyll concentration, a chlorophyll meter (SPAD-502, Zhejiang Topuiunnong Technology Co., Ltd., China) was used to obtain the SPAD values of 50 fully expanded tobacco leaves at each of 40, 50, 60 and 65 days of age. Each leaf blade was measured at five points: one on the upper part, two at the middle part, and two near the petiole, on both sides of the leaf. During measurement, the sample completely covered the receiving window, and the veins were avoided so that only the mesophyll tissue was measured. For each blade, the SPAD value is the mean of the five measured points.

Model building and goodness-of-fit testing
We mainly used the IBM SPSS Statistics 22 software to analyze the blade features at ages of 40, 50, 60 and 65 days, and to establish multivariate linear regression models, $F_1$ and $F_2$, by stepwise regression. In the $F_1$ model, we obtained the parameters ($R_{Mean}$, $G_{Mean}$, $B_{Mean}$) using the mean function for the three color channels. Then, we used each of these three parameters and ten combinations of them (namely (

Computer equipment
In this work, images and data were processed using a virtual private server. The hardware resources included an Intel Xeon E5-2640 CPU at 2.5 GHz and two 8 GB DDR4 RAM modules. This server type can perform a billion double-precision real-time floating-point operations.

Distribution characteristics and normality verification of color gradation cumulative frequency of leaf-color RGB model
In previous studies, the histogram of RGB leaf colors was mostly assumed to follow a normal distribution [24][25][26][27]. However, the validity of this assumption was contested by some reports. To verify the suitability of the proposed method, we designed an experiment that involves tobacco leaf images with different sample sizes and growth periods. We found that the tobacco leaves gradually decayed, and that the leaf color changed from green to yellow after 40 days. All histograms of single-leaf RGB images at different leaf ages (40, 50, 60, and 65 days) had skewed distributions (Fig. 1). None of the color distributions (red, green, blue or grayscale) was completely normal, and the skewness changed regularly with increasing leaf age. To further confirm our histogram-based findings, we performed the Lilliefors and Jarque-Bera normality tests using the color gradation data of 50 leaves. The hypothesis test result was 1 (the normality hypothesis was rejected), and the p value was 0.001 (< 0.05). This means that the leaf color distribution is skewed rather than normal.

Correlation between skewed-distribution parameters and SPAD values
We have shown that the leaf RGB color distribution is a skewed distribution. Using skewed-distribution analysis in MATLAB, we obtained 20 parameters, namely the mean, median, mode, skewness and kurtosis of each of the red, green, blue and grayscale channels. In the individual-leaf color distribution, the skewness and kurtosis parameters represent the state of the leaf color distribution (Table 1). The skewness changed markedly with leaf age, decreasing from positive to negative values. This also indicates that the color distribution of tobacco leaves is skewed throughout their lifetime. The SPAD values first increased and then decreased. The correlations between the mean parameters and their combinations and the SPAD values are given in Table 2. In Table 3, we carried out a correlation analysis between the 20 RGB skewed-distribution parameters and the SPAD values using 200 leaves of four leaf ages. The results showed that 17 of the 20 parameters were significantly correlated with the SPAD values at the 0.01 level. This means that the change in chlorophyll content was highly correlated with the change in leaf color. Although the chlorophyll is not uniformly distributed over the leaf area, this non-uniformity is numerically related to the increase in skewness.
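The correlation analysis can be illustrated with a minimal Python sketch, offered as a stand-in for the statistical software used in this work. It computes the Pearson correlation coefficient between one color parameter and the SPAD values, together with the t statistic commonly used to test it for significance. The function names and the sample values are hypothetical, and the data are made up.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom.

    Undefined for a perfect correlation (|r| == 1).
    """
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Made-up skewness values of one channel and SPAD readings for six leaves.
skewness_values = [0.42, 0.31, 0.10, -0.05, -0.22, -0.40]
spad_values = [38.1, 40.3, 41.0, 37.5, 33.2, 29.8]
r = pearson_r(skewness_values, spad_values)
t = t_statistic(r, len(spad_values))
print(f"r = {r:.3f}, t = {t:.3f}")
```

The t statistic would then be compared with the two-tailed critical value at the chosen significance level, such as the 0.01 and 0.05 levels marked in Tables 2 and 3.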

Construction of the correlation models between the SPAD and leaf color parameters
The correlation model can be established from the leaf color parameters based on the skewed distribution and the SPAD value. In previous studies, researchers generally used stepwise regression methods based on ordinary least squares (OLS) to construct the association model. For comparison with previous models, we used the mean parameters $R_{Mean}$, $G_{Mean}$, $B_{Mean}$ and their combinations to establish multivariate linear regression models by stepwise regression, and chose the best combination as the model $F_1$ (Table 4). We also extended the parameter range and adopted all 20 parameters to establish multivariate linear regression models by stepwise regression, and chose the best as the model $F_2$. We found that the leaf color parameters changed linearly with increasing leaf age, while the SPAD value first increased and then decreased. Since different color gradations represent different wavelengths of light, we were inspired to use Fourier functions for fitting, which gave the model $F_3$ (Fig. 2). The leaf color showed different kinds of change, both in depth and in heterogeneity at different positions, with non-planar characteristics. Therefore, to model the bidirectional changes of leaf color (i.e. the change of leaf color depth and distribution), we used the MATLAB Curve Fitting Toolbox to fit the polynomial $F_4$ that incorporates spatial bidirectional patterns (Fig. 3).
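To make the Fourier-type fit concrete, the sketch below fits a one-term Fourier model y = a0 + a1 cos(w x) + b1 sin(w x) by ordinary least squares in Python. It is an illustration under stated assumptions, not the fitted model $F_3$: the frequency w is fixed in advance so that the problem stays linear, whereas a curve-fitting tool would normally estimate w as well, and the predictor x and all data are made up. The names `solve_linear`, `fit_fourier1` and `predict_fourier1` are hypothetical.

```python
import math


def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_fourier1(x, y, w):
    """Least-squares estimates [a0, a1, b1] for a fixed frequency w."""
    basis = [[1.0, math.cos(w * xi), math.sin(w * xi)] for xi in x]
    # Normal equations: (B^T B) coeffs = B^T y.
    btb = [[sum(row[i] * row[j] for row in basis) for j in range(3)]
           for i in range(3)]
    bty = [sum(row[i] * yi for row, yi in zip(basis, y)) for i in range(3)]
    return solve_linear(btb, bty)


def predict_fourier1(coeffs, x, w):
    """Evaluate the fitted one-term Fourier model at x."""
    a0, a1, b1 = coeffs
    return a0 + a1 * math.cos(w * x) + b1 * math.sin(w * x)


# Made-up predictor values and SPAD-like responses, for illustration only.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [40.0 + 3.0 * math.cos(v) - 2.0 * math.sin(v) for v in xs]
coeffs = fit_fourier1(xs, ys, w=1.0)
print([round(c, 6) for c in coeffs])
```

A periodic basis can follow a response that rises and then falls, which a purely linear model in the same predictor cannot do.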
To assess the advantages and disadvantages of the four models, we compared their fitting performance (Table 5). The models $F_2$, $F_3$ and $F_4$ had higher $R^2$ values, and the $R^2$ of $F_4$ was 21% higher than that of $F_1$. To evaluate the prediction accuracy of the four models, we collected another batch of leaf images at the four leaf ages, with 50 blades for each age (Table 5). The models $F_2$ and $F_4$ gave more accurate predictions, and the accuracy of $F_4$ was 5% higher than that of $F_1$. The SSE and RMSE metrics of the $F_4$ model were superior to those of the other models. Therefore, the model $F_4$, based on the spatial feature polynomial with the spatial bidirectional patterns, is the optimal model.

Fig. 1 Color gradation cumulative frequency histograms for single leaves at four different leaf ages
The leaves were picked at random. Color gradation cumulative frequency histograms of the red, green, and blue color channels as well as the gray-level images are shown at 40, 50, 60 and 65 days of leaf age. The X-axis is the cumulative frequency, and the Y-axis is the intensity level frequency
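The goodness-of-fit metrics used to compare the four models ($R^2$, SSE and RMSE) can be computed as in the following Python sketch. It assumes the plain definitions, with RMSE taken as the square root of SSE divided by the number of observations; fitting tools may instead divide by the residual degrees of freedom. The function name `fit_metrics` and the numbers are hypothetical.

```python
import math


def fit_metrics(observed, predicted):
    """Return SSE, RMSE and R^2 for paired observed and predicted values."""
    n = len(observed)
    mean_obs = sum(observed) / n
    # Sum of squared errors between observations and predictions.
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total sum of squares around the observed mean.
    sst = sum((o - mean_obs) ** 2 for o in observed)
    return {
        "SSE": sse,
        "RMSE": math.sqrt(sse / n),
        "R2": 1.0 - sse / sst,
    }


# Made-up SPAD observations and model predictions, for illustration only.
observed_spad = [38.0, 41.0, 36.0, 30.0]
predicted_spad = [37.5, 40.0, 36.5, 31.0]
metrics = fit_metrics(observed_spad, predicted_spad)
print(metrics)
```

A lower SSE and RMSE and a higher $R^2$ indicate a closer fit, which is the basis on which the models are ranked here.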

Discussion
In the past, the use of RGB models for leaf color analysis had obvious limitations. The biggest drawback of such models was that they offered too few parameters, namely only the mean values of the red, green, blue, and grayscale intensities [24]. Although previous studies have proposed a variety of models based on combinations of these parameters, no plausible explanation was given for the physiological significance of these parameters in describing leaf color changes [21,22]. The reason for this was that when RGB features were extracted from digital images, the descriptive statistics were based on a normal distribution. This normality assumption is only a convenience for finding approximate values, and it cannot reflect the distribution of leaf colors in a comprehensive and truthful way.
In this work, we verified through general normality tests that the RGB color gradation histogram followed a skewed distribution for tobacco leaves of different leaf ages. As a result, we extended the set of color gradation distribution parameters in the RGB model. These parameters include the mean, median, mode, skewness, and kurtosis. This gives a total of 20 parameters for 4 channels, whereas the only parameter commonly used under the normality assumption is the mean value.
Each of these parameters reflects some property or trait of leaf color. When the mean value is extracted based on a normality assumption, the leaf color heterogeneity is ignored. The mean can only describe the state of the leaf color depth quantitatively. This cannot fully reflect a real leaf color distribution at any leaf age. The description of the skewed distribution not only expands quantitative leaf color information but also systematically characterizes the leaf color depth and homogeneity. The skewness and kurtosis are features that mainly reflect the leaf color homogeneity. These features make it possible to accurately and quantitatively describe leaf color from different aspects.

Table 1 Parameters using skewed-distribution analysis and the SPAD values
The 20 parameters comprise the mean, median, mode, skewness and kurtosis of the red, green, and blue color channels and of the gray-level images, obtained with MATLAB from 50 fully expanded tobacco leaves at each of 40, 50, 60 and 65 days. The SPAD values also come from the 50 leaves for each leaf age. Each leaf blade was measured at five points: one on the upper part, two at the middle part and two near the petiole, on both sides of the leaf. Values without a common letter are significantly different according to the Duncan test (p < 0.05)

Table 2 Correlation between the mean parameters and their combinations for tobacco leaves and the blade SPAD values
The mean parameters of the red, green, and blue color channels as well as the gray-level images were obtained using 50 fully expanded tobacco leaves at each of 40, 50, 60 and 65 days. The SPAD values also come from 50 leaves at each leaf age. Each leaf blade was measured at the same five points described for Table 1
** Indicates significant correlation according to a two-tailed test (p < 0.01)
* Indicates significant correlation according to a two-tailed test (p < 0.05)

Table 3 Correlation between the skewed-distribution parameters and the blade SPAD values of the tobacco leaves
The 20 parameters of the red, green, and blue color channels as well as the gray-level images were obtained with MATLAB using 50 fully expanded tobacco leaves at each of 40, 50, 60 and 65 days
** Indicates significant correlation according to a two-tailed test (p < 0.01)
* Indicates significant correlation according to a two-tailed test (p < 0.05)

We found 17 of the 20 parameters to be significantly correlated with the SPAD value at the 0.01 significance level. We therefore tried to model the chlorophyll content and distribution of leaves with these parameters. In earlier studies, the mean parameters of the R, G, and B components as well as their combinations were generally used with a normality assumption to establish models by stepwise regression. We also used this method to obtain the model $F_1$. After comparing the models $F_2$, $F_3$ and $F_4$, which use skewed-distribution parameters, with $F_1$, we found that the model based on the median and the skewness could better fit the SPAD value. More parameters increased the accuracy of the RGB model description and prediction, and extended its application range. When we used the Fourier method in the model $F_3$, we found that the fitting degree was higher than that of the model $F_1$, indicating that the numerical SPAD distribution was more in line with the curve distribution. Predicting the SPAD value from the mean value alone did not work well. This means that the depth of the leaf color alone cannot describe the leaf color accurately. When we introduced the skewness, we found that both the fitting degree and the prediction accuracy were greatly improved. So, these skewed-distribution parameters can describe changes in leaf color depth and homogeneity.
To sum up, the color distribution histogram of blade images follows a skewed distribution, whose parameters (such as the mean, median, mode, skewness, and kurtosis) greatly enrich the RGB model. We hope that this work will provide researchers with a new method for the analysis of blade color patterns in RGB digital images. This work should also inspire the extraction and exploitation of novel leaf color descriptors for plant monitoring and treatment.