
Rapid identification of medicinal plants via visual feature-based deep learning

Abstract

Background

Traditional Chinese Medicinal Plants (CMPs) hold a central place in China's healthcare system and cultural heritage. Their use for health protection and clinical treatment has been practiced and refined over thousands of years, and they remain indispensable in both traditional and modern medical care. Accurate identification of CMPs is therefore essential, because confusion arising from different processing conditions and cultivation environments can compromise clinical safety and medication efficacy.

Results

In this study, we use a self-developed device to acquire high-resolution images and construct a visual multi-variety CMPs image dataset. First, a random local data enhancement preprocessing method is proposed to enrich the feature representation of imbalanced data through random cropping and random shadowing. Then, a novel hybrid supervised pre-training network is proposed that extends Masked Autoencoders (MAE) with a parallel classification branch to improve the integration of global features. By combining global features with local details, it effectively enhances the feature capture capability. In addition, newly designed losses, based on a reconstruction loss and a classification loss, are introduced to improve training efficiency and learning capacity.

Conclusions

Extensive experiments are performed on our dataset as well as a public dataset. The experimental results demonstrate that our method achieves the best performance among state-of-the-art methods, highlighting its efficiency and its good prospects for real-world plant-identification applications.

Introduction

Chinese Medicinal Plants (CMPs) can be used directly in the clinical practice of traditional Chinese medicine. They have been an essential part of healthcare for thousands of years, with a focus on using natural plant-based remedies to promote health, prevent illness, and treat various medical conditions [1,2,3]. CMPs are employed as either a primary or complementary method to address a diverse spectrum of health concerns, from minor ailments to chronic conditions. Their important role in the prevention and treatment of many epidemic, chronic, and infectious diseases, such as COVID-19, has been widely demonstrated and recognized by the international community [4,5,6]. The quality of CMPs is one of the major factors in ensuring medication safety and clinical security [7,8,9,10].

Typically, biological techniques and chemical methods, such as mass spectrometry and gas chromatography, can be used for adulteration detection [11,12,13]. However, these analyses require highly trained professionals and are time-consuming. Molecular markers offer a fast and promising analytical alternative, but they are cumbersome to apply and demand a high level of expertise [14, 15]. Additionally, manual identification of CMPs lacks objectivity and scientific rigor. As an effective alternative, the identification of CMPs based on intelligent sensory technology (such as electronic noses, electronic tongues, and electronic eyes) has attracted strong attention. Previous works [16,17,18] have demonstrated effective discrimination, but they require expensive equipment and are not efficient. Moreover, image processing based on hand-designed features relies heavily on shallow visual features and fails to capture high-level semantic features. Consequently, approaches for rapid and accurate identification of CMPs are needed to meet practical use and market demands.

With continuous innovation and research in computer technology, deep learning for image processing has been widely adopted for identification tasks in food, plant, agriculture, medical care, and many other fields [19,20,21,22,23]. It has also been used for the identification of CMPs. Zhou et al. [24] combined near-infrared spectroscopy and convolutional neural networks to analyze medicinal plants from different origins. Wang et al. [25] proposed hyperspectral imaging assisted by an attention mechanism and a long short-term memory network to identify the origin of coix seed and predict its nutritional content. Miao et al. [26] fused ConvNeXt with the ACMix network to extract features and classify traditional Chinese medicine. Bai et al. [27] combined deep learning and spectral fingerprint features to accurately predict the soluble solids content of jujube from multiple geographical areas. Yan et al. [28] used visible/near-infrared spectroscopy combined with deep learning to identify the geographical origin of licorice. Yue et al. [29] employed near-infrared 2DCOS images combined with a residual neural network to identify the origin of Yunnan's big leaves. Compared with the widely used generative adversarial networks (GANs) [30, 31] and CNN-based methods [32,33,34], Masked Autoencoders (MAE) [35] have attracted broad attention because they reduce the dependence on labeled data. In this paper, our goal is to investigate a rapid and effective strategy for identifying different varieties of CMPs. Inspired by MAE and CoAtNet, a hybrid structure fusing MBConv [36] and Transformer [37] blocks is designed to better capture both the local details and the global features needed for CMP classification.

To the best of our knowledge, there is no public medicinal fruit plant dataset, so we create a new one. We build a comprehensive visual multi-variety CMPs image dataset in which high-resolution images are captured using a self-developed acquisition device; the details are given in Sect. 2. In addition, to enhance MAE's ability to extract global features and to reduce information loss, we propose a novel framework. The overview of our model is illustrated in Fig. 1, and details of the proposed methods can be found in Sect. 3. Finally, the experimental results and analysis are presented in Sect. 4, and conclusions are drawn in Sect. 5. The contributions of this study are highlighted as follows:

Fig. 1

From image acquisition to detection results. (A) The image acquisition device, composed of a box, a light system, and an image acquisition system, which provides stable and consistent environmental conditions. (B) The acquired medicinal plant images of different types. (C) The detected images with bounding boxes

(1) A self-developed device is used to acquire our dataset, which is the first publicly available dataset of medicinal fruit plants.

(2) Compared with previous works, the proposed method addresses the limitations of MAE in extracting global features and reduces information loss. By introducing a new pre-training paradigm that integrates self-supervised and supervised label information, it mitigates model overfitting to imbalanced data and enhances adaptability.

(3) In response to the characteristics of the dataset, a novel random data augmentation method is proposed that strengthens the model's focus on edge regions and its feature extraction by randomly adding shadows to local areas.

(4) Extensive experiments are performed on our dataset as well as on public datasets. The experimental results show that our model achieves the highest accuracy among state-of-the-art models and has excellent practical value for plant-identification technology.

Materials

Sample preparation

All samples are obtained from the Lotus Pond medicinal market in Chengdu. Our collection contains 14 different types of samples as well as their derived products. These samples are certified by experts from the Chengdu Institute of Food and Drug Control (Chengdu, China). The dry samples are derived from intact samples and are stored in ordinary cold storage.

Data acquisition

A self-developed high-resolution data acquisition device (Canon EOS 60D) is used to acquire images as shown in Fig. 2A. The device is composed of a box, a light system, and an image acquisition system, which can provide stable and consistent environmental conditions. The image acquisition process is illustrated in Fig. 2.

Fig. 2

The dataset consists of 14 different CMPs and their processed products, namely (A) chaoshanzha, (B) jiaoshanzha, (C) shanzhatan, (D) jiangbanxia, (E) lubei, (F) qingbei, (G) songbei, (H) fabanxia, (I) shengbanxia, (J) jingbanxia, (K) shuibanxia, (L) jiangnanxing, (M) shanzha, (N) qingbanxia

The box is made of wood and has a reflective gray coating with a reflectivity of 18%. PHILIPS Graphical TL-D lights with a color temperature of 5000 K are used in the light system. Four light tubes and scattering plates are utilized to eliminate any shadowing during image capture. All images are captured with a 35 mm CMOS sensor at a resolution of 5120 × 3840, as shown in Fig. 2B. Images are annotated and cropped to obtain the target (Fig. 2C), while incomplete, blurry, and otherwise inappropriate images are removed. Our dataset is shown in Fig. 3.

Fig. 3

The distribution of the number of images within each CMP in our dataset. The blue represents the raw samples, while the orange is the collected original data

Shanzha is a medicinal and edible plant that is commonly applied in clinical practice as slices. In our dataset, there are four varieties from the same origin: shanzha, chaoshanzha, jiaoshanzha, and shanzhatan. They are obtained by frying sliced shanzha at different temperatures; for example, chaoshanzha is fried at 100 ℃, jiaoshanzha at 150 ℃, and shanzhatan at 200 ℃. As the frying temperature changes, both the morphology and color alter, leading to variations in pharmacological effects. Similarly, jiangbanxia, fabanxia, qingbanxia, and jingbanxia come from the same origin but are obtained from mature harvested banxia by different processing methods. Specifically, qingbanxia is obtained by purifying banxia, jiangbanxia is made by mixing ginger juice with banxia, and fabanxia is obtained by soaking banxia in licorice lime liquid. Additionally, jingbanxia is a highly valuable medicinal plant prepared by mixing banxia with various adjuvants. Jiangnanxing is a processed product derived from Tiger's Paw Southern Star and has completely different medicinal effects from banxia. Shuibanxia, on the other hand, has a different origin and different effects from banxia. Furthermore, lubeimu, qingbeimu, and songbeimu are three different species of chuanbeimu; they have different market values owing to their differences in morphology and color.

We now explain the different morphologies and color changes in our dataset. According to the properties of the images, all data are processed to remove redundant pixels that contain no information. During data collection, we capture multiple images of the same plant sample from different angles to enrich the diversity of the data. We then compile the specific quantity of each medicinal plant; the distribution of the original dataset is shown in Fig. 4, where blue represents the raw samples and orange represents the collected original data.

Fig. 4

The overview of our identification model

Methodology

Overview architecture

Our framework for CMP classification is shown in Fig. 1. The model has three parts: (A) encoder, (B) decoder, and (C) classification branch. Specifically, we use ViT to extract global features from the images and MBConv to reduce the number of parameters and improve the learning ability. The encoder is thus dedicated to learning the structural knowledge of images by incorporating MBConv and ViT: the patches and masks are processed to reconstruct the original images, harnessing the potential of ViT in capturing essential information. A parallel supervised classification branch is further introduced to complement the integration of global features within MAE. Lastly, the decoder predicts the features of the masked regions. As a result, the model accomplishes image classification.

Taking advantage of the sparsity of images and the learning ability of MAE, a combination of Transformer and MBConv is used to extract local deep features, and the loss is designed to be computed over all patches. Moreover, random masking generates diverse data, which provides a powerful regularization effect during supervised pre-training.

Random data enhancement

We first use Grad-CAM [38] to analyze which parts of the image are most important for our model; the heatmap is illustrated in Fig. 5. From the heatmap we observe that the model focuses mainly on image edges, with limited attention to other areas. Based on this observation, we propose a random data enhancement method that improves the feature representation by selectively augmenting under-represented minority images through random cropping and random shadowing.
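For reference, a minimal hook-based Grad-CAM sketch in PyTorch is given below. The model, target layer, and the assumption of a spatial feature map of shape (B, C, H', W') are placeholders for illustration, not the exact configuration used in this study.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx=None):
    """Compute a Grad-CAM heatmap for one image tensor of shape (C, H, W).

    `model` and `target_layer` are placeholders; any backbone exposing a
    spatial feature map of shape (B, C, H', W') at `target_layer` will work.
    """
    feats, grads = {}, {}

    def fwd_hook(_, __, output):
        feats["value"] = output                      # save forward activations

    def bwd_hook(_, grad_in, grad_out):
        grads["value"] = grad_out[0]                 # save gradients w.r.t. activations

    h1 = target_layer.register_forward_hook(fwd_hook)
    h2 = target_layer.register_full_backward_hook(bwd_hook)

    logits = model(image.unsqueeze(0))               # (1, num_classes)
    if class_idx is None:
        class_idx = logits.argmax(dim=1).item()
    model.zero_grad()
    logits[0, class_idx].backward()                  # backprop the class score

    h1.remove(); h2.remove()

    weights = grads["value"].mean(dim=(2, 3), keepdim=True)   # channel weights (GAP of grads)
    cam = F.relu((weights * feats["value"]).sum(dim=1))       # (1, H', W')
    cam = F.interpolate(cam.unsqueeze(1), size=image.shape[1:],
                        mode="bilinear", align_corners=False)[0, 0]
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1]
    return cam
```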

Fig. 5

The Grad-CAM heatmaps based on MAE. The first and third rows display the original images, while the second and fourth rows show the Grad-CAM heatmap results. The heatmaps indicate where the model focuses

Random shadow augmentation

As shown in Fig. 6, when an input image is processed, a random value \(p\) is generated within the range 0 to 1. If \(p\) is less than \(dark\_rate\), a random rectangular region is selected and the values of its RGB channels are decreased to create a shadow; otherwise, the original image is kept.

Fig. 6

In the processing of random shadow enhancement, \(p\) is a random value between 0 and 1, and \(dark\_rate\) is the probability of adding a shadow

The shadow area \({D}_{rect}\) is darkened as:

$$X\left(i,j,c\right)=x\left(i,j,c\right)-shadow,\quad \left(i,j\right)\in {D}_{rect},\ c\in \left\{0,1,2\right\}$$
(1)

where \(x\left(i,j,c\right)\) is the RGB value of channel \(c\) at position \(\left(i,j\right)\) within the area, \(shadow\) is the shadow intensity level, and \(X\left(i,j,c\right)\) is the RGB value after shadow darkening.
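A minimal NumPy sketch of the shadow operation in Eq. (1) follows. The parameter names mirror those in the text (shadow size, shadow intensity, dark rate); the default values and the uniform sampling of the rectangle location are illustrative assumptions.

```python
import numpy as np

def random_shadow(image, shadow_size=32, shadow=30, dark_rate=0.3, rng=None):
    """Apply Eq. (1): darken a random rectangle with probability dark_rate.

    image: uint8 array of shape (H, W, 3); returns a new uint8 array.
    """
    rng = rng or np.random.default_rng()
    if rng.random() >= dark_rate:                 # keep the original image
        return image
    h, w = image.shape[:2]
    top = rng.integers(0, max(h - shadow_size, 1))
    left = rng.integers(0, max(w - shadow_size, 1))
    out = image.astype(np.int16)
    # X(i, j, c) = x(i, j, c) - shadow for (i, j) in D_rect, c in {0, 1, 2}
    out[top:top + shadow_size, left:left + shadow_size, :] -= shadow
    return np.clip(out, 0, 255).astype(np.uint8)
```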

Random crop augmentation

Simultaneously, a random local enhancement method is used for data preprocessing in this study. For each class, the proportion \(A\) is calculated, and \(1-A\) is used as the threshold. A random point and a random length are then selected, and the corresponding local region is cropped, as given in Formula 2.

$$\left\{\begin{array}{l}\gamma =1+\left(1-A\right)\times d\\ \gamma =1-\left(1-A\right)\times d\end{array}\right.$$
(2)

where \(d\) represents the Euclidean distance from the center, \(d\in [0, 112]\). The threshold for random cropping is higher for classes with fewer samples, so more local information is captured for them. Moreover, images are further enhanced by random rotation and flipping. The results of data augmentation are shown in Fig. 7.
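A minimal sketch of class-frequency-aware random cropping is shown below. The use of 1 − A (the class proportion) as the crop probability threshold follows the text, while the crop-location sampling and the fixed crop size are assumptions; resizing the crop back to the input resolution is assumed to happen downstream.

```python
import numpy as np

def random_class_aware_crop(image, class_proportion, crop_size=128, rng=None):
    """Crop a random local region more often for minority classes.

    class_proportion (A): fraction of the dataset belonging to this image's
    class; 1 - A is the crop probability threshold, so rarer classes are
    augmented more aggressively. Crop size and sampling are assumptions.
    """
    rng = rng or np.random.default_rng()
    if rng.random() >= (1.0 - class_proportion):  # majority classes mostly skipped
        return image
    h, w = image.shape[:2]
    top = rng.integers(0, max(h - crop_size, 1))
    left = rng.integers(0, max(w - crop_size, 1))
    return image[top:top + crop_size, left:left + crop_size]
```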

Fig. 7

Partial results of data augmentation. Each row shows randomly cropped data of a different class, namely shanzha, qingbanxia, jingbanxia, and jiangbanxia, respectively

Nonlinear transform of self-attention

Generally, an image is denoted as \(X\in {\mathbb{R}}^{h\times w\times C}\) and is divided into \(N=h\times w/{P}^{2}\) non-overlapping patches.

$$X=\left\{{x}^{1},{x}^{2}\dots {x}^{n}\right\}$$
(3)

where \({x}^{n}\in {\mathbb{R}}^{{P}^{2}C}\) is the vector of a patch and \(P\) is the patch resolution. Each patch is projected into a 1D token embedding. Then \({N}_{m}\) patches are randomly masked and the remaining \({N}_{v}\) patches are visible, with \(N={N}_{m}+{N}_{v}\). \({X}_{v}=\left\{\left.{x}^{k}\right|k\notin M\right\}\) is defined as the set of visible patches and \({X}_{m}=\left\{\left.{x}^{k}\right|k\in M\right\}\) as the set of masked patches, where \(M\) is the index set of randomly masked patches. Thus,

$$X={X}_{m}\cup {X}_{v}, {X}_{m}\cap {X}_{v}=\varnothing$$
(4)

In this study, each \(224\times 224\) image is divided into a \(14\times 14\) grid of blocks, where each block has a size of \(16\times 16\). Each visible patch is projected into an embedding, and the positional embedding \({E}_{pos}\) is added to encode the position of each patch.

$$z=\left[{x}_{cls},{x}_{p}^{1}E,{x}_{p}^{2}E\dots .,{x}_{p}^{N}E\right]+{E}_{pos}$$
(5)

Then it is processed by self-attention; the scaled dot-product attention is used to obtain \(Z\in {\mathbb{R}}^{d\times d}\).

$$Z=Attn\left(z\right)=Softmax\left(Q{K}^{T}/\sqrt{w}\right)V$$
(6)

the \(Softmax\) attention \(Attn(\cdot)\) with a global receptive field works as the following nonlinear mapping:

$$y{\prime }=LN(Z+FFN\left(LN\left(Z\right)\right))$$
(7)

where \(LN(\cdot)\) is Layer Normalization, which is essentially a learnable column scaling with a shift, and \(FFN(\cdot)\) is a standard two-layer feed-forward network applied to the embedding of each patch. For the scaled dot-product attention in Eq. (6), the \(j\)th element of the \(i\)th row of \(Z\) is obtained as in Formula 8.

$${Z}_{i}^{j}=\frac{{e}^{{\left(Q{K}^{T}/\sqrt{w}\right)}_{ij}}}{\sum _{j=1}^{h}{e}^{{\left(Q{K}^{T}/\sqrt{w}\right)}_{ij}}},\qquad {Z}_{i}=Softmax\left({q}_{i}{K}^{T}/\sqrt{w}\right)V$$
(8)

From Formula 7, the representation space of an encoder layer in MAE is spanned by the row space of \(V\) and is nonlinearly updated layer by layer. The embedding of each patch serves as a basis vector of the representation space of the current attention block.

Compared with CNNs, the global self-attention mechanism ignores some local information in images, especially fine-grained features. Thus, \({y}^{{\prime }}\) is processed by a depth-wise convolution to obtain deep local details,

$$y=DepthConv\left({y}^{{\prime }}\right)$$
(9)

A CNN acts at the pixel level and is locally supported, and therefore has a small receptive field. MAE is globally supported, which means it can effectively learn the interactions between far-away patches. The Transformer can aggregate coarse-grained features and expand the receptive field of the convolutional blocks. Therefore, the hybrid structure exhibits superior performance.
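A minimal PyTorch sketch of the encoder layer described by Eqs. (5)-(9) is given below. The embedding dimension, head count, patch-grid size, and the exact placement of the depth-wise convolution are illustrative assumptions rather than the authors' implementation; the sketch also omits the class token for simplicity.

```python
import torch
import torch.nn as nn

class HybridEncoderBlock(nn.Module):
    """Self-attention (Eqs. 6-8) and FFN (Eq. 7), followed by a
    depth-wise convolution over the patch grid (Eq. 9)."""

    def __init__(self, dim=768, num_heads=12, grid=14):
        super().__init__()
        self.grid = grid
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        # depth-wise conv: one 3x3 filter per channel (groups=dim)
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, z):                         # z: (B, N, dim), N = grid * grid
        attn_out, _ = self.attn(z, z, z)          # Z = Attn(z), Eq. (6)
        y = self.norm2(attn_out + self.ffn(self.norm1(attn_out)))  # y' per Eq. (7)
        B, N, D = y.shape
        y2d = y.transpose(1, 2).reshape(B, D, self.grid, self.grid)
        y2d = self.dwconv(y2d)                    # Eq. (9): local fine-grained details
        return y2d.flatten(2).transpose(1, 2)     # back to (B, N, dim)
```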

Supervised branch

The mask token is a learnable vector shared by all masked patches, and it is appended to the unshuffled representation of the unmasked patches. Let \({N}_{m}\in {\mathbb{R}}^{1\times 1\times d}\) be the learned mask token embedding, and let W and U be the index sets of the masked and unmasked patches, respectively. The affine maps then generate \(\left\{{Q}^{{\prime }},{K}^{{\prime }},{V}^{{\prime }}\right\}\).

$$\left\| \sum _{j=1}^{n}Attn\left({Q}_{i},{K}_{j}\right){V}_{i}-\sum _{j\in U}Attn\left({Q}_{i}^{{\prime }},{K}_{j}^{{\prime }}\right){V}_{i}^{{\prime }}\right\| <C{n}^{-1}$$
(10)

where \(Attn(\cdot)\) denotes the attention kernel, which maps each patch's embedding, represented by the rows of Q and K, to a measure of how the patches interact. This shows that the network interpolates the representation using global information from the embeddings learned by the MAE encoder, not just from the nearby patches. For the embedding of a masked patch \(i\in \text{W}\), let \({v}_{i}^{t}\) be the input from the encoder and \({v}_{i}^{t+1}\) the output embedding of a decoder layer; then \({v}_{i}^{t+1}\) is computed as:

$${v}_{i}^{t+1}=\sum _{j\in U}{a}_{j}{v}_{j}^{t}$$
(11)

where \({a}_{j}({v}_{{i}_{1}}\dots {v}_{{i}_{k}})\) is a set of weights based on the unmasked patches, \(U=\left\{{i}_{1}\dots {i}_{k}\right\}\). This shows that the latent representations of the masked patches are interpolated globally, based on an inter-patch topology learned by the attention mechanism. To better learn the feature representations of the data, supervised label information is added. At the same time, we introduce a regularization term through the supervised branch to help prevent the model from overfitting to imbalanced data and to improve its generalization ability.
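The parallel supervised branch can be attached to an MAE-style encoder as sketched below. The encoder and decoder modules, the pooling choice, and the forward signature are assumptions used only to illustrate how reconstruction and classification outputs are produced in one pass.

```python
import torch
import torch.nn as nn

class HybridPretrainModel(nn.Module):
    """MAE-style reconstruction plus a parallel supervised classification branch."""

    def __init__(self, encoder, decoder, dim=768, num_classes=14):
        super().__init__()
        self.encoder = encoder          # placeholder: operates on visible patches only
        self.decoder = decoder          # placeholder: reconstructs the masked patches
        self.cls_head = nn.Linear(dim, num_classes)

    def forward(self, visible_tokens, mask_indices):
        latent = self.encoder(visible_tokens)         # (B, N_v, dim)
        recon = self.decoder(latent, mask_indices)    # predicted pixels of masked patches
        logits = self.cls_head(latent.mean(dim=1))    # global average pooling + classifier
        return recon, logits
```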

Loss functions

We optimize the reconstruction loss and the classification loss simultaneously. The reconstruction loss quantifies the disparity between the input data and the model's reconstructed output; it incentivizes the model to acquire meaningful representations by penalizing inconsistencies between the original input and the reconstruction. The classification loss quantifies the disparity between the predicted labels and the ground-truth labels; its goal is to prevent the model from overfitting to imbalanced data and to improve generalization. The overall loss is:

$$Loss={Loss}_{MSE}+{Loss}_{CLS}$$
(12)
$${Loss}_{MSE}=\frac{1}{M}\sum _{0}^{m}{\left(y-x\right)}^{2}$$
(13)

According to the characteristics of the dataset, label smoothing [38] is selected as the classification loss function:

$${Loss}_{LS}=-\sum _{i}^{n}y\left(i\right)\text{log}\left(p\left({x}_{i}\right)\right)$$
(14)
$$y\left(i\right)=\left\{\begin{array}{ll}\frac{\varepsilon }{n}, & i\ne target\\ 1-\varepsilon +\frac{\varepsilon }{n}, & i=target\end{array}\right.$$
(15)

The penalty factor ε is introduced to emphasize the importance of low-probability distributions. It is therefore used to address overfitting and insufficient supervision, and ε is set to 0.25.
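A minimal sketch of the combined objective in Eqs. (12)-(15) is given below, with the smoothing factor ε = 0.25 as stated; the class count of 14 and the mean reduction of the reconstruction term are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReconClsLoss(nn.Module):
    """Loss = Loss_MSE + Loss_LS (Eqs. 12-15) with smoothing eps = 0.25."""

    def __init__(self, eps=0.25, num_classes=14):
        super().__init__()
        self.eps = eps
        self.num_classes = num_classes

    def forward(self, recon, target_pixels, logits, labels):
        loss_mse = F.mse_loss(recon, target_pixels)             # Eq. (13)
        log_p = F.log_softmax(logits, dim=-1)
        # Eq. (15): eps/n for off-target classes, 1 - eps + eps/n for the target
        smooth = torch.full_like(log_p, self.eps / self.num_classes)
        smooth.scatter_(1, labels.unsqueeze(1),
                        1.0 - self.eps + self.eps / self.num_classes)
        loss_ls = -(smooth * log_p).sum(dim=-1).mean()          # Eq. (14)
        return loss_mse + loss_ls                               # Eq. (12)
```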

Results and discussions

Training parameters

In this study, the model is optimized with the AdamW [39] algorithm. The initial learning rate is 1e-3, and the learning rate decay strategy is StepLR [40]. The batch size is set to 32 and the gamma to 0.1. The experiments are based on PyTorch 1.8.1 and Python 3.9, and the model is trained on an Nvidia 2080Ti GPU with 11 GB of memory. The final pre-trained model is obtained after 400 epochs. For fine-tuning, the initial learning rate is set to 1e-3 and the learning rate decay strategy is cosine annealing. The input image size is \(224\times 224\), the batch size is set to 32, and the final model is obtained after 200 epochs.
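The optimization setup described above can be sketched as follows. The StepLR step size and the `model`, `loader`, and `train_one_epoch` names are placeholders, since those details are not reported in the text.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import StepLR, CosineAnnealingLR

# Pre-training: AdamW, lr = 1e-3, StepLR decay with gamma = 0.1.
# The step size is not reported in the text; 100 epochs is a placeholder.
optimizer = AdamW(model.parameters(), lr=1e-3)
scheduler = StepLR(optimizer, step_size=100, gamma=0.1)

for epoch in range(400):
    train_one_epoch(model, loader, optimizer)   # placeholder training loop
    scheduler.step()

# Fine-tuning: lr = 1e-3 with cosine annealing over 200 epochs.
ft_optimizer = AdamW(model.parameters(), lr=1e-3)
ft_scheduler = CosineAnnealingLR(ft_optimizer, T_max=200)
```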

Random data enhancement

Random shadow augmentation

To find suitable parameters for random shadow augmentation, experiments are performed with four different shadow sizes (16, 32, 64, 128), three levels of shadow intensity (20, 30, 40), and four different dark rates (0.1, 0.2, 0.3, 0.4). For fairness, the remaining parameters are kept unchanged. The experimental results are shown in Table 1.

Table 1 Random shadow augmentation experiment

The experimental results reveal that an excessively large shadow size and low brightness have a detrimental impact on model performance. Further analysis shows that only a portion of the data is affected by shadows. With a higher dark rate, most of the training data becomes shadow-affected, resulting in excessive shadow processing, whereas the testing set contains fewer shadow-affected images, leading to a decrease in accuracy. The optimal results are attained with a shadow size of 32, a shadow intensity of 30, and a dark rate of 0.3. With these settings, 1000 images are added to each class.

Random crop augmentation

Similarly, to find suitable parameters for random crop augmentation, experiments are performed with four different crop sizes (16, 32, 64, 128). The experimental results are shown in Table 2.

Table 2 Random crop augmentation experiment

The experimental results show that small crop sizes reduce the identification performance of the model, since only limited features are learned from small crops. The optimal results are attained with a crop size of 128. According to the size of the original data of each class, the amounts of data added by random crop augmentation are listed in Table 3.

Table 3 The data for random crop augmentation

Evaluation of identification performance

We split the data into three parts: 70% as the training set, 15% as the testing set, and the remaining 15% as the validation set. To measure the identification performance of our model, we select four metrics: Precision, Recall, Specificity, and F1 Score [41, 42].

$$\text{Precision}=\frac{TP}{TP+FP}$$
(16)
$$\text{Recall}=\frac{TP}{TP+FN}$$
(17)
$$\text{Specificity}=\frac{TN}{FP+TN}$$
(18)
$$\text{F1 Score}=\frac{2\times \left(\text{Precision}\times \text{Recall}\right)}{\text{Precision}+\text{Recall}}$$
(19)

where TN is the number of True Negatives, TP the number of True Positives, FN the number of False Negatives, and FP the number of False Positives. The detailed results are shown in Table 4. Our method achieves satisfactory results in all four metrics across the different classes.
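A minimal sketch of computing Eqs. (16)-(19) per class from a confusion matrix is shown below; the one-vs-rest convention and the row/column orientation follow the description of Fig. 8.

```python
import numpy as np

def per_class_metrics(conf_mat):
    """Precision, Recall, Specificity, and F1 (Eqs. 16-19) per class,
    computed one-vs-rest from a confusion matrix with rows = true labels."""
    conf_mat = np.asarray(conf_mat, dtype=float)
    tp = np.diag(conf_mat)
    fp = conf_mat.sum(axis=0) - tp        # predicted as the class but actually others
    fn = conf_mat.sum(axis=1) - tp        # class samples predicted as other classes
    tn = conf_mat.sum() - tp - fp - fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (fp + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, specificity, f1
```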

Table 4 The Experimental classification results

Our model achieves excellent results. To further analyze its performance, we visualize the confusion matrix and ROC curves, as shown in Figs. 8 and 9, respectively.

Fig. 8

The confusion matrix of the experimental results. The numbers 0 to 13 correspond to the different classes; the columns represent the predicted labels and the rows the true labels. The value at each row-column position indicates the number of samples of the true class predicted as the corresponding class

Fig. 9

The Receiver Operating Characteristic (ROC) curves of the experimental results. The numbers 0 to 13 correspond to the different classes. Based on the confusion matrix, the ROC curve reflects the relationship between the True Positive Rate and the False Positive Rate; its values range between 0 and 1 (1 is best, 0 is worst)

The results are consistent with the classification results in Table 4. There are some errors among certain classes, especially between qingbanxia and jingbanxia. Both are processed from banxia by different processing methods, resulting in similar morphology and texture, with color being the most prominent distinction. Consequently, variations in angles and lighting conditions can impact visual differentiation.

Comparison with different models

Multiple ConvNets and state-of-the-art Transformer models are compared with ours to verify the significance of the proposed method. Focal loss is chosen as the loss function for all compared models, including VGG [43], ResNet [44], DenseNet [45], EfficientNet [46], etc. Moreover, to reflect the computational cost, the frames per second (FPS) and floating-point operations (FLOPs) are computed. The comparative experimental results are shown in Table 5.

Table 5 The Experimental classification results

As shown in Table 5, the proposed method achieves the highest Top-1 accuracy, while CoAtNet has the lowest Top-1 classification accuracy of 93.58%. Compared with MAE, ours improves accuracy by 2.09%. Notably, CoAtNet displays constraints in its feature-capturing capability, and ViT requires larger datasets because of its Transformer modules, so the discriminative efficacy of these two models falls short of the others. Ours exhibits a higher FPS than MAE, demonstrating a smaller computational cost. Compared with ViT and CNN models, ours has a lower FPS due to its increased computational demands: ViT typically requires more computational resources to process input images, including patch segmentation, patch embedding, and multi-layer Transformer modules, whereas CNN models leverage local connections and parameter sharing, leading to higher computational efficiency. Additionally, ViT often requires longer training times and more parameters to achieve optimal performance, which results in slower inference speeds. The confusion matrices of the different models are shown in Fig. 10.

Analysis of experimental results

Different modules comparison

We conduct an ablation experiment to prove the effectiveness of our model, that is, we compare the model performance when using different modules. For a fair comparison, we keep the remaining parameters and settings unchanged. The comparative results are shown in Table 6.

Table 6 The identification results of different ablation experiments
Fig. 10

The confusion matrices of the different models. (A) VGG (B) CoAtNet (C) DenseNet (D) EfficientNet (E) MobileNets (F) ResNet (G) ViT (H) MAE

The experimental results reveal that introducing convolution layers before the network enlarges the receptive field beyond the dimensions of the masked patches. Consequently, information leakage occurs, leading to a decrease in classification accuracy. Furthermore, it can be observed that the introduced classification branch yields a 1.69% improvement over MAE. During training, the classification loss is computed over all labels, not just the masked labels. Supervised learning thus enhances the integration of global features, and the ability to learn local-global features is strengthened. Additionally, the ablation results demonstrate significant improvements achieved through pre-trained weights.

Visualization of different models

To illustrate the differences between MAE and our model, we conduct another experiment and visualize the results with Grad-CAM heat maps. Through a comprehensive analysis of the activation distribution in the feature maps, we find that our model focuses more tightly on the relevant regions of the image. The heat maps of the original images are shown in Fig. 11. Meanwhile, to verify the influence of lighting and shadowing on the results, we select images with lighting and shadowing differences; the results are shown in Figs. 12, 13 and 14.

Fig. 11

The visualization of the different models on the original data. The highlighted areas of the CAM heatmap represent the regions the model considers most relevant to each class. The heat maps of each class are randomly selected. The first column shows the original image, the second the non-pretrained MAE, the third the pretrained MAE, and the last ours

Fig. 12

The visualization of the different models for different color backgrounds. The heat maps of each class are randomly selected

The second column shows the feature maps obtained without using pre-trained weights in MAE. The third column displays the feature maps of MAE fine-tuned from ImageNet pre-trained weights. The fourth column shows the heatmaps of the model proposed in this paper. Figure 11 compares the heat maps for the original images, Fig. 12 for different lightings, Fig. 13 for different types of images under shadowing, and Fig. 14 for various models under different reflectance and colors. The various methods exhibit diverse focal points within the images: MAE tends to concentrate on less pertinent regions around the target, with restricted attention, whereas our approach centers on the target itself, covering a wider area with heightened intensity. For different color backgrounds, different lighting and shadowing, and different reflectance, our model still pays significantly more attention to the target; consequently, it attains higher accuracy. Furthermore, adopting the self-supervised pretrain-finetune training strategy effectively boosts accuracy and reinforces the generalization of the model.

Fig. 13

The visualization of the different models for different lighting and shadowing. The heat maps of each class are randomly selected

Fig. 14

The visualization of the different models for different reflectance. The heat maps of each class are randomly selected

Comparison of different iterations

To investigate the influence of the number of iterations, we examine the convergence of the model under different iteration counts. The experimental results are shown in Fig. 15.

Fig. 15

The experimental results of different iterations

Figure 15 shows that our model has a quicker convergence speed and achieves a higher accuracy of 98.73% by the 175th epoch. In comparison, MAE achieves a lower accuracy of 96.64% after 400 epochs and 95.58% at 200 epochs. Ours includes a supervised classification branch, making it relatively easier to saturate the pre-trained model. Additionally, our method keeps all the hyperparameters of MAE while introducing additional branches, thereby contributing to enhanced convergence speed and training accuracy.

Different optimizer comparison

Different optimization algorithms [49,50,51,52,53] may affect the speed of convergence and lead the model to converge at different local minima. Following the experience of existing papers, we select AdamW [39] as our optimizer. To further explore the effect of the optimizer, we conduct an experiment with different optimizers, including Adam and SGD. The comparative experimental results are shown in Fig. 16.

Fig. 16

The comparison of loss and accuracy curves for different optimizers

From the changes in loss and accuracy shown in Fig. 16, the convergence speed and generalization performance of AdamW are significantly superior to those of the other two optimizers. AdamW introduces weight decay, which helps prevent overfitting by encouraging the model to use smaller parameter values and thus generalize better to unseen data. Additionally, the weight decay is decoupled from the parameter update process, which enhances optimization stability and convergence.

Different parameter selections

We conduct comparative experiments with different batch sizes and learning rates. By adjusting these input hyperparameters, we evaluate their influence on the results. The experimental results are shown in Tables 7 and 8.

Table 7 The identification results for different batch sizes
Table 8 The identification results for different learning rates

Considering the GPU memory, the batch sizes are set to 8, 16, and 32. For fairness, the remaining parameters remain unchanged. From Table 7, it is interesting to see that batch sizes of 16 and 32 give the same accuracy. With a batch size of 8, the accuracy is the highest (higher by 0.2%), but the training time is the longest. Therefore, to balance training speed, generalization ability, and memory consumption, we ultimately choose a batch size of 32.

Similarly, to measure the impact of the learning rate, we select learning rates of 1e-3, 1e-4, and 1e-5; the results are shown in Table 8. As we can see, a small learning rate leads to slow convergence and therefore the lowest accuracy at the same epoch, whereas the larger learning rate of 1e-3 gives higher accuracy.

Different datasets and performance trade-offs

Chinese medicinal blossom dataset

The blossom images of traditional Chinese medicinal herbs were collected via Google search and divided into 12 categories: (1) syringa, (2) bombax malabarica, (3) michelia alba, (4) armeniaca mume, (5) albizia julibrissin, (6) pinus massoniana, (7) eriobotrya japonica, (8) styphnolobium japonicum, (9) prunus persica, (10) firmiana simplex, (11) ficus religiosa and (12) areca catechu. The total number of images is 12,538 [54]. The comparative results based on our model are shown in Table 9.

Table 9 The experimental classification results based on chinese medicinal blossom

From the comparison, we can see that our method exhibits better classification accuracy than MAE and maintains the highest accuracy among the mainstream methods.

Medicinal leaf dataset

This dataset comprises 30 different species of medicinal herbs, including Santalum album, Muntingia calabura, Plectranthus amboinicus, Brassica juncea, etc. [55]. Each species consists of 60 to 100 high-resolution images. The classification results obtained by our model are shown in Table 10.

Table 10 The Experimental classification results based on Medicinal Leaf

The results show that traditional CNN methods have limited classification performance on this dataset. In contrast, our method demonstrates a clear advantage, surpassing MAE by 0.54%.

Conclusion

CMPs have been practiced and refined over thousands of years for both health protection and clinical treatment in China. However, confusion caused by different processing conditions and cultivation environments has been reported to affect clinical safety and medication efficacy. Physicochemical and biological methods demand a high level of expertise and are inefficient, and manual identification is cumbersome and time-consuming. Thus, visual feature-based approaches are attracting increasing interest for their advantages of being fast, accurate, and non-invasive. In this paper, a visual multi-variety CMPs image dataset is constructed. A random local data enhancement preprocessing method is proposed to enrich the feature representation of imbalanced data through random cropping and random shadowing. A novel hybrid supervised pre-training network is proposed that extends MAE with a parallel classification branch to improve the integration of global features; by combining global features and local details, it effectively enhances the feature capture capability. In addition, newly designed losses, based on a reconstruction loss and a classification loss, are introduced to improve training efficiency and learning capacity. Extensive experiments are performed on our dataset as well as public datasets. The experimental results demonstrate that our method achieves the best accuracy of 98.73%, which is superior to the state-of-the-art methods. Our model can transfer massive general knowledge to enhance feature capture capabilities and to address the challenges of overfitting and end-to-end training difficulties in deep learning-based CMP identification. Moreover, it holds significant real-world application value and benefits the development of accurate identification of medicinal plants.

Data availability

The original dataset in the study is released on GitHub (https://github.com/Tanchaoqun123/CHMs).

References

  1. Ministry of Public Health of the People's Republic of China. Pharmacopoeia of the People's Republic of China, part 1. Beijing: China Pharmaceutical Technology Press; 2020.


  2. Han K, Wang M, Zhang L, Wang CY. Application of Molecular methods in the identification of ingredients in Chinese Herbal Medicines. Molecules. 2018;23:2728.


  3. Xiong C, Sun W, Li JJ, Yao H, Shi YH, Wang P, et al. Identifying the species of seeds in Traditional Chinese Medicine using DNA barcoding. Front Pharmacol. 2018;9:701.


  4. Li C, Jia WW, Yang JL, Cheng C, Olaleye OE. Multi-compound and drug-combination pharmacokinetic research on Chinese herbal medicines. Acta Pharmacol Sin. 2022;43(12):3080–95.


  5. Capodice JL, Chubak BM. Traditional Chinese herbal medicine-potential therapeutic application for the treatment of COVID-19. Chin Med-UK. 2021;16(1):24.


  6. Zhang HT, Huang MX, Liu X, Zheng XC, Li XH, Chen GQ, et al. Evaluation of the adjuvant efficacy of natural Herbal Medicine on COVID-19: a Retrospective Matched Case-Control Study. Am J Chin Med. 2020;48(4):779–92.


  7. Zhang LY, Yu JR, Zhou YW, Shen MH, Sun LT. Becoming a Faithful Defender: traditional Chinese medicine against Coronavirus Disease 2019 (COVID-19). Am J Chin Med. 2020;48(4):763–77.


  8. Zhao F, Long SM, Zhang YY, Wang XK, Ye JS, Zhang Y. Fingerprint data extraction from Chinese herbal medicines with terahertz spectrum based on second-order harmonic oscillator model. Acta Phys Sin-Ch Ed. 2015;64(2):024202.


  9. Leong F, Hua X, Wang M, Chen TK, Song YL, Tu PF, et al. The quality standard of traditional Chinese medicines: comparison between European Pharmacopoeia and Chinese Pharmacopoeia and recent advances. Chin Med. 2020;15(1):76.


  10. Han Y, Sun H, Zhang AH, Yan GL, Wang XJ. Chinmedomics, a new strategy for evaluating the therapeutic efficacy of herbal medicines. Pharmacol Therapeut. 2020;216:107680.


  11. Wang Y, Liu SY. Recent application of direct analysis in real time mass spectrometry in plant materials analysis with emphasis on traditional Chinese herbal medicine. Mass Spectrom Rev. 2023.

  12. Yin FZ, Li L, Chen Y, Lu TL, Li WD, et al. Quality control of processed Crataegi Fructus and its medicinal parts by ultra-high-performance liquid chromatography with electrospray ionization tandem mass spectrometry. J Sep Sci. 2015;38:2630–9.


  13. Wei GH, Jia RH, Kong ZY, Ji CJ, Wang ZG. Cold-hot nature identification of Chinese herbal medicines based on the similarity of HPLC fingerprints. Front Chem. 2022;10:1002062.


  14. Yang CW, Chen SM, Fu OY, Yang IC, Tsai CY. A robust identification model for Herbal Medicine using Near Infrared Spectroscopy and Artificial neural network. J Food Drug Anal. 2011;19(1):9–17.


  15. Chen LD, Lv DY, Wang DY, Chen XF, Zhu ZY, et al. A novel strategy of profiling the mechanism of herbal medicines by combining network pharmacology with plasma concentration determination and affinity constant measurement. Mol Biosyst. 2016;12(11):3347–56.


  16. Wang TS, Chao YP, Yin FZ, Yang XC, Hu CJ, Hu KF. An E-nose and Convolution Neural Network based Recognition Method for Processed products of Crataegi Fructus. Comb Chem High T Scr. 2021;24(7):921–32.


  17. Fei CH, Ren CC, Wang YL, Li L, Li WD, Yin FZ, et al. Identification of the raw and processed Crataegi Fructus based on the electronic nose coupled with chemometric methods. Sci Rep-UK. 2021;11:1849.


  18. Yang SL, Xie SP, Xu M, Zhang C, Wu N, Yang J, et al. A novel method for rapid discrimination of bulbus of Fritillaria by using electronic nose and electronic tongue technology. Anal Methods-UK. 2015;7:943–52.


  19. Li MY, Jiang ZK, Shen W, Liu HT. Deep learning in bladder cancer imaging: a review. Front Oncol. 2022;12:930917.


  20. Estrada-Pérez VL, Pradana-López S, Pérez-Calabuig MA, Mena LM, Cancilla CJ, Torrecilla SJ. Thermal imaging of rice grains and flours to design convolutional systems to ensure quality and safety. Food Control. 2020;121:107572.


  21. Tian L, Tu ZG, Zhang DJ, Liu J, Li B, Yuan J. Unsupervised learning of Optical Flow with CNN-based Non-local Filtering. IEEE T Image Process. 2022;29:8429–42.


  22. Vu QD, Graham S, Kurc T, Nhat To MN, Shaban M, Qaiser T, et al. Methods for segmentation and classification of Digital Microscopy Tissue Images. Front Bioeng Biotech. 2019;7:53.


  23. Tan CQ, Wu C, Huang YL, Wu CJ, Chen H. Identification of different species of Zanthoxyli Pericarpium based on convolution neural network. PLoS ONE. 2020;15:e0230287.


  24. Zhou DR, Yu Y, Hu RW, Li Z. Discrimination of Tetrastigma hemsleyanum according to geographical origin by near-infrared spectroscopy combined with a deep learning approach. Spectrochim Acta A. 2020;238:118380.


  25. Wang YY, Xiong F, Zhang Y, Wang SM, Yuan YW, Lu CC, et al. Application of hyperspectral imaging assisted with integrated deep learning approaches in identifying geographical origins and predicting nutrient contents of Coix seeds. Food Chem. 2023;404:134503.


  26. Ding R, Luo J, Wang C, et al. Identifying and mapping individual medicinal plant Lamiophlomis rotata at high elevations by using unmanned aerial vehicles and deep learning. Plant Methods. 2023;19:38.


  27. Bai YH, Xiong YJ, Huang JC, Zhou J, Zhang BH. Accurate prediction of soluble solid content of apples from multiple geographical regions by combining deep learning with spectral fingerprint features. Postharvest Biol Tec. 2019;156:110943.


  28. Yan TY, Duan L, Chen XP, Gao P, Xu W. Application and interpretation of deep learning methods for the geographical origin identification of Radix Glycyrrhizae using hyperspectral imaging. RSC Adv. 2021;10(68):41936–45.


  29. Yue JQ, Huang HY, Wang YZ. Extended application of deep learning combined with 2DCOS: study on origin identification in the medicinal plant of Paris polyphylla var. Yunnanensis. Phytochem Anal. 2021;33(1):136–50.


  30. Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, et al. Generative adversarial nets. Advances in neural information processing systems (NIPS). MIT Press; 2014.

  31. Zhu JY, Park T, Isola P, Efros AA. Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. Proceedings of the IEEE International Conference on Computer Vision (ICCV). 2017; 2223–2232.

  32. Liu Q, Zhang LJ, Liu XP. Microscopic Image Segmentation of Chinese Herbal Medicine Based on Region Growing Algorithm. 2nd International Conference on Computer and Information Applications (ICCIA). 2012; 1133–1137.

  33. Li TH, Sun FY, Sun RY, Wang L, Li M, Yang H. Chinese Herbal Medicine Classification Using Convolutional Neural Network with Multiscale Images and Data Augmentation. In International Conference on Security, Pattern Analysis, and Cybernetics, 2018; 109 – 13.

  34. Ding R, Yu LH, Wang CH, Zhong SH, Gu R. Quality assessment of traditional Chinese medicine based on data fusion combined with machine learning: a review. Crit Rev Anal Chem. 2023; 3.

  35. He KM, Chen XL, Xie SN, Li YH, Dollár P, Girshick R. Masked Autoencoders Are Scalable Vision Learners. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2022; 16000–16009.

  36. Chollet F. Xception: Deep Learning with Depthwise Separable Convolutions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017; 1251–1258.

  37. Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai XH, Unterthiner T et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR). 2021.

  38. Rezatofighi H, Tsoi N, Gwak JY, Sadeghian A, Reid I, Savarese S. Generalized Intersection Over Union: A Metric and a Loss for Bounding Box Regression. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2019; 658–666.

  39. Loshchilov I, Hutter F. Decoupled weight decay regularization. 2019 International Conference on Learning Representations (ICLR). 2019; 1–19.

  40. Loshchilov I, Hutter F. SGDR: Stochastic Gradient Descent with Warm Restarts. 2017 International Conference on Learning Representations (ICLR). 2017.

  41. Yang Y, Wang W, Zhuang H, Yoon SC, Bowker B, Jiang HZ, et al. Evaluation of broiler breast fillets with the woody breast condition using expressible fluid measurement combined with deep learning algorithm. J Food Eng. 2021;288:110133.


  42. Ye J, Yu Z, Wang Y, et al. WheatLFANet: in-field detection and counting of wheat heads with high-real-time global regression network. Plant Methods. 2023;19:103.


  43. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations (ICLR). 2015.

  44. He KM, Zhang XY, Ren SQ, Sun J. Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016; 770–778.

  45. Howard AG, Zhu ML, Chen B, Kalenichenko D, Wang WJ, Weyand T et al. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. 2017. arXiv:1704.04861v1. 2017.

  46. Tan MX, Le QV. EfficientNet: Rethinking model scaling for convolutional neural networks. Proceedings of the 36th International Conference on Machine Learning, PMLR, 2019; 97: 6105–6114.

  47. Huang G, Liu Z, Maaten L, Weinberger KQ. Densely Connected Convolutional Networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017; 4700–4708.

  48. Dai Z, Liu H, Le QV, Tan M. CoAtNet: Marrying convolution and attention for all data sizes. Advances in Neural Information Processing Systems (NeurIPS); 2021.

  49. Ahmadianfar I, Heidari AA, Gandomi AH, et al. RUN beyond the metaphor: an efficient optimization algorithm based on Runge Kutta method. Expert Syst Appl. 2021;181:115079.


  50. Sang-To T, Hoang-Le M, Khatir S, et al. Forecasting of excavation problems for high-rise building in Vietnam using planet optimization algorithm. Sci Rep. 2021;11(1):23809.


  51. Sang-To T, Le-Minh H, Mirjalili S, et al. A new movement strategy of grey wolf optimizer for optimization problems and structural damage identification. Adv Eng Softw. 2022;173:103276.


  52. Yang Y, Chen H, Heidari AA, et al. Hunger games search: visions, conception, implementation, deep analysis, perspectives, and towards performance shifts. Expert Syst Appl. 2021;177:114864.


  53. Sang-To T, Le-Minh H, Wahab MA, et al. A new metaheuristic algorithm: shrimp and Goby association search algorithm and its application for damage identification in large-scale and complex structures. Adv Eng Softw. 2023;176:103363.


  54. Huang ML, Xu YX. Chinese medicinal blossom dataset. Mendeley Data. 2021;V1. https://doi.org/10.17632/r3z6vp396m.1.

  55. Roopashree S, Anitha J. Medicinal Leaf Dataset. Mendeley Data. 2020;V1. https://doi.org/10.17632/nnytj2v3n5.


Acknowledgements

The authors would like to acknowledge the generous guidance provided by the rest of the National Key Laboratory of Fundamental Science on Synthetic Vision. They would also like to acknowledge Yongliang Huang for providing additional information about medicinal plants in this paper.

Funding

This study was funded by the National Natural Science Foundation of China (No. 62371324), in part by the Special research project of Sichuan Traditional Chinese Medicine Administration (No. 2021MS220), and the Research Project of Hospital of Chengdu University of Traditional Chinese Medicine (No. 20ZJ18).

Author information

Authors and Affiliations

Authors

Contributions

Chaoqun Tan proposed the idea, conducted the experiments, and drafted the manuscript. Long Tian analyzed the results, and wrote and edited sections of the manuscript. Chunjie Wu and Ke Li participated in project management and obtained the funding for this study. All authors contributed to the paper and approved the submitted version.

Corresponding authors

Correspondence to Long Tian or Ke Li.

Ethics declarations

Ethics approval and consent to participate

All authors agreed to publish this manuscript.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article


Cite this article

Tan, C., Tian, L., Wu, C. et al. Rapid identification of medicinal plants via visual feature-based deep learning. Plant Methods 20, 81 (2024). https://doi.org/10.1186/s13007-024-01202-6

