Texture recognition approach to plant identification
Inspired by the textural nature of bark and leaf surfaces, we approach plant recognition as texture classification. In order to describe texture independently of the pattern size and orientation in the image, a description invariant to rotation and scale is needed. For practical applications we also demand computational efficiency.
We introduce a novel texture description called Fast Features Invariant to Rotation and Scale of Texture (Ffirst), which combines several design choices to satisfy the given requirements. This method builds on and improves our texture descriptor for bark recognition [4].
Completed local binary pattern and histogram Fourier features
The Ffirst description is based on the Local Binary Patterns [51, 52, 71]. The common LBP operator (later denoted as signLBP) locally computes the signs of differences between the center pixel and its P neighbours on a circle of radius R. With an image function f(x, y) and neighbourhood point coordinates \((x_p,y_p)\):
$$\begin{aligned} \text {LBP}_{P,R} (x,y) = \sum \limits _{p=0}^{P-1} s( f(x,y) - f(x_p,y_p) )\, 2^p , \quad s(z) = \left\{ \begin{array}{ll} 1 &{} \text {if } z \le 0,\\ 0 &{} \text {otherwise.} \end{array} \right. \end{aligned}$$
(1)
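To illustrate Eq. (1), the following minimal Python sketch computes the sign-LBP code of a single pixel. It assumes nearest-pixel sampling of the circular neighbourhood instead of interpolation, and the function name `sign_lbp` and the toy image are hypothetical, chosen only for this illustration.

```python
import math

def sign_lbp(image, x, y, P=8, R=1):
    """Sign-LBP code of pixel (x, y) following Eq. (1), with s(z) = 1 if z <= 0."""
    center = image[y][x]
    code = 0
    for p in range(P):
        angle = 2.0 * math.pi * p / P
        # Assumption: neighbours are rounded to the nearest pixel (no interpolation).
        xp = x + int(round(R * math.cos(angle)))
        yp = y - int(round(R * math.sin(angle)))
        z = center - image[yp][xp]
        if z <= 0:
            code |= 1 << p  # the p-th neighbour contributes 2^p
    return code

# Toy 3x3 image: the right and upper neighbours are brighter than the center.
image = [
    [5, 5, 5],
    [1, 3, 9],
    [1, 1, 1],
]
code = sign_lbp(image, 1, 1)
print(format(code, "08b"))
```

In this example the four neighbours that are at least as bright as the center set the four lowest bits, which yields a uniform pattern with a single run of "1" bits.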
To achieve rotation invariance,^{Footnote 2} we adopt the so-called LBP histogram Fourier features (LBPHF) introduced by Ahonen et al. [53]. LBPHF describe the histogram of uniform patterns using coefficients of the discrete Fourier transform (DFT). Uniform LBP are patterns with at most 2 spatial transitions (bitwise 0-1 changes). Unlike the simple rotation invariants using \(\hbox {LBP}^\text {ri}\) [71, 72], which join all uniform patterns with the same number of 1s into one bin, the LBPHF features preserve the information about relative rotation of the patterns.
Denoting a uniform pattern \(U_p ^{n,r}\), where n is the “orbit” number corresponding to the number of “1” bits and r denotes the rotation of the pattern, the DFT for a given n is expressed as:
$$\begin{aligned} H(n,u) = \sum \limits _{r=0}^{P-1} h_I\left( U_p^{n,r}\right) e^{-i2\pi u r /P} \,, \end{aligned}$$
(2)
where the histogram value \(h_I (U_p^{n,r})\) denotes the number of occurrences of a given uniform pattern in the image.
The LBPHF features are equal to the magnitudes of the DFT coefficients, and thus are not influenced by the phase shift caused by rotation:
$$\begin{aligned} {LBPHF}(n,u) = \vert H(n,u) \vert = \sqrt{ H(n,u) \overline{H(n,u)}} . \end{aligned}$$
(3)
Since \(h_I\) are real, \(H(n,u) = \overline{H(n,P-u)}\) for \(u = (1,\ldots ,P-1)\), and therefore only \(\left\lfloor {\frac{P}{2}}\right\rfloor +1\) of the DFT magnitudes are used for each set of uniform patterns with n “1” bits for \(0<n<P\). Three other bins are added to the resulting representation, namely two for the “1-uniform” patterns (with all bits of the same value) and one for all non-uniform patterns.
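The construction of Eqs. (2) and (3) can be sketched in a few lines of Python. The sketch assumes that the histogram is given as a dictionary mapping P-bit LBP codes to counts and that the orbit with n “1” bits is generated by rotating a run of n consecutive ones; the helper names `rotate_left` and `lbp_hf` are hypothetical.

```python
import cmath

def rotate_left(code, r, P):
    """Circularly rotate a P-bit pattern by r positions."""
    r %= P
    mask = (1 << P) - 1
    return ((code << r) | (code >> (P - r))) & mask

def lbp_hf(hist, P=8):
    """DFT magnitudes per orbit (Eqs. 2-3) plus the three extra bins."""
    mask = (1 << P) - 1
    features = []
    uniform_total = hist.get(0, 0) + hist.get(mask, 0)
    for n in range(1, P):
        base = (1 << n) - 1  # n consecutive "1" bits, rotation r = 0
        h = [hist.get(rotate_left(base, r, P), 0) for r in range(P)]
        uniform_total += sum(h)
        for u in range(P // 2 + 1):  # the remaining magnitudes are redundant
            H = sum(h[r] * cmath.exp(-2j * cmath.pi * u * r / P) for r in range(P))
            features.append(abs(H))
    features.append(hist.get(0, 0))      # all bits 0
    features.append(hist.get(mask, 0))   # all bits 1
    features.append(sum(hist.values()) - uniform_total)  # non-uniform patterns
    return features

hist = {0b00000111: 4, 0b00011100: 2, 0b01010101: 3, 0: 1}
rotated = {rotate_left(code, 3, 8): count for code, count in hist.items()}
features = lbp_hf(hist)
features_rotated = lbp_hf(rotated)
```

Rotating every pattern by the same number of positions only shifts the histogram cyclically within each orbit, so `features` and `features_rotated` agree up to floating-point error.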
The LBP histogram Fourier features can be generalized to any set of uniform patterns. In Ffirst, the LBPHFSM description [54] is used, where the histogram Fourier features of both sign and magnitudeLBP are calculated to build the descriptor. The magnitudeLBP [73] checks whether the magnitude of the difference between the neighbouring pixel \((x_p,y_p)\) and the central pixel (x, y) exceeds a threshold \(t_p\):
$$\begin{aligned} \text {LBPM}_{P,R} (x,y) = \sum _{p=0}^{P-1} s( \vert f(x,y) - f(x_p,y_p) \vert - t_p)\, 2^p . \end{aligned}$$
(4)
We adopted the common practice of choosing the threshold value (for neighbours at the p-th bit) as the mean value of all m absolute differences in the whole image:
$$\begin{aligned} t_p = \sum \limits _{i=1}^m \dfrac{ \vert f(x_i,y_i) - f(x_{ip},y_{ip}) \vert }{m}. \end{aligned}$$
(5)
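A minimal Python sketch of Eqs. (4) and (5) follows. It assumes nearest-pixel sampling, an integer radius, and that only pixels with all neighbours inside the image are described; it reuses the function \(s(z)\) exactly as defined in Eq. (1). The names `neighbour_offsets` and `magnitude_lbp` are hypothetical.

```python
import math

def neighbour_offsets(P=8, R=1):
    """Integer (dx, dy) offsets of the P neighbours (nearest-pixel assumption)."""
    return [(int(round(R * math.cos(2 * math.pi * p / P))),
             -int(round(R * math.sin(2 * math.pi * p / P)))) for p in range(P)]

def magnitude_lbp(image, P=8, R=1):
    """Return {(x, y): code} for all pixels whose neighbours lie inside the image."""
    height, width = len(image), len(image[0])
    offsets = neighbour_offsets(P, R)
    pixels = [(x, y) for y in range(R, height - R) for x in range(R, width - R)]
    # Eq. (5): t_p is the mean absolute difference for the p-th neighbour.
    thresholds = [
        sum(abs(image[y][x] - image[y + dy][x + dx]) for x, y in pixels) / len(pixels)
        for dx, dy in offsets
    ]
    codes = {}
    for x, y in pixels:
        code = 0
        for p, (dx, dy) in enumerate(offsets):
            z = abs(image[y][x] - image[y + dy][x + dx]) - thresholds[p]
            if z <= 0:  # s(z) as defined in Eq. (1)
                code |= 1 << p
        codes[(x, y)] = code
    return codes

image = [
    [0, 0, 0, 0],
    [0, 10, 0, 0],
    [0, 0, 0, 0],
]
codes = magnitude_lbp(image)
```

With \(s(z)\) taken from Eq. (1), a bit is set when the absolute difference does not exceed the threshold; the opposite convention would simply invert every bit.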
The LBPHFSM histogram is created by concatenating histograms of LBPHFS and LBPHFM (computed from uniform signLBP and magnitudeLBP).
Multiscale description and scale invariance
A scale space is built by computing LBPHFSM from circular neighbourhoods with exponentially growing radius R. Gaussian filtering is used^{Footnote 3} to overcome noise.
Unlike the MSLBP approach of Mäenpää and Pietikäinen [74], where the radii of the LBP operators are chosen so that the effective areas of different scales touch each other, Ffirst uses a finer scaling with a step of \(\sqrt{2}\) between scale radii \(R_i\), i.e. \(R_i = R_{i-1} \sqrt{2}\). This radius change is equivalent to decreasing the image area to one half. The first LBP radius used is \(R_1=1\), as the LBP with low radii capture important high frequency texture characteristics.
Similarly to [74], the filters are designed so that most of their mass lies within an effective area of radius \(r_i\). We choose \(r_i\) such that the effective areas at the same scale touch each other: \(r_i = R_i \sin \frac{\pi }{P}\).
LBPHFSM histograms from c adjacent scales are concatenated into a single descriptor. Invariance to scale changes is increased by creating \(n_\text {conc}\) multiscale descriptors for one image. See Fig. 1 for the overview of the texture description method.
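The scale space described above can be sketched numerically as follows. The grouping of scales into multi-scale descriptors as a sliding window of c adjacent scales is an assumption made for this illustration, as are the function names.

```python
import math

def ffirst_scales(num_scales, P=8, R1=1.0):
    """Return (R_i, r_i) pairs with R_i = R_{i-1} * sqrt(2) and r_i = R_i * sin(pi / P)."""
    scales = []
    for i in range(num_scales):
        R = R1 * math.sqrt(2) ** i
        r = R * math.sin(math.pi / P)
        scales.append((R, r))
    return scales

def multiscale_groups(num_scales, c):
    """Indices of the c adjacent scales concatenated into each descriptor
    (assumption: a sliding window with step 1)."""
    return [list(range(start, start + c)) for start in range(num_scales - c + 1)]

scales = ffirst_scales(5)
groups = multiscale_groups(5, 3)
for R, r in scales:
    print(f"R = {R:.3f}, effective radius r = {r:.3f}")
```

Under this assumption, 5 scales with \(c=3\) would give three multi-scale descriptors per image.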
Support Vector Machine and feature maps
In most applications, a Support Vector Machine (SVM) classifier with a suitable non-linear kernel provides higher recognition accuracy than a linear SVM, at the price of significantly higher time complexity and higher storage demands (dependent on the number of support vectors). An approach for efficient use of additive kernels via explicit feature maps is described by Vedaldi and Zisserman [75] and can be combined with a linear SVM classifier. Using linear SVMs on feature-mapped data improves the recognition accuracy, while preserving linear SVM advantages like fast evaluation and low storage (independent of the number of support vectors), which are both very practical in real-time applications. In Ffirst we use the explicit feature map approximation of the histogram intersection kernel, although the \(\chi ^2\) kernel leads to similar results.
The "One versus All" classification scheme is used for multi-class classification, with Platt’s probabilistic output [76, 77] to ensure that SVM results are comparable among classes. The maximal posterior probability estimate over all scales is used to determine the resulting class.
In our experiments we use a stochastic dual coordinate ascent [78] linear SVM solver implemented in the VLFeat library [79].
Adding rotational invariants
The LBPHF features used in the proposed Ffirst description are usually built from the DFT magnitudes of differently rotated uniform patterns. We propose to use all LBP instead of just the subset of uniform patterns. Note that in this case, some orbits have a lower number of patterns, since some non-uniform patterns show symmetries, as illustrated in Fig. 1.
Additional rotational invariants are computed from the first DFT coefficients for each orbit:
$$\begin{aligned} \text {LBPHF}^{+}(n) = \sqrt{ H(n,1) \overline{H(n+1,1)}} \end{aligned}$$
(6)
\(\hbox {Ffirst}^{\forall +}\) denotes the method using the full set of patterns for LBPHF features and adding the additional LBP\(\hbox {HF}^{+}\) features.
Recognition of segmented textural objects
We propose to extend Ffirst to segmented textural objects by treating the border and the interior of the object segment separately.
Let us consider a segmented object region \({\mathbb {A}}\). One may describe only points that have all neighbours at a given scale inside \({\mathbb {A}}\). We show that describing a correctly segmented border, i.e. points in \({\mathbb {A}}\) with one or more neighbours outside \({\mathbb {A}}\) (see Fig. 2), adds discriminative information.
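The distinction between interior and border points can be sketched as follows, assuming the segment is given as a binary mask and the neighbourhood as a list of integer pixel offsets. Treating neighbours that fall outside the image as lying outside \({\mathbb {A}}\) is an assumption, and the function name is hypothetical.

```python
def split_interior_border(mask, offsets):
    """Split the pixels of a binary segment into interior and border points.

    Interior: all neighbours inside the segment.
    Border: at least one neighbour outside the segment (or outside the image).
    """
    height, width = len(mask), len(mask[0])
    interior, border = [], []
    for y in range(height):
        for x in range(width):
            if not mask[y][x]:
                continue
            all_inside = all(
                0 <= y + dy < height and 0 <= x + dx < width and mask[y + dy][x + dx]
                for dx, dy in offsets
            )
            (interior if all_inside else border).append((x, y))
    return interior, border

# A 3x3 segment inside a 5x5 image, described with an 8-pixel neighbourhood.
mask = [[1 if 1 <= x <= 3 and 1 <= y <= 3 else 0 for x in range(5)] for y in range(5)]
offsets = [(dx, dy) for dy in (-1, 0, 1) for dx in (-1, 0, 1) if (dx, dy) != (0, 0)]
interior, border = split_interior_border(mask, offsets)
```

For larger LBP radii the border region grows accordingly, since more points have at least one neighbour outside the segment.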
We experiment with 5 variants of the recognition method, differing in the processing of the border region:

1. \(\hbox {Ffirst}_\text {a}\) describes all pixels in \({\mathbb {A}}\) and maximizes the posterior probability estimate (i.e. SVM Platt’s probabilistic output) over all \(n_\text {conc}\) scales.

2. \(\hbox {Ffirst}_\text {i}\) describes only the segment interior, i.e. pixels in \({\mathbb {A}}\) with all neighbours in \({\mathbb {A}}\).

3. \(\hbox {Ffirst}_\text {b}\) describes only the segment border, i.e. pixels in \({\mathbb {A}}\) with at least one neighbour outside \({\mathbb {A}}\).

4. \(\hbox {Ffirst}_{\text {ib}{\sum }}\) combines the \(\hbox {Ffirst}_\text {i}\) and \(\hbox {Ffirst}_\text {b}\) descriptors and maximizes the sum of their posterior probability estimates over \(n_\text {conc}\) scales.

5. \(\hbox {Ffirst}_{\text {ib}{\prod }}\) combines the \(\hbox {Ffirst}_\text {i}\) and \(\hbox {Ffirst}_\text {b}\) descriptors and maximizes the product of their posterior probability estimates over \(n_\text {conc}\) scales.
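The decision rules of the variants above can be sketched as follows. The sketch assumes that posterior estimates are stored as `post[s][k]` for scale s and class k, and that the interior and border estimates are combined at the same scale before maximizing; both are assumptions made for illustration, and the function name is hypothetical.

```python
def predict_class(post_a, post_b=None, mode="max"):
    """Return the class with the highest (combined) posterior over all scales.

    mode="max":  single descriptor, maximum over scales.
    mode="sum":  sum of the two posterior estimates, maximum over scales.
    mode="prod": product of the two posterior estimates, maximum over scales.
    """
    best_class, best_score = None, float("-inf")
    for s in range(len(post_a)):
        for k in range(len(post_a[s])):
            if mode == "max":
                score = post_a[s][k]
            elif mode == "sum":
                score = post_a[s][k] + post_b[s][k]
            elif mode == "prod":
                score = post_a[s][k] * post_b[s][k]
            else:
                raise ValueError("unknown mode: " + mode)
            if score > best_score:
                best_class, best_score = k, score
    return best_class

# One scale, two classes: the interior favours class 0, the border favours class 1.
post_interior = [[0.9, 0.5]]
post_border = [[0.05, 0.4]]
print(predict_class(post_interior, post_border, "sum"),
      predict_class(post_interior, post_border, "prod"))
```

The toy values show that the sum and product rules can disagree: the product penalizes a class that one of the two descriptors considers very unlikely.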
The leaf databases contain images of leaves on an almost white background. Segmentations were obtained by thresholding using Otsu’s method [80].
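For reference, a minimal Python sketch of Otsu thresholding is given below. It assumes 8-bit grey values given as a flat list and selects the threshold that maximizes the between-class variance; the function name and the toy data are hypothetical.

```python
def otsu_threshold(pixels, levels=256):
    """Return the threshold t maximizing the between-class variance;
    pixels with value <= t form the darker class."""
    hist = [0] * levels
    for value in pixels:
        hist[value] += 1
    total = len(pixels)
    sum_all = sum(i * hist[i] for i in range(levels))
    best_t, best_var = 0, -1.0
    w0, sum0 = 0, 0
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

# Toy data: a dark leaf (value 10) on a nearly white background (value 240).
pixels = [10] * 50 + [240] * 50
t = otsu_threshold(pixels)
leaf_mask = [value <= t for value in pixels]
```

Since the background is almost white, the darker class is taken as the leaf segment in this sketch.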
Deep learning approach to plant identification
For significantly more complex tasks, where the photos are nearly unconstrained (depicting different plant organs or the whole plant in its natural environment), the background is complex, and the number of classes is much higher (10,000 in the case of LifeCLEF 2017 [81]), we choose a deep learning approach and utilize state-of-the-art deep convolutional neural networks, which have succeeded in a number of computer vision tasks, especially those related to complex recognition and detection of objects. Given the enormous popularity of convolutional neural networks in recent years and the volume of available deep learning literature (e.g. [82,83,84]), we skip most of the deep learning theory and only briefly describe our choices of architectures, models and techniques for our contributions to the PlantCLEF challenges.
In the experiments, we used state-of-the-art CNN architectures as a baseline and added the modifications described below: ensemble training with bagging, maxout, and bootstrapping for training on noisy labels. We initialized all convolutional layer parameters from networks pretrained on the 1 million ImageNet images, and then fine-tuned the networks on the training data for the plant recognition task. Such initialization is a common practice that speeds up training and helps to avoid early overfitting on tasks with a small number of training images.
Bagging
In deep learning challenges it is a common practice to train several networks on different (but not necessarily mutually exclusive) subsets of the training data. An ensemble of such networks, commonly combined by a simple voting mechanism (e.g. sum or maximum of class prediction scores), tends to outperform individual networks. In the PlantCLEF 2015 plant classification challenge, Choi [41] gained a significant margin in precision using bagging of 5 networks.
Maxout
Maxout [85] is based on an activation function, which takes a maximum over k parts (e.g. slices) of a network layer:
$$\begin{aligned} h_i(x)=\max _{j\in \left[ 1,k\right] } z_{ij} , \end{aligned}$$
(7)
where \(z_{ij} = {\mathbf {x}}^\text {T}{\mathbf {W}}_{..ij} + b_{ij}\) can be a standard fully connected (FC) layer with parameters \(W \in {\mathbb {R}}^{d\times m \times k}\), \(b \in {\mathbb {R}}^ {m \times k}\).
One can understand maxout as a piecewise linear approximation to a convex function, specified by the weights of the previous layer. Maxout was designed [85] to be combined with dropout [86].
Maxout is not applied on top of the FC classification layer (which would increase its size k times); instead, we add an additional FC layer with maxout activation before the classification FC layer.
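The maxout activation of Eq. (7) can be sketched in plain Python as follows. The nested-list layout `W[i][j]` (weight vector of the j-th linear piece of unit i) is an assumption chosen for readability, and the function name is hypothetical.

```python
def maxout(x, W, b):
    """Maxout layer: h_i(x) = max_j (x^T W[i][j] + b[i][j]).

    x: list of d inputs
    W: W[i][j] is the length-d weight vector of piece j of output unit i
    b: b[i][j] is the corresponding bias
    """
    outputs = []
    for i in range(len(W)):
        pieces = [
            sum(x_d * w_d for x_d, w_d in zip(x, W[i][j])) + b[i][j]
            for j in range(len(W[i]))
        ]
        outputs.append(max(pieces))
    return outputs

# d = 2 inputs, m = 2 output units, k = 2 linear pieces per unit.
x = [1.0, -2.0]
W = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[-1.0, 0.0], [0.0, -1.0]],
]
b = [[0.0, 0.0], [0.5, 0.0]]
h = maxout(x, W, b)
```

An FC layer with \(m \cdot k\) outputs followed by a maximum over groups of k outputs gives the same result, which is how the layer is described in the text.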
Bootstrapping
In order to improve learning from noisy labels in the scenario of the PlantCLEF 2017 plant identification challenge, we experimented with the so-called “bootstrapping” of Reed et al. [87]. An objective is proposed that takes into account the current predictions of the network, with the intention to lower the effect of incorrect labels. Reed et al. propose two variants of the objective:

Soft bootstrapping uses the probabilities \(q_k\) given by the network (softmax):
$$\begin{aligned} { L }_\text {soft} ({\mathbf {q}},{\mathbf {t}}) = \sum _{k=1}^N \left[ \beta t_k + ( 1 - \beta ) q_k \right] \log q_k, \end{aligned}$$
(8)
where \(t_k\) are the provided labels and \(\beta\) is a parameter of the method. The authors [87] point out that the objective is equivalent to softmax regression with minimum entropy regularization, which was previously studied in [88] and encourages high confidence in predicting labels.

Hard bootstrapping uses the strongest prediction \(z_k = {\left\{ \begin{array}{ll}1 \text { if } k=\text {argmax}_i\, q_i \\ 0 \text { otherwise}\end{array}\right. }\):
$$\begin{aligned} { L }_\text {hard} ({\mathbf {q}},{\mathbf {t}}) = \sum _{k=1}^N \left[ \beta t_k + ( 1 - \beta ) z_k \right] \log q_k \end{aligned}$$
(9)
We decided to follow the best performing setting of [87] and use hard bootstrapping with \(\beta =0.8\) in our experiments. The search for the optimal value of \(\beta\) was omitted for computational reasons and limited time for the competition, yet the dependence between the amount of label noise and the optimal setting of hyperparameter \(\beta\) is a topic for future work.
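Both objectives can be evaluated for a single sample with the short sketch below. It computes the sums exactly as written in Eqs. (8) and (9), without any sign change; a loss to be minimized would be the negative of these values. The function names and the toy probabilities are hypothetical.

```python
import math

def soft_bootstrap(q, t, beta):
    """Eq. (8): targets are a mix of the given labels and the predicted probabilities."""
    return sum((beta * t_k + (1 - beta) * q_k) * math.log(q_k) for q_k, t_k in zip(q, t))

def hard_bootstrap(q, t, beta):
    """Eq. (9): targets are a mix of the given labels and the one-hot strongest prediction."""
    k_max = max(range(len(q)), key=lambda k: q[k])
    total = 0.0
    for k, (q_k, t_k) in enumerate(zip(q, t)):
        z_k = 1.0 if k == k_max else 0.0
        total += (beta * t_k + (1 - beta) * z_k) * math.log(q_k)
    return total

# The network prefers class 0, while the (possibly noisy) label says class 1.
q = [0.7, 0.2, 0.1]
t = [0.0, 1.0, 0.0]
soft = soft_bootstrap(q, t, beta=0.8)
hard = hard_bootstrap(q, t, beta=0.8)
```

With \(\beta = 1\) both variants reduce to the usual log-likelihood of the provided label, which makes the role of \(\beta\) as a trust parameter explicit.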
ResNet with maxout for LifeCLEF 2016
In LifeCLEF 2016, we utilized the state-of-the-art very deep 152-layer residual network of He et al. [67]. The residual learning framework allows efficient training of networks that are substantially deeper than the previously used CNN architectures. We used the publicly available model pretrained on ImageNet [89] and inserted an additional fully connected layer sliced into 4 parts with 512 neurons each, and applied the maxout activation function on the slices. The parameters of both the new FC layer and the following 1000-way FC classification layer were initialized using the method of Glorot [90].
Thereafter, we fine-tuned the network for 150,000 iterations with the following parameters:

The learning rate was set to \(10^{-3}\) and lowered by a factor of 10 after every 100,000 iterations.

The momentum was set to 0.9, weight decay to \(2\cdot 10^{-4}\).

The effective batch size was set to 28 (either computed at once on NVIDIA Titan X, or split into more batches using Caffe’s iter_size parameter when used on GPUs with lower VRAM).

A horizontal mirroring of input images was performed during training.
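The step schedule and the batch splitting listed above can be sketched as follows, assuming an initial learning rate of 0.001. The function names are hypothetical and the sketch is independent of any particular deep learning framework.

```python
def step_learning_rate(iteration, base_lr=1e-3, gamma=0.1, step_size=100_000):
    """Learning rate lowered by a factor of 10 after every `step_size` iterations."""
    return base_lr * gamma ** (iteration // step_size)

def accumulation_steps(effective_batch=28, per_pass_batch=28):
    """Number of forward/backward passes accumulated per update so that the
    effective batch size is reached on a GPU with less memory."""
    if effective_batch % per_pass_batch != 0:
        raise ValueError("per-pass batch size must divide the effective batch size")
    return effective_batch // per_pass_batch

for iteration in (0, 99_999, 100_000, 149_999):
    print(iteration, step_learning_rate(iteration))
print(accumulation_steps(28, 7))
```

Within the 150,000 iterations used here, the learning rate is therefore lowered only once, after iteration 100,000.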
Due to computational limits at training time, we only performed bagging of 3 networks, although we expect that using a higher number of bagged networks would further improve the accuracy. For training the ensemble of networks, a different \(\frac{1}{3}\) of the training data was removed in each bag. The voting was done by taking the species-wise maximum of output probabilities.
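A minimal sketch of this bagging scheme is given below. How the training data were divided into thirds is not specified, so the round-robin split is an assumption; the function names and the toy probabilities are hypothetical.

```python
def make_bags(sample_ids, n_bags=3):
    """Bag b contains all samples except the b-th part of a round-robin split
    (the split strategy is an assumption)."""
    return [
        [s for i, s in enumerate(sample_ids) if i % n_bags != b]
        for b in range(n_bags)
    ]

def vote_max(probabilities):
    """Species-wise maximum over the networks, followed by the arg max.

    probabilities[m][k] is the output probability of species k from network m.
    """
    n_classes = len(probabilities[0])
    combined = [max(p[k] for p in probabilities) for k in range(n_classes)]
    winner = max(range(n_classes), key=lambda k: combined[k])
    return winner, combined

bags = make_bags(list(range(6)))
winner, combined = vote_max([
    [0.5, 0.3, 0.2],
    [0.1, 0.8, 0.1],
    [0.4, 0.4, 0.2],
])
```

Each sample thus appears in two of the three bags, so the bags overlap but are not identical.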
Inception-ResNet-v2 with maxout for LifeCLEF 2017
Our model for PlantCLEF 2017 was based on the state-of-the-art convolutional neural network architecture, the Inception-ResNet-v2 model [70], which introduced residual Inception blocks, a new type of Inception block making use of the residual connections from [67]. Both the paper [70] and our preliminary experiments show that this network architecture leads to results superior to other state-of-the-art CNN architectures. The publicly available [91] TensorFlow model pretrained on ImageNet was used to initialize the parameters of convolutional layers. The main hyperparameters were set as follows:

Optimizer: RMSProp with momentum 0.9 and decay 0.9.

Weight decay: 0.00004.

Learning rate: Starting LR 0.01 with decay factor 0.94, exponential decay, ending LR 0.0001.

Batch size: 32.
We added an FC layer with 4096 units. The maxout activation operates over \(k=4\) linear pieces of the FC layer, i.e. \(m=1024\). Dropout with a keep probability of 80% is applied before the FC layers. The final layer is a 10,000-way softmax classifier corresponding to the number of plant species needed in the 2017 task.
The PlantCLEF 2017 training data consists of 2 sets, both covering the same 10,000 plant species:

1. A “trusted” training set based on the online collaborative Encyclopedia Of Life (EoL), where the ground truth labels should be assigned correctly.

2. A “noisy” training set built using web crawlers (more precisely, the Google and Bing image search results), which may thus contain images that are not related to the declared plant species.
We fine-tuned our networks in three different ways:

1. Using only “trusted” (EoL) training data.

2. Using both “trusted” and “noisy” training data (EoL + web).

3. Filtering the “noisy” data using a model pretrained on the “trusted” data, and then fine-tuning on the combination of “trusted” and “filtered noisy” data (EoL + filtered web).
Datasets and evaluation methodology
Bark recognition is evaluated on a dataset collected by Österreichische Bundesforste (Austrian Federal Forests), which was introduced in 2010 by Fiel and Sablatnig [92] and contains 1182 bark images from 11 classes. We denote it as the Austrian Federal Forests (AFF) bark dataset.^{Footnote 4} The resolution of the images varies (between 0.4 and 8.0 Mpx). This dataset is not publicly available, but it was kindly provided by the Computer Vision Lab, TU Vienna, for academic purposes, courtesy of Österreichische Bundesforste/Archiv.
Unlike in bark recognition, there are a number of existing datasets for leaf classification, most of them publicly available. The datasets and their experimental settings are briefly described below:
The Austrian Federal Forest (AFF) leaf dataset was used by Fiel and Sablatnig [11] for recognition of trees, and was kindly provided together with the bark dataset described previously. It contains 134 photos of leaves of the 5 most common Austrian broad leaf trees. The leaves are placed on a white background. The results are compared using the protocol of Fiel and Sablatnig, i.e. using 8 training images per leaf class.
The Flavia leaf dataset contains 1907 images (1600 × 1200 px) of leaves from 32 plant species on white background, 50–77 images per class. The dataset was introduced by Wu et al. [17], who used 10 images per class for testing and the rest of the images for training. More recent publications use 10 randomly selected test images and 40 randomly selected training images per class, achieving better recognition accuracy even with the lower number of training samples. In the case of the two best results reported by Lee et al. [20, 21], the number of training samples is not clearly stated.^{Footnote 5} Some authors divide the set of images for each class into two halves, one for training and the other for testing.
The Foliage leaf dataset by Kadir et al. [19, 24] contains 60 classes of leaves from 58 species. The dataset is divided into a training set with 100 images per class and a test set with 20 images per class.
The Swedish leaf dataset was introduced in Söderkvist’s diploma thesis [25] and contains images of leaves scanned using a 300 dpi colour scanner. There are 75 images for each of 15 tree classes. The standard evaluation scheme uses 25 images for training and the remaining 50 for testing. Note: The best reported result of Qi et al. [27] was found on the project homepage [29].
The Leafsnap dataset version 1.0 by Kumar et al. [12] was publicly released in 2014. It covers 185 tree species from the Northeastern United States. It contains 23147 high quality Lab images and 7719 Field images. The authors note that the released dataset does not exactly match that used to compute results for the paper, nor the currently running version on their servers, yet it seems to be similar to the dataset used in [12] and should allow at least a rough comparison. In the experiments of [12], leave-one-image-out species identification was performed, using only the Field images as queries, matching against all other images in the recognition database. The probability of the correct match appearing among the top 5 results is taken as the resulting score. Note: The classification accuracy of [12] for the 1st result in Table 2 is estimated from a plot in [12]. Because the leave-one-image-out testing scheme would require retraining our classifiers for each tested image, we instead perform 10-fold cross validation, i.e. divide the set of Field images into 10 parts, testing each part on classifiers learned using the other parts together with the Lab images.
The Middle European Woods (MEW) dataset was introduced by Novotný and Suk [22]. It contains 300 dpi scans of leaves belonging to 153 classes (from 151 botanical species) of Central European trees and shrubs. There are 9745 samples in total, at least 50 per class. The experiments are performed using half of the images in each class for training and the other half for testing.
The PlantCLEF challenge datasets depict plants in a significantly wider range of views, such as leaves, flowers, fruits, stems, entire plants and branches.
In the plant identification challenge PlantCLEF 2016, the training set contained 113,205 images of 1000 species of herbs, trees and ferns, and included also other metadata, such as the type of view (fruit, flower, entire plant, etc.), observation ID and GPS coordinates (if available). The test set contained 8000 pictures, including “distractor” images which did not depict one of the 1000 species.
In the PlantCLEF 2017 challenge, there were two training sets available: a “trusted” set of 256,287 labelled images of 10,000 plant species with metadata, and a “noisy” set with URLs to more than 1.4 million weakly-labelled web images obtained by Google and Bing image search. The evaluation of the task was performed on a test set containing 25,170 images of 13,471 observations (specimens). There are no “distractor” images in the 2017 test set.
While the PlantCLEF 2016 challenge was evaluated based on the mean Average Precision (mAP), PlantCLEF 2017 used a less common measure, the mean reciprocal rank (MRR):
$$\begin{aligned} \mathrm{MRR} = \dfrac{1}{\vert Q \vert }\sum ^{\vert Q \vert }_{i=1}\dfrac{1}{\text {rank}_i}, \end{aligned}$$
(10)
where \(\vert Q \vert\) is the total number of queries in the test set and \(\text {rank}_i\) is the rank of the correct result for the i-th query.
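Eq. (10) can be computed with the following short sketch. It assumes that each query comes with a ranked list of predicted species and that a query whose correct species is missing from the list contributes zero; the function name and the toy data are hypothetical.

```python
def mean_reciprocal_rank(ranked_predictions, ground_truth):
    """Mean of 1 / rank of the correct result over all queries (Eq. 10)."""
    total = 0.0
    for predictions, truth in zip(ranked_predictions, ground_truth):
        if truth in predictions:
            total += 1.0 / (predictions.index(truth) + 1)  # ranks start at 1
        # Assumption: a missing correct species contributes 0 to the sum.
    return total / len(ground_truth)

ranked_predictions = [
    ["a", "b", "c"],  # correct species at rank 1
    ["b", "a", "c"],  # correct species at rank 2
    ["c", "b", "a"],  # correct species at rank 3
]
ground_truth = ["a", "a", "a"]
mrr = mean_reciprocal_rank(ranked_predictions, ground_truth)
```

For the three toy queries the reciprocal ranks are 1, 1/2 and 1/3, so the MRR equals 11/18.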