Our research framework consists of a pipeline of four consecutive steps: image acquisition, preprocessing, feature extraction, and classifier training, as shown in Fig. 1. The following subsections discuss each step in detail and in particular describe the variables, image types, and preprocessing strategies studied in our experiments. We used state-of-the-art feature extraction and classifier training methods and kept them constant across all experiments.
Image acquisition
For each observation of an individual leaf, we systematically varied the following image factors: perspective, illumination, and background. An example of all images collected for a single observation is shown in Fig. 2. We captured two perspectives per leaf in situ and in a nondestructive way: the top side and the back side, since leaf structure and texture typically differ substantially between these perspectives. If necessary, we used a thin black wire to arrange the leaf accordingly. We recorded each leaf under two illumination conditions using a smartphone: flash off and flash on. Flash off refers to natural illumination without artificial light sources. In case of bright sunlight, we used an umbrella to shade the leaf against strong reflections and harsh shadows cast by the device, the plant itself, or the surrounding vegetation. Flash on refers to a second image, taken in the same manner but with the built-in flashlight activated. We also varied the background by recording an initial image in the leaf’s natural environment, composed of, e.g., other leaves and stones, termed natural background. Additionally, we used a white plastic sheet to record images with a plain background. Leaves were not plucked for this procedure but arranged onto the sheet through a hole in the sheet’s center. Finally, the leaf was picked and held up against the sky, using a black plastic sheet as background to prevent image overexposure. This additional image type is referred to as back light. In summary, we captured nine different image types per observation: two perspectives, two illumination conditions, and two backgrounds, plus the back light image.
All images were recorded using an iPhone 6 between April and September 2016, within a single vegetation season. Following a strict sampling protocol for each observation, we recorded images of 17 species representing typical wild-flowering plants that commonly occur on semi-arid grasslands scattered around the city of Jena in eastern Germany. At the time of image acquisition, every individual was flowering. The camera's closest focusing distance imposed a technical limit on the resolution of smaller leaves, since the entire leaf had to be captured in the image. The number of observations per species ranged from 11 (Salvia pratensis) to 25 (Pimpinella saxifraga). In total, we acquired 2902 images. The full dataset, including all image annotations, is freely available from [23].
Image preprocessing
Each leaf image was duplicated twice, yielding one copy for each of the three preprocessing strategies: non-preprocessed, cropped, and segmented. Non-preprocessed images were kept unaltered. Cropping was performed based on a bounding box enclosing the leaf (see Fig. 2). To facilitate efficient segmentation, we developed a semi-automated approach based on the GrabCut method [24]. GrabCut is based on iterated graph cuts and has been shown to be accurate and time-efficient for interactive image segmentation [25, 26]. The first iteration of GrabCut was initialized with a rectangle placed around the relevant image region, i.e., the focus area defined during image acquisition and available in the image’s EXIF data. This rectangle denotes the potential foreground, whereas the image corners were used as background seeds. The user was then allowed to iteratively refine the computed mask by adding markers denoting either foreground or background, if necessary. The total number of markers was logged for every image. To speed up the segmentation process, every image was resized to a maximum of 400 px on its longest side while maintaining the aspect ratio. Finally, the binary mask depicting only the area of the leaf was resized to the original image size. The boundary of the upscaled mask was then smoothed using a colored watershed variant after morphological erosion of the foreground and background labels, followed by automated cropping to that mask.
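As an illustration, the following Python sketch shows how such a GrabCut-based, semi-automated segmentation step could be implemented with OpenCV. It is a minimal sketch under our own assumptions, not the authors' tool: the function name, the handling of the focus rectangle, and the omission of the interactive marker refinement and the watershed-based boundary smoothing are simplifications on our part.

```python
import cv2
import numpy as np

def segment_leaf(image_path, focus_rect, max_side=400, iterations=5):
    """GrabCut-style leaf segmentation sketch (illustrative, not the authors' tool).

    focus_rect: (x, y, w, h) rectangle around the leaf, e.g., derived from the
    focus area stored in the image's EXIF data.
    """
    img = cv2.imread(image_path)
    h, w = img.shape[:2]

    # Downscale so the longest side is at most max_side px (aspect ratio preserved).
    scale = max_side / float(max(h, w))
    small = cv2.resize(img, (int(w * scale), int(h * scale)),
                       interpolation=cv2.INTER_AREA)

    # Scale the focus rectangle accordingly; it seeds the probable foreground,
    # while everything outside it (including the image corners) starts as background.
    x, y, rw, rh = [int(v * scale) for v in focus_rect]

    mask = np.zeros(small.shape[:2], np.uint8)
    bgd_model = np.zeros((1, 65), np.float64)
    fgd_model = np.zeros((1, 65), np.float64)
    cv2.grabCut(small, mask, (x, y, rw, rh), bgd_model, fgd_model,
                iterations, cv2.GC_INIT_WITH_RECT)

    # A user could now refine `mask` with foreground/background markers and
    # re-run cv2.grabCut with cv2.GC_INIT_WITH_MASK; omitted here for brevity.

    # Binary mask combining definite and probable foreground.
    binary = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD),
                      255, 0).astype(np.uint8)

    # Upscale the mask to the original resolution and crop the image to the mask.
    full_mask = cv2.resize(binary, (w, h), interpolation=cv2.INTER_NEAREST)
    ys, xs = np.where(full_mask > 0)
    segmented = cv2.bitwise_and(img, img, mask=full_mask)
    return segmented[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
```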
Quantifying manual effort
Image acquisition and preprocessing require substantial manual effort depending on the image type and the preprocessing strategy. We quantify the effort for each combination in order to facilitate a systematic evaluation and to discuss the resulting classification accuracy in relation to the necessary effort.
For a set of ten representative observations, we measured the time in seconds and the number of persons needed for the acquisition of each image. This was done for all combinations of the image factors perspective and background. Whereas a single photographer is sufficient to acquire images in front of a natural background, a second person is needed for images with a plain background and for back light images in order to arrange the leaf and the plastic sheet. We then quantified the effort of image acquisition for these combinations as average "person-seconds", i.e., the time in seconds multiplied by the number of persons involved.
In order to quantify the manual effort during preprocessing, we measured the time in seconds an experienced user requires for performing either cropping or segmentation on a set of 50 representative images. For each task, the timer was started the moment the image was presented to the user and stopped when the user confirmed the result. For cropping, we measured the time needed to draw a bounding box around the leaf, which amounted to 6.8 s on average, independent of the image conditions. Image segmentation, on the other hand, involved substantial manual work depending on the leaf type, e.g., compound or pinnate leaves, and the image background. In case of a natural background, multiple markers were often required. We measured the average time for setting one marker, amounting to 4.7 s, and multiplied this average time by the number of markers needed for segmenting each image. In case of a plain background and simple leaves, the automatically initialized first iteration of the segmentation process often delivered accurate results. In such cases, the only manual effort was confirming the segmentation result, which took about 2 s. For compound and pinnate leaves, e.g., of Centaurea scabiosa, the segmentation task was considerably more difficult and required 135 s on average per image with a natural background.
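The resulting effort model can be written down compactly. The sketch below merely restates the constants reported above (time times persons for acquisition, 6.8 s for cropping, 4.7 s per segmentation marker, 2 s for confirming an automatic result); the function names and the example values in the final lines are ours and purely illustrative.

```python
def acquisition_effort(seconds, persons):
    """Acquisition effort in person-seconds: recording time multiplied by persons involved."""
    return seconds * persons

def cropping_effort():
    """Drawing a bounding box took 6.8 s on average, regardless of image conditions."""
    return 6.8

def segmentation_effort(num_markers):
    """Estimated manual segmentation effort in seconds for one image."""
    if num_markers == 0:
        # The automatic first GrabCut iteration was sufficient; only confirmation was needed.
        return 2.0
    # Otherwise roughly 4.7 s per correction marker the user had to place.
    return num_markers * 4.7

# Illustrative examples (the recording time and marker count are made up):
print(acquisition_effort(20, 2))   # a back light image with two persons -> 40 person-seconds
print(segmentation_effort(29))     # a heavily marked compound leaf -> ~136 s
```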
The mean effort, measured in person-seconds, for all combinations of image types and preprocessing steps is displayed in Fig. 3. We define a baseline scenario for comparing the resulting classification accuracy in relation to the necessary effort for each combination: acquiring a top side leaf image with natural background and applying no preprocessing requires the minimum manual effort, with an empirically derived average time of 13.4 s.
Feature extraction
Using CNNs for feature extraction results in powerful image representations that, coupled with a Support Vector Machine as classifier, outperform handcrafted features in computer vision tasks [17]. Accordingly, we used the pre-trained ResNet-50 CNN, which ranked among the best performing networks in the ImageNet Large Scale Visual Recognition Challenge 2015 [27], for extracting compact but highly discriminative image features. Every image was bilinearly resized to 256 px on its shortest side, and a center crop of 224\(\times\)224 px was then forwarded through the network using the Caffe deep learning framework [28]. The output of the final global average pooling layer (pool5), which follows the last convolutional block, was extracted as a 2048-dimensional image feature vector and L2-normalized.
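A minimal sketch of this feature extraction step using pycaffe is given below. The deploy and weight file names and the layer name "pool5" refer to the publicly released Caffe ResNet-50 model and are assumptions on our part, not details taken from the paper; per-channel mean subtraction is omitted for brevity.

```python
import caffe
import cv2
import numpy as np

def extract_feature(image_path, net, layer="pool5"):
    """Extract an L2-normalized 2048-d ResNet-50 feature for one image (sketch)."""
    img = cv2.imread(image_path).astype(np.float32)

    # Bilinearly resize so the shortest side is 256 px, then take a 224x224 center crop.
    h, w = img.shape[:2]
    scale = 256.0 / min(h, w)
    img = cv2.resize(img, (int(round(w * scale)), int(round(h * scale))),
                     interpolation=cv2.INTER_LINEAR)
    h, w = img.shape[:2]
    top, left = (h - 224) // 2, (w - 224) // 2
    crop = img[top:top + 224, left:left + 224]
    # (Per-channel mean subtraction, as used for the public model, is omitted here.)

    # Caffe expects channels-first input; cv2 already provides BGR channel order.
    net.blobs["data"].reshape(1, 3, 224, 224)
    net.blobs["data"].data[0] = crop.transpose(2, 0, 1)
    net.forward()

    feat = net.blobs[layer].data[0].flatten()   # 2048-dimensional vector
    return feat / np.linalg.norm(feat)          # L2 normalization

# Hypothetical file names for the public Caffe ResNet-50 release.
caffe.set_mode_cpu()
net = caffe.Net("ResNet-50-deploy.prototxt", "ResNet-50-model.caffemodel", caffe.TEST)
```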
Image classification
We used the CNN image features discussed in the previous section to train linear Support Vector Machine (SVM) classifiers. Each combination of the nine image types and the three preprocessing strategies resulted in one dataset, yielding 27 datasets in total. These datasets were split into training (70% of the images) and test sets (30% of the images). In order to run comparable experiments, we enforced identical observations across all 27 datasets, i.e., for all combinations of image types and preprocessing strategies, the training and test sets were composed of the same individuals. Using the trained SVM, we classified the species for all images of each test dataset and calculated the classification accuracy as the percentage of correctly identified test images. All experiments were cross-validated using 100 randomized split configurations. Similarly, we quantified the species-specific accuracy as the percentage of correctly identified individuals per species. We used R version 3.1.1 [29] with the packages e1071 [30] for classifier training along with caret [31] for tuning and evaluation.
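For illustration, the sketch below mirrors this evaluation protocol (observation-level 70/30 splits repeated 100 times, a linear SVM per split, accuracy averaged over splits) in Python with scikit-learn. The authors' actual pipeline used R with e1071 and caret, so the function names, the splitting helper, and the absence of hyperparameter tuning here are simplifying assumptions of ours.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

def evaluate_dataset(features, labels, observation_ids, n_splits=100):
    """Average test accuracy over repeated 70/30 splits (illustrative sketch).

    Splitting is done at the observation level; using the same seed per split
    index would keep the same individuals in train/test across all 27 datasets.
    """
    accuracies = []
    unique_obs = np.unique(observation_ids)
    for seed in range(n_splits):
        train_obs, test_obs = train_test_split(unique_obs, test_size=0.3,
                                               random_state=seed)
        train_idx = np.isin(observation_ids, train_obs)
        test_idx = np.isin(observation_ids, test_obs)

        clf = LinearSVC()  # linear SVM on the L2-normalized CNN features
        clf.fit(features[train_idx], labels[train_idx])
        preds = clf.predict(features[test_idx])
        accuracies.append(accuracy_score(labels[test_idx], preds))
    return float(np.mean(accuracies))
```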