Segmentation of structural parts of rosebush plants with 3D point-based deep learning methods

Background: Segmentation of structural parts of 3D models of plants is an important step for plant phenotyping, especially for monitoring architectural and morphological traits. Current state-of-the-art approaches rely on hand-crafted 3D local features for modeling geometric variations in plant structures. While recent advancements in deep learning on point clouds have the potential of extracting relevant local and global characteristics, the scarcity of labeled 3D plant data impedes the exploration of this potential.
Results: We adapted six recent point-based deep learning architectures (PointNet, PointNet++, DGCNN, PointCNN, ShellNet, RIConv) for segmentation of structural parts of rosebush models. We generated 3D synthetic rosebush models to provide an adequate amount of labeled data for modification and pre-training of these architectures. To evaluate their performance on real rosebush plants, we used the ROSE-X data set of fully annotated point cloud models. We conducted experiments with and without the incorporation of synthetic data to demonstrate the potential of point-based deep learning techniques even with limited labeled data of real plants.
Conclusion: The experimental results show that PointNet++ produces the highest segmentation accuracy among the six point-based deep learning methods. The advantage of PointNet++ is that it provides flexibility in the scales of the hierarchical organization of the point cloud data. Pre-training with synthetic 3D models boosted the performance of all architectures, except for PointNet.


Background
Automatic plant phenotyping based on computer vision techniques has become essential for enabling high throughput experiments in botanical and agricultural research [1]. While 2D image-based processing facilitates high-throughput phenotyping, advances in 3D data acquisition and modeling provide precise estimation of traits through full, occlusion-free 3D geometric information of plants [2,3].
Several measurements related to plant phenotyping require segmentation of plant parts, such as branches and individual leaves. Shape-related phenotypical traits of potted ornamental plants are especially important for assessing their visual quality [4]. Architectural traits can be simple, such as the diameters of branches, the number of internodes and stem length [5]. An extended list of more complex architectural traits for rosebush plants is given in [6]. Examples of such traits are the number of axes terminated in a flower bud, the number of branching orders, the lengths of axes and the branching angles. Estimation of the length, width and area of leaves provides information for modeling of rose genotypes [7]. In order to automatically extract these phenotypical traits from acquired 3D plant data, a necessary step is identifying the structural category of each 3D point. After stem, flower and leaf points are identified, further processing can be applied to determine individual organs, such as individual leaves, to extract their statistical and geometric characteristics [8]. Stem points can be processed to detect branching points, which are fundamental for measuring architectural traits [9].
A large body of research has been conducted in recent decades for organ segmentation of plants using machine learning approaches through 2D images and 3D reconstructions [10][11][12][13][14][15][16][17][18][19][20][21]. The common practice for segmentation of 3D models is to extract hand-crafted local surface features, such as eigenvalues of the local covariance matrix [22] or the second tensor [12], Fast Point Feature Histograms (FPFH) [14,16,23,24], and surface curvature [15]. Local features can also be extracted from volumetric representations of plants. Examples of volumetric approaches are the extraction of eigenvalues of the second-moments tensor of the 3D neighbourhood [25], a breadth-first flood-fill algorithm with a 26-connected neighbourhood [18], and the extraction of multi-scale texture and edge features [26]. In [16,22,24,26], semantic segmentation methods are equipped with supervised learning techniques such as Support Vector Machines and Random Forests. Markov Random Fields (MRF)-based smoothing over class labels [15,24] or region growing [16,23] are occasionally used to ensure consistency of point labels within local regions.
Apart from segmentation methods based on local features, graph-based approaches involving spectral embedding and clustering [17,27] can also be effective. Another strategy is fitting geometric primitives such as ellipses, tubular structures, cylinders or rings to 3D data for semantic segmentation [11,13,28,29].
Deep learning methods, in contrast to the use of handcrafted features, have the advantage of being able to learn features from raw input data and model the within-class and between-class variations of the features simultaneously. Their application to 2D image-based plant detection, phenotyping and part segmentation has proven successful [30][31][32][33][34][35][36][37][38]. Despite this trend, deep learning methods that directly consume 3D point clouds have not been explored for 3D plant phenotyping. The main factor that impedes this exploration is the requirement for large amounts of training data and the lack of large annotated 3D plant data sets [39]. Even moderate-size annotated data sets of full plant models are not available. As opposed to the speed of acquiring and annotating 2D images, the procedures for 3D model reconstruction and annotation of real plants are time-demanding and error-prone.
A strategy to reduce this time-consuming step is using synthetic data generated with their associated ground truth. This approach has been extensively used in plant phenotyping with 2D images [40][41][42][43][44][45][46][47]. Incorporation of synthetic plants through generative models such as Lindenmayer systems (L-systems) [48,49] into training data is effective with 2D plant phenotyping [50]. The same scheme of creating synthetic 3D plant models can be applied to supply sufficient training data to machine learning frameworks [39].
Virtual plant modeling has been used in agricultural and plant sciences to simulate plant behaviour and analyze interactions of the plants with their environment [51][52][53]. Examples of platforms that construct virtual plant models are the L+C modelling language [54,55] and the L-Py framework [56], both of which are based on the formalism of L-systems [48]. Despite the availability of such platforms capable of generating synthetic plants with complex architectures, employing them as 3D training data in the form of point clouds for plant phenotyping is not yet practiced.
Research on deep learning methods that directly consume 3D point clouds as input data has exploded since the publication of the pioneering work of Qi et al. [57], which introduced PointNet [58][59][60]. Guo et al. [58] provide a recent and comprehensive review on deep learning for point clouds. For the semantic part segmentation application alone, Guo et al. [58] compare 30 point-based architectures that have been developed since 2017. It is beyond the scope of this paper to mention all these architectures here. The benchmarks with which these architectures are commonly tested are data sets including indoor scenes (S3DIS [61], ScanNet [62]) or outdoor urban scenes (Semantic3D [63], Semantic KITTI [64,65]).
Despite the fast progress in research on point-based 3D deep learning techniques, their application in plant sciences and agriculture is limited to very few studies. For example, Wu et al. [66] modified the PointNet architecture for separating foliage and woody components in terrestrial laser scanning data. In [67], PointNet was used to estimate the proper grasping pose of apples for autonomous harvesting. In some studies aiming at part segmentation of 3D plant models, Convolutional Neural Networks (CNN) were applied to 2D multi-view images and the inferences were backprojected to 3D for post-processing [68,69]. In [70], a voxel-based convolutional neural network (VCNN) was designed for maize stem and leaf classification and segmentation. The point clouds were converted to volumetric models before being processed. The authors briefly compared their method to PointNet and PointNet++ in terms of segmentation accuracy. To the best of our knowledge, this is the only work where the authors reported part segmentation results on 3D plant models using point-based deep learning architectures.
Exploration of the performance of recent deep learning techniques on 3D plant phenotyping is imperative since these approaches have the promise of simultaneous extraction of relevant information from the data at various scales and learning to design classifiers that model the variability in the data. They have been proven to outperform classical machine learning methods that rely on hand-crafted features. However, the recently developed 3D point-based deep learning architectures have not previously been analyzed for their suitability for organ segmentation of full 3D plant models.
The objective of this work is to address this lack of analysis and to provide a benchmark for application of 3D point-based deep learning methods to plant part segmentation. The target data set is the recently introduced ROSE-X data set, which includes eleven 3D models of real rosebush plants obtained through X-ray imaging [26]. The models are fully annotated with three semantic labels: (1) Flower, (2) Leaf, and (3) Stem (branches and petioles). As baseline methods, six recent 3D point-based deep learning architectures were modified with the help of synthetic models and evaluated for the segmentation of real rosebush plants into their structural parts.
We used a simulator based on L-systems in order to generate 3D synthetic rosebush (Rosa x hybrida) models. Although 3D synthetic plant models were previously utilized for rendering 2D images for 2D deep learning methods, to the best of our knowledge, they were not previously used in full 3D form for directly enriching the 3D training data for deep learning. In addition to providing a first exploration of the potential of various 3D point-based deep networks for plant phenotyping, this work also presents a first investigation of the contribution of 3D synthetic models for modifying and training such networks. This investigation is particularly important for addressing the challenge of limited labeled 3D plant data.
In summary, the contributions of this work are

Methods
We address the application of 3D point-based deep learning segmentation methods to the specific problem of segmentation of 3D plant models to their structural parts. We considered six such architectures for adaptation to the problem and compared their shortcomings and strengths. The architectures are (1) PointNet [57], (2) PointNet++ [71], (3) Dynamic Graph CNN (DGCNN) [72], (4) PointCNN [73], (5) ShellNet [74], and (6) RIConv [75]. We employed the recently introduced ROSE-X data set [26], which includes eleven 3D models of real rosebush plants to train and evaluate the networks. The data set is accompanied with ground truth information in the form of point-level labels of the plant shoot corresponding to three classes: (1) Flower, (2) Leaf, and (3) Stem (branches and petioles).
In order to explore the contribution of using synthetic data for modifying and training the networks, we created a data set consisting of 48 synthetic rosebush (Rosa x hybrida) models. The models were generated by a simulator developed by Favre et al. [76]. The simulator was implemented with L-studio software [55] based on L-systems. The point clouds extracted from the synthetic data are used to modify and pre-train the networks. Using transfer learning [77], the networks are updated using the training set of point clouds of the ROSE-X data set. The results on the test models from the ROSE-X were compared with those of the default networks trained without the use of the synthetic data.

Data sets
In this study, we utilized two sets of 3D models of rosebush plants. The first set is the ROSE-X data set, which is composed of 11 fully annotated 3D models of real rosebush plants acquired through X-ray scanning. The second is the set of synthetic rosebush models which were generated using the L-studio-based simulator developed by Favre et al. [76]. The details of the data sets are provided in the following subsections. The ROSE-X data set is open to public use at [78].

ROSE-X data set
The models in the ROSE-X data set were acquired from real rosebush plants using a 3D X-ray imaging system. The volumetric models were fully annotated with manual supervision and then converted to 3D point clouds. The details of the procedure for annotation and the data structure can be found in [26]. Each point in a point cloud belongs to one of three organ classes: leaf, stem, and flower. The petioles between leaflets were also labeled as stem, since they have branch-like structures and their inclusion in the architecture of branches is important for further analysis.
In most 3D phenotyping experiments, especially for plants of complex architecture, the number of annotated 3D models will be limited. Thus, we set the number of real rosebush plants reserved for training as three. The distribution of points to the three classes for these models is given in Table 1.
Although the data size in terms of the number of real plants is limited, the plants in the data set are moderately large (30 to 50 cm in height) and possess complex architectures with significant variations in the shape and organization of organs within a plant. Furthermore, the plant data is partitioned into blocks, each of which is separately processed by the deep learning architectures. The point density of the 3D models allows sampling of 4096 points in each block. From the three rosebush plants reserved for training and validation, we extracted 251 blocks, leading to a moderate amount of data for the purposes of training a machine learning algorithm. For the eight real plants reserved for testing, the number of blocks is even higher (525 blocks), allowing a reliable performance assessment of the deep learning architectures.

Synthetic rosebush models
To create synthetic rosebush (Rosa x hybrida) models, we used a simulation procedure originally developed by Favre et al. [76], and updated in [79]. The procedure was implemented with the L-studio software [55], which provides a modular framework for plant development based on the literature on parametric L-systems [48,80]. This framework makes it possible to integrate measurable characteristics associated with individual modules of specific plant species [81]. For the synthetic rosebush model of Favre et al. [76], such characteristics were derived from observations on real plants. Morphometric measurements (i.e. diameter and length of organs), architectural structures (i.e. leaf formation order) and physiological data were analyzed and integrated into the model. The simulation model of Favre et al. [76] was further updated in [79] with three core architectural parameters: (1) the number of axes; (2) their location or topology; and (3) their morphologic type (short or long), determined from a five-month-old crop of pot plants cultivated in a greenhouse under controlled non-restrictive conditions [82].
Using this simulation procedure, we generated 48 different rosebush models in the form of triangle meshes. The triangle mesh and the point cloud of a sample synthetic rosebush model are given in Fig. 1. Each triangle in a model is inherently classified into one of seven organs: Leaflet, petiole, stem, stipule, petal, sepal, and receptacle (Fig. 1a). Since the ROSE-X labels are not as fine-grained, the petiole, stem and stipule classes were merged together to form the stem class and the sepal, petal and receptacle classes were merged into the flower class after converting the mesh model into a point cloud (Fig. 1b).
In order to generate point clouds from these triangle mesh models, we homogeneously sampled points from the triangular surfaces. A point cloud is a set of 3D points $P = \{p_1, p_2, \ldots, p_N\}$, where each point $p_i \in P$ is represented by its coordinates $(x, y, z)$ in 3D space. $N$ is the number of points in $P$ and defines the size of the point cloud. The sampling rate was set to 120 points per square unit, resulting in point clouds of 150,000 to 300,000 points per plant. The dimensions of the synthetic models along the x-, y- and z-axes are in the range of 30 to 50 cm, in accordance with the scale of the real rosebush models.
For each of the deep learning architectures explored in this paper, we applied many modifications to their default parameters in order to adapt them to segmentation of plants. We modified these parameters experimentally by dividing the synthetic rosebush data into a training and a validation set. From the 48 synthetic rosebush models, 8 plants were randomly selected and reserved for validation. The rest of the point clouds are used for training the networks. Similar to the plants in the ROSE-X data set, the synthetic plants are processed through block partitioning. For the total number of blocks extracted from the two sets, please see the Results section.
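The homogeneous surface sampling and the label merging described for the synthetic models can be sketched as follows. This is only an illustrative sketch under stated assumptions: the function name `sample_mesh`, the lower-case label strings, and the area-weighted barycentric sampling scheme are our choices for illustration and are not taken from the simulator or from the original implementation.

```python
import numpy as np

# Merging of the seven simulator organ labels into the three ROSE-X classes,
# as stated in the text (label spellings are an assumption of this sketch).
MERGE = {"leaflet": "leaf",
         "petiole": "stem", "stem": "stem", "stipule": "stem",
         "petal": "flower", "sepal": "flower", "receptacle": "flower"}

def sample_mesh(vertices, faces, face_labels, density, rng):
    """Sample points uniformly on a triangle mesh (density = points per unit area)."""
    a, b, c = (vertices[faces[:, i]] for i in range(3))
    areas = 0.5 * np.linalg.norm(np.cross(b - a, c - a), axis=1)
    n_points = int(round(density * areas.sum()))
    # Choose triangles with probability proportional to their area.
    tri = rng.choice(len(faces), size=n_points, p=areas / areas.sum())
    # Uniform barycentric coordinates (square-root parameterisation).
    r1, r2 = rng.random(n_points), rng.random(n_points)
    s = np.sqrt(r1)
    u, v, w = 1.0 - s, s * (1.0 - r2), s * r2
    points = u[:, None] * a[tri] + v[:, None] * b[tri] + w[:, None] * c[tri]
    labels = np.array([MERGE[face_labels[t]] for t in tri])
    return points, labels

# Toy mesh: a unit square in the z = 0 plane made of two labeled triangles.
vertices = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0]], dtype=float)
faces = np.array([[0, 1, 2], [0, 2, 3]])
face_labels = ["leaflet", "petiole"]
points, labels = sample_mesh(vertices, faces, face_labels,
                             density=120, rng=np.random.default_rng(0))
```

Sampling triangles in proportion to their area keeps the point density roughly constant over the surface, which is one common way to obtain a homogeneous sampling.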

Data preprocessing
The point-based deep learning architectures accept fixed-size data as input. Feeding the entire rosebush model to the networks requires a large sub-sampling rate, resulting in a significant loss of geometric information. Therefore, we follow the strategy commonly used with point-based deep learning methods to handle large-scale point clouds [73]: We partition a rosebush point cloud into fixed-size cubic blocks, each of which is then processed as an independent point cloud by the deep neural networks. The block size in terms of edge length is set as 10 cm through experimentation with the synthetic data set. The networks are trained to segment the organs present in these blocks. At the inference phase, an input plant model is partitioned into blocks, and the predictions from the blocks are combined to obtain a full segmentation.
In general, the choice of the block size depends on the resolution of the input point cloud. A large block size leads to loss of detail due to the subsampling required to attain a fixed number of points, whereas a small block size reduces the contextual information among semantic parts. Starting from a block size that results in an adequate resolution of the organ surfaces and that covers multiple organs, we varied the block size to increase the performance on the validation set. In our experiments, halving or doubling the initial block size changed the performance of the networks by around 3%.
The points in a block should be sampled such that each block includes a fixed number N of points (N is 4096 for the architectures used in this study). We followed a semi-random sampling strategy in order to ensure that the sampled points are distributed in a homogeneous fashion and that structures possessing fewer points (like thin branches) are not lost. If a block contains fewer than 10% of N points, the block is discarded and its points are included in a neighboring block. Then, the distribution of the points in each block is analyzed by partitioning the block into voxels with a fixed grid size (0.2 cm in this work). The average number of points per voxel is calculated. For voxels that contain fewer points than the average value, the number of points is increased to the average value by adding copies of the points to the data. Next, if the number of points in the block exceeds the allowed number of points, mutually exclusive subsets of N points are selected randomly to form multiple blocks representing the same region. Finally, the blocks with fewer than N points are populated through random point repetition before the training phase.
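A minimal sketch of the cubic block partitioning is given below. The function name, the choice of the minimum corner of the point cloud as grid origin, and the `offset` argument (which shifts the grid, as in the offset-based partitioning described later in this section) are assumptions made for illustration.

```python
import numpy as np

def partition_into_blocks(points, block_size=10.0, offset=0.0):
    """Group point indices by the cubic block that contains them.

    Coordinates, block_size and offset share the same unit (cm in the text).
    The grid origin (minimum corner minus offset) is an assumption of this sketch.
    """
    origin = points.min(axis=0) - offset
    cell = np.floor((points - origin) / block_size).astype(int)
    blocks = {}
    for idx, key in enumerate(map(tuple, cell)):
        blocks.setdefault(key, []).append(idx)
    return {key: np.array(members) for key, members in blocks.items()}

# Toy cloud with plant-like extents (30 x 30 x 40 cm).
rng = np.random.default_rng(0)
cloud = rng.random((1000, 3)) * [30.0, 30.0, 40.0]
blocks = partition_into_blocks(cloud)              # offset 0
blocks_shifted = partition_into_blocks(cloud, offset=5.0)
```

Each dictionary entry holds the indices of the points falling in one block, so labels and predictions can later be mapped back to the full point cloud.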
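The voxel-based balancing and the fixed-size sampling steps can be sketched as below. This is a hedged illustration, not the original implementation: the function names are hypothetical, the average is taken over occupied voxels and rounded up, and the merging of sparse blocks into neighboring blocks is omitted.

```python
import numpy as np

def balance_voxels(points, grid=0.2, rng=None):
    """Return point indices in which sparse voxels are topped up by duplication."""
    rng = np.random.default_rng() if rng is None else rng
    keys = np.floor((points - points.min(axis=0)) / grid).astype(int)
    _, inverse, counts = np.unique(keys, axis=0,
                                   return_inverse=True, return_counts=True)
    inverse = inverse.ravel()
    # Assumption: average over occupied voxels, rounded up to an integer.
    target = int(np.ceil(counts.mean()))
    extra = []
    for v in np.flatnonzero(counts < target):
        members = np.flatnonzero(inverse == v)
        extra.append(rng.choice(members, size=target - counts[v], replace=True))
    return np.concatenate([np.arange(len(points))] + extra)

def to_fixed_size(idx, n=4096, rng=None):
    """Split indices into disjoint random subsets of n; pad the last by repetition."""
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.permutation(idx)
    chunks = []
    for start in range(0, len(idx), n):
        chunk = idx[start:start + n]
        if len(chunk) < n:
            pad = rng.choice(chunk, size=n - len(chunk), replace=True)
            chunk = np.concatenate([chunk, pad])
        chunks.append(chunk)
    return chunks

# Toy block: 600 points in a 2 cm cube, with a small n for illustration.
rng = np.random.default_rng(0)
block = rng.random((600, 3)) * 2.0
idx = balance_voxels(block, grid=0.2, rng=rng)
chunks = to_fixed_size(idx, n=256, rng=rng)
```

Working with indices rather than coordinates keeps the duplicated points tied to their original labels.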
To enrich the training data, block partitioning is performed with two different offset values (0 and 5 cm) for each training plant model, keeping the block size fixed. In this way, two sets of blocks containing different data from each model are created, providing additional input training data for the networks.
For segmentation of a new test point cloud, two offset values are used during block partitioning and the blocks of the two sets are fed into the network. As a result, for each point in the point cloud, two sets of probability scores for the part classes are obtained. The class with the highest probability score is assigned to the point.
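The fusion of the two sets of class scores can be written in a few lines. The sketch below assumes that the per-point scores from both partitionings have already been mapped back to the same point ordering, and the class order (Flower, Leaf, Stem) is an assumption for illustration.

```python
import numpy as np

CLASSES = ("Flower", "Leaf", "Stem")  # assumed class order

def merge_predictions(scores_a, scores_b):
    """Assign each point the class with the highest score over both score sets."""
    best = np.maximum(scores_a, scores_b)   # (n_points, n_classes)
    return best.argmax(axis=1)

# Two points, scored once per block offset.
scores_offset0 = np.array([[0.6, 0.3, 0.1],
                           [0.2, 0.5, 0.3]])
scores_offset5 = np.array([[0.1, 0.8, 0.1],
                           [0.1, 0.2, 0.7]])
labels = merge_predictions(scores_offset0, scores_offset5)
names = [CLASSES[i] for i in labels]   # ['Leaf', 'Stem']
```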

3D point-based deep learning architectures
We considered six different 3D point-based deep learning architectures for the problem of part segmentation of rosebush models: (1) PointNet [57], (2) PointNet++ [71], (3) Dynamic Graph CNN (DGCNN) [72], (4) PointCNN [73], (5) ShellNet [74], and (6) RIConv [75]. As will be described in detail in Results section, we performed various experiments involving real and synthetic models. We performed extensive experiments with synthetic data alone to modify the architectures in terms of the number of layers, the number of feature channels in the layers, neighborhood sizes, point sampling rates in local neighborhoods, and other hyper-parameters. The final modifications on these parameters correspond to the best-performing settings on the validation set of the synthetic data. The weights of the modified and pre-trained networks are then fine-tuned with real rosebush data. The validation set of real data was instrumental for deciding which weights will be updated during retraining. For the experiments where we excluded synthetic data and used only real models for training, we kept the default settings of the architectures.
In the following subsections, we briefly describe the key approaches of these architectures to the problem of encoding local geometric structure of 3D point clouds. We present the parameters of the architectures that yielded the best performance in the validation set of the synthetic data. For the default structures of the architectures and for other details, please refer to the original articles.

PointNet
PointNet architecture [57] is the first deep neural network architecture that directly accepts a point cloud as input. It uplifts the (x, y, z) coordinates of each 3D point separately to high-dimensional features through Multilayer Perceptrons (MLP) with shared weights. A single maximum pooling operation is applied to summarize all the point features followed by fully-connected (FC) MLPs. The result is a single global feature vector describing the input point cloud. This feature vector is concatenated to individual point-based features to be processed by successive layers. Weight-shared MLP layers are applied to the concatenated features to extract the class scores for each point.
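The data flow described above (shared MLP, global max-pooling, concatenation, per-point scores) can be illustrated with a small NumPy forward pass. The layer widths and random weights are placeholders chosen for this sketch; they do not correspond to the default or the modified PointNet configuration.

```python
import numpy as np

def shared_mlp(x, layers):
    """Apply the same dense layers (with ReLU) to every row of x independently."""
    for w, b in layers:
        x = np.maximum(x @ w + b, 0.0)
    return x

def init(rng, dims):
    """Random placeholder weights for consecutive layer widths."""
    return [(rng.normal(0, 0.1, (i, o)), np.zeros(o))
            for i, o in zip(dims[:-1], dims[1:])]

def pointnet_segment(points, params):
    local = shared_mlp(points, params["local"])                 # (N, 128) per-point features
    pooled = local.max(axis=0)                                  # (128,) order-independent summary
    global_feat = shared_mlp(pooled[None, :], params["global"])[0]
    tiled = np.broadcast_to(global_feat, (len(points), global_feat.size))
    fused = np.concatenate([local, tiled], axis=1)              # (N, 256)
    hidden = shared_mlp(fused, params["head"][:-1])
    w, b = params["head"][-1]
    return hidden @ w + b                                       # (N, 3) class scores

rng = np.random.default_rng(0)
params = {"local": init(rng, [3, 64, 128]),
          "global": init(rng, [128, 128]),
          "head": init(rng, [256, 64, 3])}
points = rng.random((128, 3))
scores = pointnet_segment(points, params)
perm = rng.permutation(len(points))
scores_perm = pointnet_segment(points[perm], params)
```

Because the only interaction between points is the max-pooling, permuting the input points simply permutes the rows of the output.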
As with other architectures, we modified the default PointNet architecture using the synthetic rosebush models. We inserted an additional FC layer after max-pooling. An additional MLP layer was inserted after the global and point-wise features were concatenated. The number of channels at various layers were also altered. The modified PointNet architecture for segmentation is given in Fig. 2.
PointNet processes each point in an isolated manner up to the max-pooling operation, which generates a global feature vector. The final predictions heavily depend on the locations of the points rather than the local geometric organization around them. There are no connections in the architecture to relate points in close proximity to each other in the Euclidean space.

PointNet++
PointNet++ architecture [71] was devised to summarize point-based features in different local scales instead of on the global level. The input point cloud is partitioned into overlapping local regions, and PointNet is applied to these regions, resulting in feature vectors capturing geometric details of local neighbourhoods. Grouping and feature extraction are performed in a hierarchical manner. The architecture is composed of two types of layers: (1) Set abstraction layer (SA) and (2) Feature propagation layer (FP). The SA layer consists of two phases: sampling and grouping. In the sampling phase, P representative points are selected using the farthest point sampling algorithm. In the grouping phase, a local neighborhood of fixed radius R is formed around each representative point, resulting in overlapping local groups. In this neighborhood, M points are randomly selected to form a group. PointNet is applied individually to each group to extract features summarized over all the points in the group. FP layers are responsible for propagating the group-based feature vectors to the original points in the input point cloud. The propagation of features to a point is performed via interpolation from the features of its closest neighbours. The interpolated features are combined with the existing features of the SA phase, and PointNet is used to update the features of each point.
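The sampling and grouping phases of an SA layer can be sketched as follows. The function names and the toy parameter values are illustrative assumptions; the values of P, R and M used in the modified architecture are not reproduced here.

```python
import numpy as np

def farthest_point_sampling(points, n_samples, start=0):
    """Greedily pick points that are as far as possible from those already chosen."""
    chosen = np.empty(n_samples, dtype=int)
    chosen[0] = start
    dist = np.linalg.norm(points - points[start], axis=1)
    for i in range(1, n_samples):
        chosen[i] = dist.argmax()
        new_dist = np.linalg.norm(points - points[chosen[i]], axis=1)
        dist = np.minimum(dist, new_dist)   # distance to the nearest chosen point
    return chosen

def ball_group(points, centers, radius, m, rng):
    """Randomly select m points within `radius` of each representative point."""
    groups = []
    for c in centers:
        inside = np.flatnonzero(np.linalg.norm(points - points[c], axis=1) <= radius)
        # Sample with repetition only when the ball holds fewer than m points.
        groups.append(rng.choice(inside, size=m, replace=len(inside) < m))
    return np.stack(groups)

rng = np.random.default_rng(0)
cloud = rng.random((500, 3))
centers = farthest_point_sampling(cloud, n_samples=16)          # P = 16
groups = ball_group(cloud, centers, radius=0.3, m=8, rng=rng)   # R = 0.3, M = 8
```

Each row of `groups` would then be passed through a small PointNet to obtain one feature vector per representative point.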
In Fig. 3, the modified PointNet++ architecture for segmentation of rosebush point clouds is given. We increased the number of SA and FP layers from 4 to 5, adjusting the radius of the local regions (R) and the number of sampled points (P) at each layer to improve the performance on our plant models. We also altered the number of channels of MLPs within the SA and FP layers.

DGCNN
Dynamic Graph CNN (DGCNN) architecture [72] was designed to integrate local neighborhood information of 3D points directly into the network, rather than a separate grouping process as done in PointNet++. The local neighbourhood of a point is represented with a graph structure. A neural network module called EdgeConv is applied to extract edge features to encode the spatial relationship between a point and its K neighbours. The edge features are extracted through MLPs applied to edge representations instead of point locations.
Unlike the CNN structures used in regular grids, fixed graphs are not used. The graphs are updated since the K nearest neighbours of the point-wise features change at each layer. Only in the first layer is the geometrical proximity between points considered. In the following layers, edge representations are formed between nearest neighbours that are close in the feature space. This might be an advantage in terms of diffusing the information with respect to the proximity in the feature space; however, a multi-scale hierarchical local spatial grouping is not present in DGCNN. The local geometric structure is only captured at a very localized level, i.e. only within the nearest neighbours of a point.
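A simplified EdgeConv step with a dynamically rebuilt neighbourhood graph may look as follows. The edge representation used here (the centre feature concatenated with the difference to each neighbour) is the common DGCNN formulation; the single dense layer, the layer widths and the value of K are placeholders for this sketch.

```python
import numpy as np

def knn(features, k):
    """Indices of the k nearest neighbours of every row (self excluded)."""
    d = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    return np.argsort(d, axis=1)[:, :k]

def edge_conv(features, k, w, b):
    nbrs = knn(features, k)                                   # (N, k)
    center = np.repeat(features[:, None, :], k, axis=1)       # (N, k, C)
    edges = np.concatenate([center, features[nbrs] - center], axis=2)  # (N, k, 2C)
    out = np.maximum(edges @ w + b, 0.0)                      # shared layer on every edge
    return out.max(axis=1)                                    # max over the k edges

rng = np.random.default_rng(0)
xyz = rng.random((200, 3))
w1, b1 = rng.normal(0, 0.1, (6, 16)), np.zeros(16)
w2, b2 = rng.normal(0, 0.1, (32, 32)), np.zeros(32)
h1 = edge_conv(xyz, 8, w1, b1)   # graph built from spatial proximity
h2 = edge_conv(h1, 8, w2, b2)    # graph rebuilt from proximity in feature space
```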
The modified DGCNN architecture for segmentation is given in Fig. 4. We reduced the number of EdgeConv layers from three to two and altered the number of channels in MLPs. We increased the number of nearest neighbors K used to form edge representations in spatial and feature space from 20 to 32.

PointCNN
A convolution operator that weights the features of the neighbours of a point has been introduced with the PointCNN architecture [73]. In this convolution process, defined as X-Conv, a K × K-sized transformation matrix is predicted for K adjacent points with multi-layer perceptrons. Typical convolution layers are then applied to the transformed features. To define larger receptive fields for convolution, representative points are generated by farthest point sampling, and features resulting from X-Conv are aggregated onto these representative points. By dilating points by a factor and hierarchically applying X-Conv, point features are aggregated into fewer points, representing larger spatial areas. For segmentation, point-based features are processed through an encoder-decoder structure.
Based on the performance on the synthetic validation data, we inserted an additional fully connected layer and modified the number of channels in the fully connected layers prior to obtaining point-wise class scores.
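The core idea of X-Conv, predicting a K × K matrix from the local coordinates and applying it to the neighbour features before a convolution, can be sketched as below. This is a strongly simplified illustration: single linear maps stand in for the multi-layer perceptrons, and all parameter names and sizes are hypothetical.

```python
import numpy as np

def x_conv(local_xyz, feats, params):
    """Simplified X-Conv for one representative point with K neighbours.

    local_xyz: (K, 3) neighbour coordinates relative to the representative point.
    feats:     (K, C) input features of the neighbours.
    """
    k = len(local_xyz)
    lifted = np.maximum(local_xyz @ params["w_lift"], 0.0)       # (K, C_d) coordinate features
    f = np.concatenate([lifted, feats], axis=1)                  # (K, C_d + C)
    x = (local_xyz.reshape(-1) @ params["w_x"]).reshape(k, k)    # predicted K x K transform
    fx = x @ f                                                   # weighted / permuted features
    return np.einsum("kc,kco->o", fx, params["kernel"])          # convolution over K neighbours

rng = np.random.default_rng(0)
K, C, C_d, C_out = 4, 5, 6, 8
params = {"w_lift": rng.normal(0, 0.1, (3, C_d)),
          "w_x": rng.normal(0, 0.1, (3 * K, K * K)),
          "kernel": rng.normal(0, 0.1, (K, C_d + C, C_out))}
out = x_conv(rng.normal(size=(K, 3)), rng.normal(size=(K, C)), params)
```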

ShellNet
The ShellConv convolution operator, introduced with the ShellNet architecture [74], is applied to areas within the concentric shells of the local neighbourhood of a 3D point. The size of the sphere is increased until a fixed number of points is included in each shell. Descriptive features are extracted for each shell using statistical information of the points within the shell. Since the convolution is defined over a sequence of shells ordered outwards from the inner shell, its output is relatively independent of the ordering of the points. To remove the dependency on the order of points within each shell, maximum pooling is applied to the point-wise features in the shell. ShellConv is applied hierarchically by sub-sampling the points to representative points, thus operating on larger receptive fields at subsequent layers.
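The shell construction and pooling can be illustrated with the following sketch. It assumes that the K neighbours are split into shells of equal size by distance rank and that a plain weighted sum over the ordered shells stands in for the convolution; names and sizes are illustrative.

```python
import numpy as np

def shell_conv(center, neighbors, feats, kernel):
    """Max-pool features inside concentric shells, then convolve across the shells.

    kernel has shape (n_shells, C_in, C_out); shells are ordered from inner to outer.
    """
    n_shells = kernel.shape[0]
    order = np.argsort(np.linalg.norm(neighbors - center, axis=1))
    shells = np.array_split(order, n_shells)          # equal point count per shell
    pooled = np.stack([feats[s].max(axis=0) for s in shells])   # (n_shells, C_in)
    return np.einsum("sc,sco->o", pooled, kernel)

rng = np.random.default_rng(0)
center = rng.random(3)
neighbors = rng.random((32, 3))
feats = rng.random((32, 16))
kernel = rng.normal(0, 0.1, (4, 16, 8))
out = shell_conv(center, neighbors, feats, kernel)

# The same neighbourhood presented in a different point order.
perm = rng.permutation(len(neighbors))
out_perm = shell_conv(center, neighbors[perm], feats[perm], kernel)
```

The shells impose a natural inner-to-outer order, while the max-pooling removes the dependence on the point order inside each shell, so `out` and `out_perm` agree.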
The modified ShellNet architecture for segmentation is given in Fig. 6. Using the synthetic data, we tuned the parameters P and D, corresponding to the number of sampled points in the neighborhood and the number of shells, respectively. The number of nearest neighbours (K) that are used in convolution was kept at its default value. We also altered the number of channels in the fully connected layers prior to obtaining point-wise class scores.

RIConv
Many 3D deep learning architectures rely on the raw 3D coordinates of the input points and hence are inherently dependent on pose variations of objects in the scene. To provide some form of rotation invariance, data augmentation with rotated versions of the point clouds is commonly applied. However, the networks cannot model unseen rotations. To ensure rotation invariance, a new convolution process called RIConv is proposed in [75]. The main idea is to define the convolution process on rotation-invariant features such as angles and distances between points, rather than the raw 3D coordinates. The learned model is effective against transformations such as translation and rotation in 6-axis space. A simple binning approach for the point permutation problem is integrated into the feature extraction process. The disadvantage of aggregating distances and angles is the loss of geometric information, since two different constellations of 3D points can result in the same rotation-invariant features.
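The following sketch illustrates why features built from distances and angles are unaffected by rigid transformations. The particular feature set (distances to a reference point and to the neighbourhood centroid, and the angles at these two points) is a simplified choice in the spirit of RIConv and should be read as an assumption, not as the exact definition in [75].

```python
import numpy as np

def angle(u, v):
    """Angle between vectors u (rows) and a single vector v."""
    cos = np.sum(u * v, axis=-1) / (np.linalg.norm(u, axis=-1) * np.linalg.norm(v))
    return np.arccos(np.clip(cos, -1.0, 1.0))

def ri_features(p, neighbors):
    """Distances and angles relative to the reference point p and the centroid m."""
    m = neighbors.mean(axis=0)
    d1 = np.linalg.norm(neighbors - p, axis=1)
    d2 = np.linalg.norm(neighbors - m, axis=1)
    t1 = angle(neighbors - p, m - p)
    t2 = angle(neighbors - m, p - m)
    return np.stack([d1, d2, t1, t2], axis=1)

rng = np.random.default_rng(0)
p = rng.random(3)
neighbors = rng.random((16, 3))
q, _ = np.linalg.qr(rng.normal(size=(3, 3)))   # random orthogonal matrix
t = rng.normal(size=3)                          # random translation
before = ri_features(p, neighbors)
after = ri_features(p @ q.T + t, neighbors @ q.T + t)
```

The features before and after the transformation agree up to numerical precision, whereas the raw coordinates differ.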
The encoder-decoder architectural structure of RIConv for segmentation is given in Fig. 7. K corresponds to the number of nearest neighbours that are used in convolution. P indicates the number of sampled points, and D is the number of bins. As with ShellNet, these parameters are tuned through synthetic rosebush data for RIConv, and the number of channels at the final fully-connected layers are altered for higher performance.

Results
We adapted and tested six 3D point-based deep learning architectures for segmentation of rosebush models into their structural parts. We used recall (Re), precision (Pr) and Intersection over Union (IoU) to evaluate the success of each architecture. We denote the number of true positives, false positives and false negatives for each class as $TP_C$, $FP_C$, and $FN_C$, respectively, where $C \in \{\mathrm{Flower}, \mathrm{Leaf}, \mathrm{Stem}\}$ is the class of the structural part of a rosebush. Recall, precision and IoU per semantic class are then defined as

$$Re = \frac{TP_C}{TP_C + FN_C} \qquad (1)$$

$$Pr = \frac{TP_C}{TP_C + FP_C} \qquad (2)$$

$$IoU = \frac{TP_C}{TP_C + FP_C + FN_C} \qquad (3)$$

We also use the mean of the IoU scores over all three classes (MIoU) and the total accuracy (Acc). Acc is defined as the ratio of the number of correctly classified points to the total number of points in the model.

Using the synthetic data generated by L-studio and the real rosebush models from the ROSE-X data set, we conducted seven types of experiments with each point-based deep learning algorithm:
• Single real rosebush model for training (I): The networks are trained from scratch using the blocks extracted from a single real rosebush model.
• Two real rosebush models for training (II): The networks are trained from scratch using the blocks extracted from two real rosebush models.
• Three real rosebush models for training (III): The networks are trained from scratch using the blocks extracted from three real rosebush models. The networks trained with three real rosebush plants are called III-trained networks.
• Synthetic data for training (S): 40 of the 48 synthetic models generated by L-studio are used as training data. 8 models are reserved for validation. Using the results on the validation models, the parameters of each architecture are optimized. The corresponding trained networks are denoted as S-trained networks.
• S-trained networks updated with single real rosebush model (S+I): The S-networks, which are initially trained and optimized with synthetic data, are re-trained using the blocks extracted from a single real rosebush model. We call these updated networks S+I-trained networks.
• S-trained networks updated with two real rosebush models (S+II): In this experiment, the S-networks are re-trained using the blocks extracted from two real rosebush models. We call these updated networks S+II-trained networks.
• S-trained networks updated with three real rosebush models (S+III): In this experiment, the S-networks are re-trained using the blocks extracted from three real rosebush models. We call these updated networks S+III-trained networks.

The validation set of the synthetic data is used to modify the networks, to determine the hyper-parameters of the networks, and to set other parameters such as block and grid sizes. For the real plant models from the ROSE-X data set, 20% of the blocks are randomly chosen for validation from the full set of blocks reserved for training. This validation set of the real data is used to determine experimentally the layers whose weights will be updated during transfer learning [77].
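The evaluation measures defined above can be computed from per-point labels as in the following minimal sketch. Recall follows Eq. (1); precision and IoU are assumed here to take the usual forms TP/(TP+FP) and TP/(TP+FP+FN). The function name, the label strings, and the convention of returning 0 for an empty denominator are illustrative assumptions, not part of the original evaluation code.

```python
from collections import Counter

CLASSES = ("Flower", "Leaf", "Stem")


def segmentation_metrics(true_labels, pred_labels, classes=CLASSES):
    """Per-class Re, Pr and IoU, plus MIoU and Acc, from per-point labels."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("label sequences must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(true_labels, pred_labels):
        if t == p:
            tp[t] += 1
        else:
            fn[t] += 1  # a point of class t was missed
            fp[p] += 1  # a point was wrongly assigned to class p

    def ratio(num, den):
        # Assumed convention: an empty denominator yields 0.0.
        return num / den if den else 0.0

    per_class = {}
    for c in classes:
        per_class[c] = {
            "Re": ratio(tp[c], tp[c] + fn[c]),
            "Pr": ratio(tp[c], tp[c] + fp[c]),
            "IoU": ratio(tp[c], tp[c] + fp[c] + fn[c]),
        }

    # MIoU: mean of the per-class IoU scores; Acc: correct points / all points.
    miou = sum(m["IoU"] for m in per_class.values()) / len(classes)
    acc = ratio(sum(tp.values()), len(true_labels))
    return per_class, miou, acc
```

Because Acc is dominated by the most frequent class, a network that favours the leaf class can still reach a high Acc while its MIoU stays low, which is why both measures are reported.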
For the experiments where synthetic data is not involved (I, II, and III) the default settings of the architectures (such as number of features extracted at each layer) are left unchanged. For details of the default settings, please refer to the original articles introducing the architectures.
For the experiments where synthetic data is used to pre-train the modified architectures (S+I, S+II, and S+III), pre-training was stopped after 250 epochs. Similarly, retraining with real data was stopped after 250 epochs. In all cases, the weights of the last epoch are kept for testing.
The hyper-parameters of the networks determined using the synthetic data are given in Table 3. Table 4 gives the segmentation results of the S-trained networks on the 8 synthetic validation models. PointNet++, DGCNN, ShellNet and PointCNN scored over 90% on all measures. For the synthetic models, local geometric variations at the organ level (e.g. leaf shape, branch thickness) are limited to the variations imposed by the generation rules of the simulator. Hence, the networks were easily able to model the geometric characteristics that distinguish the three organs. PointNet produced an MIoU below 60% due to its inability to encode geometric information at various scales.
For the rest of the experiments, the networks are tested on the point clouds extracted from 8 real rosebush models from the ROSE-X data set through block partitioning. The predictions on the blocks are merged to obtain the final segmentation of the full plant models as described in the section for data preprocessing.
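The merging step is described in the data preprocessing section, which is not reproduced here. One plausible way to combine block-level predictions is sketched below, assuming that blocks may overlap, that each block stores the indices of its points in the full model, and that conflicts are resolved by majority vote. The function name and data layout are hypothetical and are not taken from the original pipeline.

```python
from collections import Counter, defaultdict


def merge_block_predictions(blocks, num_points):
    """Combine per-block labels into one label per point of the full model.

    blocks: iterable of (point_indices, predicted_labels) pairs, where
    point_indices refer to positions in the full point cloud
    (hypothetical layout, assumed for this sketch).
    """
    votes = defaultdict(Counter)
    for point_indices, predicted_labels in blocks:
        for idx, label in zip(point_indices, predicted_labels):
            votes[idx][label] += 1

    # None marks points that are not covered by any block.
    merged = [None] * num_points
    for idx, counter in votes.items():
        # Majority vote; ties go to the label that was seen first.
        merged[idx] = counter.most_common(1)[0][0]
    return merged
```

If the blocks do not overlap, every point receives exactly one vote and the function reduces to scattering the block labels back to their original positions.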
In Fig. 8, we visualize the segmentation results on a sample real rosebush model obtained with III-trained networks, i.e. only three real rosebush models were used for training. In Fig. 9, the segmentation results on the same test model with S+III-trained networks are given. Table 5 gives the segmentation results obtained with PointNet on the real test plants. Columns in Table 5 correspond to the segmentation results of the seven types of experiments. The results correspond to the performance values averaged over 8 models. Despite the increase in the training data and the incorporation of synthetic data, the segmentation performance of PointNet is low, especially for the flower and stem parts. Not being able to capture the distinguishing geometrical structures of the parts, PointNet seems to favor the leaf class due to the imbalance in the training data (Fig. 8b).
The segmentation results on the 8 real test rosebush models yielded by PointNet++ with the seven experimental setups are given in Table 6. Increasing the training data from a single rosebush model to two and then three models led to an increase in performance, especially for the stem class. The use of synthetic data alone for training was not effective; however, when the network pre-trained with synthetic data was updated with real rosebush models, the performance improved. The results with PointNet++ are promising, with an accuracy rate over 95% and a mean IoU rate over 85%. The main sources of errors are the confusion between stems and thick parts of flowers (Fig. 10a), between leaves and petals of flowers (Fig. 10b), and between petioles and leaves (Fig. 10c, 10d).
The effect of using synthetic data on the segmentation results is even more pronounced for DGCNN (Table 7), PointCNN (Table 8), and ShellNet (Table 9). Rather than training a network with real data from scratch (as in the cases of I, II, and III), using the real data to fine-tune a network trained by synthetic data (as in the cases of S+I, S+II, and S+III) boosts the performance, especially for the stem and flower classes.
We can observe from Fig. 8d that with DGCNN, parts of main stems were classified as leaves and the flower class is not retrieved at all (27.94% and 7.12% recall rates for the flower and stem classes, respectively, in Table 7). We conjecture that DGCNN is only encoding the geometric structure at the very local level; the spatial receptive field was limited to the K-neighbours of each point in 3D. The data imbalance in the training data in favor of leaves limited the capacity of DGCNN to learn features from stem and flower regions. The effect of data imbalance was alleviated by incorporating synthetic data in the training data, as seen in Fig. 9d. With pre-training on synthetic models, DGCNN was able to capture branch and flower structures. Despite the incorporation of synthetic data, DGCNN's performance lags behind PointNet++, PointCNN, and ShellNet. These three architectures, in contrast to DGCNN, have the capacity to increase the size of the spatial receptive fields through successive re-grouping and feature aggregation. Examples of erroneous segmentation results produced by DGCNN are visualized in Fig. 11. Classifying petioles as leaves (Fig. 11a) is a common error for all architectures; however, it occurs more frequently with DGCNN. Confusion between leaves and flowers is present (Fig. 11b). Surfaces of main stems can be classified as leaf points (Fig. 11c). In some cases, boundaries of leaves are assigned to the stem class (Fig. 11d).
The second best results after PointNet++ were obtained with PointCNN (Table 8). Examples of erroneous segmentation results produced by PointCNN are shown in Fig. 12. We observe petioles classified as leaves (Fig. 12a and 12d), and elongated and thick leaves classified as flowers (Fig. 12b). There is also confusion between leaves and petals (Fig. 12c). In some cases, main stem points close to leaves are classified as leaf points (Fig. 12d).
The quantitative performance results obtained with the ShellNet architecture (Table 9) are close to those of PointCNN. The two architectures use similar strategies to group local points; they both recursively sub-sample the point cloud by selecting representative points and aggregate features from the closest neighbours of these representatives. In PointCNN, however, aggregation through convolution is performed over a predicted ordering of all the neighbour points, a property to which we attribute its higher performance compared to ShellNet. With ShellNet, as with the other architectures, petioles (Fig. 13a) and petals (Fig. 13b) were occasionally confused with leaf points. Touching leaves resulting in thick structures are also a cause of error (Fig. 13c). Another source of error with ShellNet is the interference of points from close parts, such as the misclassification of leaf points as stems (Fig. 13d).
The segmentation results obtained with RIConv (Table 10) fall behind those of all the architectures except PointNet. The local regions were extracted in the same way as in ShellNet; however, the use of rotation invariant features resulted in a significant loss of geometric information about the constellation of the points, which is especially important in distinguishing plant parts.
All networks, with the exception of PointNet, when trained with synthetic data only, yield relatively high recall and low precision for the flower class on real rosebush plants. We conjecture that the reason is the mismatch of the flower class between synthetic data and real plants in terms of both geometrical structure and the ratio of occurrence. High recall together with low precision for the flower class means that the networks are biased towards classifying a significant portion of leaves as flowers, causing low recall values for the leaves. When the networks are updated with real training plants, this bias is compensated and the precision for the flower class and the recall for the leaf class improve.
In general, the MIoU increases as the networks are updated with more real training data. However, for PointCNN (Table 8), the improvement between the cases S+II and S+III is not significant, and for RIConv (Table 10) the MIoU drops by about 1% with S+III compared to S+II. For both networks, the recall for the flower class decreases as the number of real training plants is increased from two to three. More petioles are classified as leaves, as these two networks start to favor classifying elongated structures as leaves, which in turn translates into a drop in the precision of leaves. Despite this observation, PointCNN gives the second best IoU for the flower class among all the networks for the case S+III (Table 11).
To summarize the results and to demonstrate the effect of incorporation of synthetic models, we give the segmentation performances of all architectures with III-trained and S+III-trained networks in Table 11. The use of synthetic data was beneficial for almost all classes and all architectures, except for PointNet. There is a slight decrease in the IoU value for the flower class with RIConv, which is compensated by a significant increase in the performance for the stem class.
We can also observe from Table 11 that RIConv performed poorly compared to the other architectures due to the information loss with rotation invariant features. DGCNN used a single spatial receptive field at the very local level and opted for feature proximity in a non-local way, thereby missing the multi-scale spatial variability in plant parts.
The best results were obtained with PointNet++, with or without the use of synthetic data for training. The hierarchically organized local regions for feature extraction with PointNet++ are defined in terms of metric radius. The spatial hierarchy is flexible and can be adjusted without changing the network structure. The next best two methods are PointCNN and ShellNet, both of which hierarchically regroup points and aggregate features within the network. However, the neighbourhoods are defined with respect to the K-neighbourhood of points instead of metric radius. Therefore, it is not straightforward to adjust the size of the receptive fields for these architectures while taking into account both the size of the plant part structures and the point density of the 3D point cloud.

Discussion
In their default settings, the design parameters (such as number of features and layers) of the six networks and other hyperparameters (such as the radii of local regions) were originally adjusted for 3D datasets which contain point cloud scenes of indoor environments and cityscapes. The general practice for adjusting such parameters is to search for the best-performing settings through experimentation with a validation set. In our case, since we have limited data for real rosebush models, we used a subset of the synthetic dataset as the validation set, systematically varied the design parameters without altering the general structure, and modified each network so as to maximize its performance on the validation set. The objective was to provide a fair comparison among the six networks, whose default parameters were determined using data domains different from plant data. Methodological research is ongoing to automatically adjust not only the hyperparameters but the entire architecture of the network [83]. So far, the effectiveness of genetic algorithms for the search of design parameters has been demonstrated with convolutional networks [84]. Exploring such approaches with point cloud based neural networks is an interesting perspective.
While designing a 3D point-based architecture to operate effectively on plant data, an important consideration is the multi-scale and self-similar nature of plants. The architecture should be able to handle multiple, hierarchical spatial receptive fields in the network, and their sizes should be easily tuned to the scales of various structures in the plants. The multi-scale feature extraction scheme is also necessary to account for intra-class size variations, such as variations in branch diameter or leaf length, and intra-class geometric variations, such as the diverse range of curvature on the branches and leaves. In addition, grouping features with respect to their proximity in the feature space can lead to non-local similarity modeling that captures the repetitive structures inherent to plants.
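The Results contrast local regions defined by a metric radius (PointNet++) with regions defined by the K nearest neighbours of a point (DGCNN, PointCNN, ShellNet). The toy sketch below illustrates why the second choice ties the receptive field size to point density. It assumes a hypothetical "stem" made of points on a straight segment sampled at two spacings; all names and numerical values are illustrative and do not come from the networks' implementations.

```python
import math


def knn_radius(points, query, k):
    """Spatial extent (largest distance) of the k nearest neighbours of query.

    The query itself counts as its own nearest neighbour when it is
    contained in points.
    """
    distances = sorted(math.dist(query, p) for p in points)
    return distances[k - 1]


def ball_count(points, query, radius):
    """Number of points that fall inside a metric ball around query."""
    return sum(1 for p in points if math.dist(query, p) <= radius)


def sample_stem(length, spacing):
    """Toy 'stem': points on a straight segment with a given spacing."""
    n = int(round(length / spacing)) + 1
    return [(i * spacing, 0.0, 0.0) for i in range(n)]


sparse = sample_stem(length=1.0, spacing=0.1)    # coarse sampling
dense = sample_stem(length=1.0, spacing=0.025)   # four times denser
query = (0.5, 0.0, 0.0)

for name, cloud in (("sparse", sparse), ("dense", dense)):
    print(name,
          "extent of 5-NN region:", round(knn_radius(cloud, query, 5), 3),
          "points within r=0.26:", ball_count(cloud, query, 0.26))
```

With a fixed K, the region covered around the query shrinks as the sampling gets denser, so the same K spans different portions of the same structure. With a fixed radius, the covered extent stays the same and only the number of points inside changes.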
The robustness of the architecture to heterogeneous point density, missing information and reconstruction noise is an important factor, especially for 3D data obtained through structure from motion. The full real plant models in the ROSE-X data set, together with the synthetic data we employed in this work, can be instrumental for a systematic analysis of the responses of the architectures to low quality and noisy 3D data through simulation of acquisition systems such as ToF cameras and LiDARs in virtual environments. Also, data augmentation is possible by introducing variable point density and artificial noise to the point clouds. However, the architectures should eventually be tested on data acquired by low-cost systems, including structure from motion.
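The augmentation mentioned above, introducing variable point density and artificial noise, could be sketched as follows. This is a minimal illustration that assumes independent Gaussian jitter on each coordinate and a position-dependent probability of keeping a point. The function names, the height-dependent density profile, and all parameter values are hypothetical and were not used in the experiments reported here.

```python
import random


def augment_point_cloud(points, labels, keep_prob_fn, noise_sigma, seed=None):
    """Drop points with a position-dependent probability and jitter the rest.

    keep_prob_fn maps a point (x, y, z) to the probability of keeping it;
    noise_sigma is the standard deviation of the Gaussian jitter.
    """
    rng = random.Random(seed)
    aug_points, aug_labels = [], []
    for (x, y, z), label in zip(points, labels):
        if rng.random() >= keep_prob_fn((x, y, z)):
            continue  # point dropped: lowers the local density
        aug_points.append((
            x + rng.gauss(0.0, noise_sigma),
            y + rng.gauss(0.0, noise_sigma),
            z + rng.gauss(0.0, noise_sigma),
        ))
        aug_labels.append(label)  # labels follow their points unchanged
    return aug_points, aug_labels


def height_dependent_keep(point, z_max=1.0, low=0.3, high=1.0):
    """Hypothetical density profile: fewer points survive near the top."""
    t = min(max(point[2] / z_max, 0.0), 1.0)
    return high - (high - low) * t
```

A call such as `augment_point_cloud(points, labels, height_dependent_keep, noise_sigma=0.002, seed=0)` would produce one augmented copy of a labeled model. The noise level should be chosen relative to the scale of the thinnest structures, such as petioles, so that the jitter does not erase them.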
Another issue is that the variability of local parts is greatly affected by the intricate plant structure, which brings distinct parts close to each other. The training data should be able to account for diverse local geometric occurrences, such as touching leaves or branches due to dense foliage. More realistic synthetic data or plant-specific augmentation techniques ensuring folding of leaves and branches can help enrich the labeled data.

Table 11 Segmentation results on 8 real rosebush models for all architectures. The first row for each class corresponds to the IoU results of networks trained with three real rosebush models (III). The second row for each class gives the IoU results for the case where the networks were trained with synthetic models and updated with three real rosebush models (S+III). The third row for each class, shown in bold, gives the gain in IoU obtained by incorporating synthetic models.

Conclusion
We modified six recent 3D point-based deep learning architectures, PointNet, PointNet++, DGCNN, PointCNN, ShellNet, and RIConv, for segmentation of 3D models of real rosebush plants into their structural parts. We used the annotated 3D models in the ROSE-X data set for training and testing the networks. We also conducted experiments where the networks were pre-trained with synthetic rosebush models generated by the L-studio software and then updated with real rosebush data. The results indicate that pre-training with synthetic data boosts the performance of all networks except PointNet. The best segmentation results were obtained by PointNet++, with a mean IoU rate of 86.19%. We attribute this success to the ease of determining the size of the hierarchical local regions used to extract multi-scale features with PointNet++. RIConv was not as effective due to its reliance on rotation invariant features that provide insufficient local geometric information. DGCNN, PointCNN, and ShellNet produced promising results; however, defining local regions for feature extraction by the K-neighbourhood of points is less practical for modeling plant geometry, since the optimum K for each scale depends on both the size of the plant part structures and the point density of the 3D point cloud.