Hamid Serry, WMG Graduate Trainee, University of Warwick
In the last post, we discussed the probability distribution of classes across the KITTI dataset, and how we aimed to, and succeeded in, replicating this distribution in our tests for Anyverse’s generated dataset. This week we are taking a look at some generation trials that have brought to light two new aspects to focus on: the minimum pixel size of bounding boxes per class, and the level of occlusion of different objects.
Minimum Pixel Size
Within the context of object detection models, a neural network cannot reliably extract features from objects with very small bounding boxes. A region only a few pixels wide contains no recognizable features from which to classify any of the classes, which harms the training of the network. It also hurts performance when the model is tested on a fresh dataset.
In our testing environment within Anyverse Studio, we initially set the maximum distance to 95 meters from the Ego reference vehicle, but upon generating some images we found that this left the minimum bounding boxes for some classes only a couple of pixels wide. This is far too small to be useful for our use case of training a neural network, so we undertook a study of the minimum bounding box pixel sizes in the KITTI dataset. The results for the training split of the dataset are shown in Table 1.
Table 1 – Minimum pixel sizes of bounding boxes in the KITTI dataset
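The per-class minimums above can be computed directly from KITTI-format label files, where fields 5–8 of each line are the bounding box left, top, right, and bottom in pixels. The sketch below uses a couple of hypothetical label lines inline rather than the real `training/label_2/*.txt` files:

```python
from collections import defaultdict

# Hypothetical KITTI-format label lines (class, truncation, occlusion,
# alpha, bbox left/top/right/bottom, 3D fields). Values are illustrative,
# not taken from the real dataset.
SAMPLE_LABELS = """\
Car 0.00 0 -1.58 587.01 173.33 614.12 200.12 1.65 1.67 3.64 -0.65 1.71 46.70 -1.59
Pedestrian 0.00 0 -0.20 423.17 173.67 433.17 224.03 1.57 0.50 1.20 -5.35 1.86 25.01 -0.42
Car 0.00 0 1.85 387.63 181.54 423.81 203.12 1.67 1.87 3.69 -16.53 2.39 58.49 1.57
"""

def min_bbox_sizes(lines):
    """Return {class: (min_width, min_height)} over all bounding boxes."""
    mins = defaultdict(lambda: (float("inf"), float("inf")))
    for line in lines:
        fields = line.split()
        if not fields or fields[0] == "DontCare":
            continue  # KITTI marks ignored regions as DontCare
        left, top, right, bottom = map(float, fields[4:8])
        min_w, min_h = mins[fields[0]]
        mins[fields[0]] = (min(min_w, right - left), min(min_h, bottom - top))
    return dict(mins)

print(min_bbox_sizes(SAMPLE_LABELS.splitlines()))
```

Running this over the full training split (one call per label file, accumulating into the same dictionary) yields the kind of per-class minimums reported in Table 1.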
Some classes, such as Car, have a much lower minimum width than expected; this is due to a few anomalous, exceptionally small bounding boxes. In the testing and validation subsets, by comparison, the minimum Car width was just over 5 pixels. Table 2 below shows, for each class, how many labels fall under 20 pixels, alongside the total number of labels.
Table 2 – Number of bounding boxes under 20 px in width and height, respectively
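A count like Table 2 only needs the class name and the box width from each label line. A minimal sketch, using hypothetical labels truncated to the first eight KITTI fields (class, truncation, occlusion, alpha, bbox left/top/right/bottom):

```python
from collections import Counter

# Hypothetical KITTI-style labels; only fields 0 and 4-7 are read.
LABEL_LINES = [
    "Car 0.0 0 0.0 587.0 173.3 614.1 200.1",          # 27.1 px wide
    "Car 0.0 0 0.0 100.0 100.0 112.0 115.0",          # 12 px wide -> small
    "Pedestrian 0.0 0 0.0 423.2 173.7 433.2 224.0",   # 10 px wide -> small
    "Cyclist 0.0 0 0.0 200.0 150.0 245.0 210.0",      # 45 px wide
]

def count_small_boxes(lines, threshold=20.0):
    """Per class, count boxes narrower than `threshold` pixels,
    alongside the total number of boxes for that class."""
    small, total = Counter(), Counter()
    for line in lines:
        fields = line.split()
        cls = fields[0]
        left, _, right, _ = map(float, fields[4:8])
        total[cls] += 1
        if right - left < threshold:
            small[cls] += 1
    return {cls: (small[cls], total[cls]) for cls in total}

print(count_small_boxes(LABEL_LINES))
# e.g. {'Car': (1, 2), 'Pedestrian': (1, 1), 'Cyclist': (0, 1)}
```

The same loop with `bottom - top` in place of `right - left` gives the height counts.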
The large number of cars falling under 20 px is mostly explained by the fact that a single image can contain anywhere between 0 and 18 cars, so there are often many cars very far away. This is not the case for the Cyclist and Pedestrian classes, which are in general much smaller (especially in width), so for these classes the height can be compared instead. Further analysis showed that using a lower filtering threshold for these classes still gave reasonable training results overall. Table 3 shows the final per-class bounding box sizes used for training.
Table 3 – Minimum per-class widths and heights for bounding boxes
Occlusion
Occlusion refers to the proportion of an object that is covered by another object in the scene. It can affect the training of a neural network: a bounding box may mark the ground truth of an object that is only 20% visible, unjustly penalizing the network for failing to classify it correctly. Heavy occlusion can also skew the mAP of a test or validation set, since many boxes that are only a few pixels wide would not be classified correctly by a person, let alone an algorithm. Table 4 shows the proportion of labels in the generation test with an occlusion of less than 50%.
Table 4 – Labels per class with an occlusion of less than 50%
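A proportion like Table 4 can be derived from per-object occlusion fractions in the ground truth. Note this is a sketch under assumptions: the record layout below is illustrative, not Anyverse's actual annotation schema (KITTI itself only provides a coarse integer occlusion level, not a fraction):

```python
from collections import Counter

# Hypothetical per-label records: (class, occluded fraction in [0, 1]).
labels = [
    ("Car", 0.10), ("Car", 0.75), ("Car", 0.30),
    ("Pedestrian", 0.05), ("Cyclist", 0.60),
]

def visible_share(labels, max_occlusion=0.5):
    """Per class, the share of labels occluded less than `max_occlusion`."""
    visible, total = Counter(), Counter()
    for cls, occ in labels:
        total[cls] += 1
        if occ < max_occlusion:
            visible[cls] += 1
    return {cls: visible[cls] / total[cls] for cls in total}

print(visible_share(labels))
# e.g. {'Car': 0.666..., 'Pedestrian': 1.0, 'Cyclist': 0.0}
```

The same predicate can be reused at generation time to discard heavily occluded labels before training.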
This week has given us a lot to think about regarding potential pitfalls when generating a new dataset, in particular ensuring that every generated label is actually usable by the downstream network, so that it achieves the best possible mean average precision in object detection. One of the advantages of synthetic data is that you can overcome these pitfalls by changing the generation parameters, producing more data that helps your network perform and generalize better.
Next we will go through the generation process to come, but in the meantime we are going to run some experiments with the current synthetic dataset, so that we have benchmark data to compare against once we complete the dataset with more samples that avoid the small-bounding-box and high-occlusion pitfalls we have detected. See you then!
Don’t miss the next chapter
Don’t miss the third chapter of this insights series to learn more about how we process the RAW data coming from the sensor to adapt it to what the human eye sees and what different devices can display.
Read other chapters >>>
With Anyverse™, you can accurately simulate any camera sensor and decide which one will perform better with your perception system. No more complex and expensive experiments with real devices, thanks to our state-of-the-art photometric pipeline.
Need to know more?