Journal Papers

On the purity of training and testing data for learning: The case of pedestrian detection

Abstract:
The training and evaluation of learning algorithms depend critically on the quality of the data samples. We denote as pure the samples that identify the class of objects of interest clearly and without ambiguity. For instance, in pedestrian detection, we consider as pure the samples containing persons who are fully visible and imaged at a good resolution (larger than the detector window). The exclusive use of pure samples entails two kinds of problems. In training, it biases the detector to neglect slightly occluded and small-sized samples (which we denote as impure), thus reducing its detection rate in real-world applications. In testing, it leads to unfair evaluation and comparison of different detectors, since slightly impure samples, when detected, can be counted as false positives. In this paper we study how a sensible use of impure samples can benefit both the training and the evaluation of pedestrian detection algorithms. We improve the labelling of one of the most widely used pedestrian data sets (INRIA), taking into account the degree of sample impurity. We observe that including partially occluded pedestrians in training improves performance, not only on partially visible examples but also on fully visible ones. Furthermore, we find that including pedestrians imaged at low resolutions is beneficial for detecting pedestrians in the same range of heights, while leaving the performance on pure samples unchanged. However, including samples with too high a degree of impurity degrades performance, so a careful balance must be found. The proposed labelling will allow further studies on the role of impure samples in training pedestrian detectors and on devising fairer comparison metrics between different algorithms.
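
The evaluation issue described in the abstract (a detection that lands on a slightly impure sample being counted as a false positive) can be illustrated with a minimal sketch. It is not the paper's protocol: the `pure` flag, the greedy matching by score, and the 0.5 IoU threshold are all assumptions chosen for illustration. The idea shown is one possible fairer scoring, in which a detection overlapping an impure annotation is set aside rather than penalised.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class GroundTruth:
    box: Box
    pure: bool  # hypothetical flag: fully visible and large enough


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def count_outcomes(
    detections: List[Tuple[Box, float]],
    truths: List[GroundTruth],
    threshold: float = 0.5,  # assumed overlap criterion
) -> Dict[str, int]:
    """Score detections, setting aside those that hit impure annotations."""
    matched = set()
    tp = fp = ignored = 0
    # Process detections from highest to lowest confidence.
    for box, _score in sorted(detections, key=lambda d: d[1], reverse=True):
        best_index, best_overlap = -1, threshold
        for i, gt in enumerate(truths):
            if gt.pure and i not in matched:
                overlap = iou(box, gt.box)
                if overlap >= best_overlap:
                    best_index, best_overlap = i, overlap
        if best_index >= 0:
            matched.add(best_index)
            tp += 1
        elif any(not gt.pure and iou(box, gt.box) >= threshold for gt in truths):
            ignored += 1  # overlaps an impure sample: neither TP nor FP
        else:
            fp += 1
    missed = sum(1 for i, gt in enumerate(truths) if gt.pure and i not in matched)
    return {"tp": tp, "fp": fp, "ignored": ignored, "missed": missed}


truths = [
    GroundTruth((0, 0, 10, 20), pure=True),
    GroundTruth((30, 0, 40, 20), pure=False),  # e.g. partially occluded
]
detections = [
    ((0, 0, 10, 20), 0.9),
    ((30, 0, 40, 20), 0.8),
    ((100, 100, 110, 120), 0.7),
]
result = count_outcomes(detections, truths)
```

In this toy case the second detection coincides with the impure annotation and is set aside, whereas an evaluation that only knows about pure samples would count it as a false positive.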

Neurocomputing, Vol. 150, Part A, pp. 214–226