Combating Label Noise with Abstention
Introduction
- When annotating large-scale datasets, a certain amount of erroneously labeled data is bound to occur.
- Significant label noise can degrade generalization performance, so it is better to eliminate the noisy data and train on just the cleaner subset.
- In this context, the paper proposes an approach for abstaining on confusing samples during training; a loss function is proposed to achieve this.
- Although label noise is a well-studied problem in machine learning, there is little work on identifying and ignoring noisy samples during training itself.
- The paper makes no assumptions about the amount of label noise or the existence of a trusted or clean dataset.
- A DNN trained with abstention can be used as an effective data cleaner, yielding significant performance benefits for downstream training on the cleaner set.
- This is the first work to show how abstention training can be used to identify and eliminate noisy labels to improve classification performance.
Approach
- The loss function for the Deep Abstaining Classifier (DAC) is a modified version of the standard k-class cross entropy (per-sample):

  L(x_j) = (1 - p_{k+1}) * ( -Σ_{i=1}^{k} t_i log( p_i / (1 - p_{k+1}) ) ) + α log( 1 / (1 - p_{k+1}) )

- The DAC has an additional (k+1)-th output, p_{k+1}, which indicates the probability of abstention. Substituting p_{k+1} = 0 in the above equation recovers the standard cross entropy.
- The second term in the loss function penalizes abstention and is weighted by α.
- The degree of abstention depends on the value of α:
- The higher the value of α, the greater the penalty for abstention, so the classifier learns to never abstain.
- The lower the value of α, the more the classifier tries to abstain on everything.
- When α lies between these extremes, abstention depends on the cross-entropy error made by the sample on learning the true class.
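The per-sample loss above can be sketched in NumPy as follows. This is a minimal illustration, not the paper's implementation: the function name, the stability trick, and the toy inputs are my own, and the softmax is assumed to already include the extra abstention output as its last entry.

```python
import numpy as np

def dac_loss(logits, true_class, alpha):
    """Sketch of the per-sample DAC loss (k real classes + 1 abstention output).

    logits: length-(k+1) array; the last entry is the abstention logit.
    true_class: index of the true class in [0, k).
    alpha: weight of the abstention penalty.
    """
    p = np.exp(logits - logits.max())     # numerically stable softmax
    p /= p.sum()
    p_abstain = p[-1]                     # p_{k+1}
    # Cross entropy on the class mass renormalized to exclude abstention,
    # down-weighted as the abstention mass grows ...
    ce = -np.log(p[true_class] / (1.0 - p_abstain))
    # ... plus a penalty term that grows as the model abstains more.
    penalty = alpha * np.log(1.0 / (1.0 - p_abstain))
    return (1.0 - p_abstain) * ce + penalty
```

With a near-zero abstention logit this reduces to the ordinary cross entropy, and raising α makes abstaining strictly more expensive, matching the behavior described above.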
- The paper also gives a proof showing that learning on the true classes persists even in the presence of abstention.
- α is auto-tuned during training: abstention-free training is performed for an initial number of warm-up epochs, after which abstention comes into the picture.
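One way to picture the schedule just described is a warm-up phase with abstention disabled, followed by a gradual increase of α. This is a hypothetical sketch: the paper auto-tunes α from training statistics, and the warm-up length, initial value, and linear ramp below are illustrative assumptions, not the paper's rule.

```python
def alpha_schedule(epoch, warmup_epochs=20, alpha_init=0.1,
                   alpha_max=1.0, ramp_epochs=80):
    """Illustrative α schedule for abstention training.

    Returns None during warm-up (train with plain cross entropy, no
    abstention), then ramps α linearly from alpha_init to alpha_max.
    All hyperparameter values here are assumed, not from the paper.
    """
    if epoch < warmup_epochs:
        return None  # abstention-free phase
    t = min((epoch - warmup_epochs) / ramp_epochs, 1.0)
    return alpha_init + t * (alpha_max - alpha_init)
```

The intuition is that the network first learns ordinary class features, and only then is abstention made cheap enough to absorb the samples it still cannot fit.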
- The representational power of the DAC and a plain DNN is the same; the difference lies in the optimization induced by the loss function.
- Abstention is used both during training and at inference. This gives the DNN the option to abstain on confusing samples, mitigating the misclassification loss while incurring an abstention penalty.
Experiments and Results
DAC as a Learner of Structured Noise
- Structured noise: noise that exhibits a pattern attributable to the training data being corrupted in some non-arbitrary or systematic manner.
- The DAC learns features associated with difficult or confusing samples and learns to abstain based on these features.
- When the DAC encounters data with unknown features, it abstains from predicting on these samples and hands the task over to an upstream (possibly human) expert.
- Eliminating the data abstained on by the DAC and training a model on the cleaner set gives a significant performance boost.
- The DAC can reliably pick up and abstain on samples where the noise is correlated with an underlying feature.
Learning in the presence of Unstructured Noise: DAC as a data cleaner
- Unstructured noise: noisy labels that occur arbitrarily on some fraction of the data.
- The performance of the DAC is compared with two state-of-the-art approaches for training with noisy labels on image data: MentorNet (Jiang et al., 2018) and Generalized Cross Entropy (Zhang & Sabuncu, 2018).
- As a best-case model in the data-cleaning scenario, the paper also reports the performance of a hypothetical oracle that has perfect information about the corrupted labels and eliminates only those samples.
- At the point of best validation error, if there continues to be training error on the non-abstained portion of the data, this is likely indicative of label noise; when training with standard cross entropy, these samples need to be removed to get better performance.
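The data-cleaning step described above can be sketched as filtering out the samples the trained DAC abstains on before retraining with standard cross entropy. The thresholding rule and names below are my own illustration; the paper simply eliminates the abstained samples.

```python
import numpy as np

def clean_indices(probs, threshold=0.5):
    """Return indices of samples a trained DAC does not abstain on.

    probs: (n, k+1) softmax outputs; the last column is the abstention
    probability p_{k+1}. A sample is dropped when its abstention mass
    exceeds `threshold` (an illustrative choice, not from the paper).
    """
    abstain_p = probs[:, -1]
    return np.where(abstain_p <= threshold)[0]
```

The retained indices define the cleaner subset used for the downstream (non-abstaining) training run.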
References:
- Jiang, L., Zhou, Z., Leung, T., Li, L.-J., and Fei-Fei, L. Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In International Conference on Machine Learning, pp. 2309–2318, 2018.
- Zhang, Z. and Sabuncu, M. R. Generalized cross entropy loss for training deep neural networks with noisy labels. arXiv preprint arXiv:1805.07836, 2018.