### Introduction

• When annotating large-scale datasets, some amount of erroneously labeled data is bound to occur.

• Significant amounts of label noise can degrade generalization performance, so it is better to eliminate the noisy data and train on just the cleaner subset.

• In this context, the paper proposes an approach for abstaining on confusing samples during training; a loss function is proposed to achieve this.

• Though label noise is a well-studied problem in machine learning, there is little work on identifying and ignoring noisy samples during training itself.

• The paper makes no assumptions about the amount of label noise or the existence of a trusted or clean dataset.

• A DNN trained with abstention can be used as an effective data cleaner, leading to significant performance benefits for downstream training on the cleaned set.

• This is the first work to show how abstention training can be used to identify and eliminate noisy labels for improving classification performance.

### Approach

• The loss function for the Deep Abstaining Classifier (DAC) is a modified version of the standard $k$-class cross entropy (per sample):

$\mathcal{L}(x_{j}) = (1-p_{k+1})\left(-\sum_{i=1}^{k}t_{i}\log\frac{p_{i}}{1 - p_{k+1}}\right) + \alpha \log\frac{1}{1-p_{k+1}}$

• The DAC has an additional $(k+1)^{st}$ output; $p_{k+1}$ indicates the probability of abstention. Substituting $p_{k+1} = 0$ into the above equation recovers the standard cross entropy.

• The second term in the loss function penalizes abstention and is weighted by $\alpha \geq 0$.

• The degree of abstention depends on the value of $\alpha$:
  • The higher the value of $\alpha$, the greater the penalty for abstention, so the classifier learns to never abstain.
  • The lower the value of $\alpha$, the cheaper abstention becomes, so the classifier tries to abstain on everything.
  • When $\alpha$ lies between these extremes, abstention on a sample depends on the cross-entropy error the classifier makes on that sample's true class.

• The paper also provides a proof that learning on the true classes persists even in the presence of abstention.
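As a concrete illustration, the per-sample loss above can be sketched in plain Python (the function name and list-based inputs are illustrative; the paper operates on batched softmax outputs):

```python
import math

def dac_loss(p, t, alpha):
    """Per-sample DAC loss (illustrative sketch).

    p:     softmax outputs of length k+1; p[-1] is the abstention
           probability p_{k+1}.
    t:     one-hot true-label vector of length k.
    alpha: abstention penalty weight, alpha >= 0.
    """
    p_abstain = p[-1]
    # Cross entropy over the k real classes, with class probabilities
    # renormalized by the non-abstained mass (1 - p_{k+1}).
    ce = -sum(t_i * math.log(p_i / (1.0 - p_abstain))
              for t_i, p_i in zip(t, p[:-1]) if t_i > 0)
    # First term: cross entropy scaled down as abstention grows;
    # second term: penalty for abstaining, weighted by alpha.
    return (1.0 - p_abstain) * ce + alpha * math.log(1.0 / (1.0 - p_abstain))
```

With $p_{k+1} = 0$ the second term vanishes and the first term reduces to the standard cross entropy, matching the observation above.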

• $\alpha$ is auto-tuned during training: abstention-free training is performed for the first $L$ epochs, and from the $(L+1)^{th}$ epoch onwards the abstention term comes into play.
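The auto-tuning rule itself is detailed in the paper; as an illustrative stand-in, a schedule with an abstention-free warm-up followed by a linear ramp (the ramp, names, and default values below are my own assumptions, not the paper's rule) might look like:

```python
def alpha_schedule(epoch, warmup_epochs=20, alpha_final=1.0, ramp_epochs=20):
    """Illustrative abstention-penalty schedule (not the paper's exact rule).

    For the first `warmup_epochs` epochs training is abstention-free
    (plain k-class cross entropy), so no alpha is used; afterwards alpha
    is ramped linearly up to `alpha_final` over `ramp_epochs` epochs.
    """
    if epoch < warmup_epochs:
        return None  # abstention term disabled during warm-up
    progress = min(1.0, (epoch - warmup_epochs + 1) / ramp_epochs)
    return progress * alpha_final
```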

• The representational power of the DAC and a standard DNN is the same; the difference lies in the optimization induced by the loss function.

• Abstention is used both during training and at inference. This gives the DNN the option to abstain on confusing samples, mitigating the misclassification loss at the cost of an abstention penalty.

### Experiments and Results

DAC as a Learner of Structured Noise

• Structured Noise: Noise can often exhibit a pattern attributable to the training data being corrupted in some non-arbitrary or systematic manner.

• The DAC learns features associated with difficult or confusing samples and learns to abstain based on these features.

• When the DAC encounters data with unknown features, it will abstain from predicting on these samples and hand over the task to an upstream (possibly human) expert.

• Eliminating the data abstained on by the DAC and training a model on the resulting cleaner set gives a significant performance boost.

• DAC can reliably pick up and abstain on samples where the noise is correlated with an underlying feature.

Learning in the presence of Unstructured Noise: DAC as a data cleaner

• Unstructured Noise: Noisy labels that might occur arbitrarily on some fraction of data.

• The performance of the DAC is compared with two state-of-the-art approaches for training with noisy labels on image data: MentorNet (Jiang et al., 2018) and Generalized Cross Entropy (Zhang & Sabuncu, 2018).

• As a best-case model in the data-cleaning scenario, the paper also reports the performance of a hypothetical oracle that has perfect information about the corrupted labels and eliminates only those samples.

• At the point of best validation error, if training error persists on the non-abstained portion of the data, this is likely indicative of label noise; when training with standard cross entropy, these samples need to be removed to get better performance.
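The data-cleaning step can be sketched as a simple filter over a trained DAC's predictions (the helper below and its argmax-based abstention rule are assumptions for illustration; the paper may apply a different abstention criterion):

```python
def clean_dataset(samples, dac_predict, abstain_index):
    """Drop samples on which a trained DAC abstains (illustrative sketch).

    samples:       iterable of (x, y) pairs.
    dac_predict:   function mapping x to a (k+1)-dim softmax output.
    abstain_index: index of the abstention class (k, zero-based).
    """
    kept = []
    for x, y in samples:
        probs = dac_predict(x)
        predicted = max(range(len(probs)), key=probs.__getitem__)
        if predicted != abstain_index:  # keep only non-abstained samples
            kept.append((x, y))
    return kept
```

A downstream model trained with standard cross entropy on the kept samples then plays the role of training on the cleaner set described above.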

### References:

1. Jiang, L., Zhou, Z., Leung, T., Li, L.-J., and Fei-Fei, L. Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In International Conference on Machine Learning, pp. 2309–2318, 2018.
2. Zhang, Z. and Sabuncu, M. R. Generalized cross entropy loss for training deep neural networks with noisy labels. arXiv preprint arXiv:1805.07836, 2018.