Get Another Label? Improving Data Quality and Data Mining Using Multiple, Noisy Labelers

This paper addresses the repeated acquisition of labels for data items when the labeling is imperfect. We examine the improvement (or lack thereof) in data quality via repeated labeling, and focus especially on the improvement of training labels for supervised induction. With the outsourcing of small tasks becoming easier, for example via Rent-A-Coder or Amazon's Mechanical Turk, it is often possible to obtain less-than-expert labeling at low cost. With low-cost labeling, preparing the unlabeled part of the data can become considerably more expensive than labeling. We present repeated-labeling strategies of increasing complexity, and report four main results. (i) Repeated labeling can improve label quality and model quality, but not always. (ii) When labels are noisy, repeated labeling can be preferable to single labeling even in the traditional setting where labels are not particularly cheap. (iii) As soon as processing the unlabeled data is not free, even the simple strategy of labeling everything multiple times can give a considerable advantage. (iv) Repeatedly labeling a carefully chosen set of points is generally preferable, and we present a robust technique that combines different notions of uncertainty to select data points for which quality should be improved. The bottom line: the results show clearly that when labeling is not perfect, selective acquisition of multiple labels is a strategy that data miners should have in their repertoire; for certain label-quality/cost regimes, the benefit is substantial.
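Result (i) can be illustrated with a small calculation. The sketch below is not taken from the paper's text above; it assumes binary labels, labelers that err independently of one another, a single shared accuracy p for every labeler, and a simple majority vote over an odd number of labels. The function name majority_vote_quality is hypothetical. Under those assumptions, the probability that the voted label is correct is a binomial tail sum.

```python
from math import comb


def majority_vote_quality(p: float, n_labels: int) -> float:
    """Probability that a majority vote over n_labels labels is correct.

    Assumptions (illustrative, not from the source): binary labels,
    independent labelers, each correct with the same probability p,
    and an odd n_labels so that ties cannot occur.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if n_labels < 1 or n_labels % 2 == 0:
        raise ValueError("n_labels must be a positive odd integer")

    # The vote is correct when more than half of the labels are correct.
    k_min = n_labels // 2 + 1
    return sum(
        comb(n_labels, k) * p**k * (1.0 - p) ** (n_labels - k)
        for k in range(k_min, n_labels + 1)
    )


if __name__ == "__main__":
    # Compare a single label against 3, 5, and 11 labels per item.
    for p in (0.4, 0.5, 0.7, 0.9):
        row = [majority_vote_quality(p, n) for n in (1, 3, 5, 11)]
        print(p, [round(q, 3) for q in row])
```

With p = 0.7, three labels raise the quality of the voted label from 0.7 to about 0.784. With p = 0.5 the quality stays at 0.5 however many labels are collected, and with p = 0.4 three labels lower it to about 0.352, which is one way to read "but not always" under these simplifying assumptions.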