Ensemble Learning from Crowds

Traditional learning from crowdsourced labeled data consists of two stages: inferring true labels for instances from their multiple noisy labels, and then building a learning model from the instances with the inferred labels. This straightforward two-stage scheme suffers from two weaknesses: (1) the accuracy of label inference may be very low, and (2) useful information may be lost during inference. In this paper, we propose a novel ensemble method for learning from crowds. The proposed method is a meta-learning scheme: it first uses a bootstrapping process to create $M$ sub-datasets from the original crowdsourced labeled dataset. In each sub-dataset, every instance is duplicated with different weights according to the distribution and class memberships of its multiple noisy labels, and a base classifier is then trained on this extended sub-dataset. Finally, unlabeled instances are predicted by aggregating the outputs of the $M$ base classifiers. Because the proposed method eliminates the inference procedure and uses the full dataset to train its models, it preserves as much useful information for learning as possible. Experimental results on nine simulated and two real-world crowdsourcing datasets consistently show that the proposed ensemble method significantly outperforms five state-of-the-art methods.
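
To make the scheme concrete, below is a minimal Python sketch, assuming scikit-learn decision trees as base classifiers and a simple illustrative weighting rule: each bootstrapped instance is duplicated once per class that appears among its noisy labels, weighted by the fraction of workers voting for that class. The function names and the exact weighting formula are assumptions for illustration, not the paper's specification.

import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def train_crowd_ensemble(X, noisy_labels, M=10, seed=0):
    """X: array of shape (n, d); noisy_labels: list of n lists of worker labels."""
    rng = np.random.default_rng(seed)
    n = len(X)
    classifiers = []
    for _ in range(M):
        idx = rng.integers(0, n, size=n)  # bootstrap a sub-dataset (sampling with replacement)
        Xe, ye, we = [], [], []
        for i in idx:
            counts = Counter(noisy_labels[i])
            total = sum(counts.values())
            # Duplicate the instance once per observed class, weighted by the
            # fraction of workers that voted for that class (an assumed rule).
            for cls, c in counts.items():
                Xe.append(X[i])
                ye.append(cls)
                we.append(c / total)
        clf = DecisionTreeClassifier(random_state=0)
        clf.fit(np.asarray(Xe), np.asarray(ye), sample_weight=np.asarray(we))
        classifiers.append(clf)
    return classifiers

def predict_crowd_ensemble(classifiers, X_new):
    # Aggregate the M base classifiers by unweighted majority vote.
    votes = np.stack([clf.predict(X_new) for clf in classifiers])  # shape (M, m)
    return np.array([Counter(col).most_common(1)[0][0] for col in votes.T])

Any base learner accepting per-instance sample weights could replace the decision tree, and the final aggregation could weight the base classifiers rather than vote uniformly; the abstract leaves both choices open.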
