Practical Cost-Conscious Active Learning for Data Annotation in Annotator-Initiated Environments

Robbie A. Haertel
Department of Computer Science, BYU
Doctor of Philosophy

Many projects aim to augment raw data with annotations that increase the usefulness of the data. The number of these projects is growing rapidly, and in the age of “big data” the amount of data to be annotated within each project is growing as well. One common use of such data is supervised machine learning, which requires labeled data to train a predictive model. Annotation is often very expensive, particularly for structured data. The purpose of this dissertation is to explore methods of reducing the cost of creating such data sets, including annotated text corpora. We focus on active learning to address the annotation problem. Active learning employs models trained using machine learning to identify the instances in the data that are most informative and least costly. We introduce novel techniques for adapting vanilla active learning to situations in which data instances vary in benefit and cost, annotators request work “on-demand,” and there are multiple, fallible annotators of differing accuracy and cost. To account for data instances of varying cost, we build a model of cost from real annotation data from a user study. We also introduce a novel cost-conscious active learning algorithm, which we call return-on-investment (ROI), that selects the instances for annotation that offer the most benefit per unit cost. To address the issue of annotators who request instances “on-demand,” we develop a parallel, “no-wait” framework that performs computation while the annotator is annotating. As a result, annotators need not wait for the computer to determine the best instance for them to annotate, a common problem with existing approaches.
Finally, we introduce a Bayesian model designed to simultaneously infer ground truth annotations from noisy annotations, infer each individual annotator's accuracy, and predict its own accuracy on unseen data, without the use of a held-out set. We extend ROI-based active learning and our annotation framework to handle multiple annotators using this model. As a whole, our work shows that the techniques introduced in this dissertation reduce the cost of annotation in scenarios that are more true-to-life than those considered in previous research.
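
The return-on-investment idea, selecting the instance with the most benefit per unit cost, can be illustrated with a minimal sketch. This is not the dissertation's implementation: the `Candidate` class, the `select_by_roi` function, and the benefit and cost numbers are hypothetical and chosen only for illustration. In practice the benefit would come from the active learner (for example, an uncertainty score) and the cost from a learned cost model.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """An unlabeled instance with hypothetical benefit and cost estimates."""
    instance_id: str
    estimated_benefit: float  # assumed: e.g. model uncertainty about the instance
    estimated_cost: float     # assumed: e.g. predicted annotation time in seconds


def select_by_roi(candidates):
    """Return the candidate with the highest estimated benefit per unit cost."""
    # Skip candidates whose cost estimate is not positive to avoid division by zero.
    usable = [c for c in candidates if c.estimated_cost > 0]
    if not usable:
        raise ValueError("no candidate has a positive estimated cost")
    return max(usable, key=lambda c: c.estimated_benefit / c.estimated_cost)


# Illustrative pool; the values are made up.
pool = [
    Candidate("sent-1", estimated_benefit=0.9, estimated_cost=60.0),
    Candidate("sent-2", estimated_benefit=0.5, estimated_cost=20.0),
    Candidate("sent-3", estimated_benefit=0.2, estimated_cost=5.0),
]
best = select_by_roi(pool)
print(best.instance_id)
```

In this toy pool the instance with the highest raw benefit is also the most expensive, so a ratio-based criterion prefers a cheaper instance with a lower absolute benefit.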
