Modeling of learning curves with applications to POS tagging

Abstract An algorithm is introduced to estimate the evolution of learning curves over an entire training database from the results obtained on a portion of it, using a functional strategy. Once a point in the process, called the prediction level, has been passed, we iteratively approximate the sought value at the desired time, independently of the learning technique used. The proposal is formally correct with respect to our working hypotheses and includes a reliable proximity condition. This allows the user to fix a convergence threshold with respect to the accuracy finally achievable, which extends the concept of stopping criterion and seems to be effective even in the presence of distorting observations. Our aim is to evaluate the training effort, supporting decision making in order to reduce the need for both human and computational resources during the learning process. The proposal is of interest in at least three operational procedures. The first is the anticipation of accuracy gain, with the purpose of measuring how much work is needed to achieve a certain degree of performance. The second is the comparison of efficiency between systems at training time, with the objective of completing this task only for the one that best suits our requirements. The third is the customization of systems, since the prediction of accuracy lets us estimate in advance the impact of settings on both the performance and the development costs. Using the generation of part-of-speech taggers as an example application, the experimental results are consistent with our expectations.
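The general idea of extrapolating a learning curve from a portion of the training data can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the power-law family acc(n) = a - b * n^(-c), the grid search over c, the function names, and the simple rule that stops once two successive estimates at the target size differ by less than a threshold tau. It is only meant to show how a fitted functional form and a convergence threshold could interact.

```python
def fit_power_law(sizes, accs):
    """Fit acc(n) = a - b * n**(-c): grid search over c, least squares for a and b.

    The power-law family and the grid are illustrative assumptions.
    """
    best = None
    for k in range(1, 201):  # hypothetical grid: c = 0.01, 0.02, ..., 2.00
        c = k / 100
        xs = [n ** -c for n in sizes]
        mx = sum(xs) / len(xs)
        my = sum(accs) / len(accs)
        sxx = sum((x - mx) ** 2 for x in xs)
        sxy = sum((x - mx) * (y - my) for x, y in zip(xs, accs))
        slope = sxy / sxx
        a, b = my - slope * mx, -slope  # acc = a - b * x, so slope = -b
        sse = sum((a - b * x - y) ** 2 for x, y in zip(xs, accs))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    return best[1:]


def predict(params, n):
    """Evaluate the fitted curve at training-set size n."""
    a, b, c = params
    return a - b * n ** -c


def extrapolate(sizes, accs, target, tau, min_points=4):
    """Refit on growing prefixes of the observations.

    Returns (level, estimate), where level is the number of observations used
    when two successive estimates at `target` first differ by less than tau.
    This stopping rule is a simplification chosen for the sketch.
    """
    prev = None
    for level in range(min_points, len(sizes) + 1):
        est = predict(fit_power_law(sizes[:level], accs[:level]), target)
        if prev is not None and abs(est - prev) < tau:
            return level, est
        prev = est
    return len(sizes), prev
```

For instance, given accuracies measured at training sizes of 5,000, 10,000, ... words, `extrapolate(sizes, accs, target=800000, tau=1e-3)` would return the number of observations consumed and the estimated accuracy at 800,000 words. With noisy observations a more robust fit and a stricter stopping rule would likely be needed.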
