Forgetting Exceptions is Harmful in Language Learning

We show that in language learning, contrary to received wisdom, keeping exceptional training instances in memory can be beneficial for generalization accuracy. We investigate this phenomenon empirically on a selection of benchmark natural language processing tasks: grapheme-to-phoneme conversion, part-of-speech tagging, prepositional-phrase attachment, and base noun phrase chunking. In a first series of experiments, we combine memory-based learning with training set editing techniques, in which instances are removed from memory based on their typicality and class prediction strength. Results show that editing out exceptional instances (those with low typicality or low class prediction strength) tends to harm generalization accuracy. In a second series of experiments, we compare memory-based learning and decision-tree learning methods on the same selection of tasks, and find that decision-tree learning often performs worse than memory-based learning. Moreover, the decrease in performance can be linked to the degree of abstraction from exceptions (i.e., pruning or eagerness). We provide explanations for both results in terms of the properties of the natural language processing tasks and the learning algorithms.
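
To make the editing setup concrete, here is a minimal sketch of a 1-nearest-neighbor memory-based learner whose memory is edited by class prediction strength. It rests on several assumptions that the abstract does not confirm: class prediction strength is taken to be the fraction of times a stored instance, when it is the nearest neighbor of another training instance, has the same class as that instance; distance is a plain feature-overlap count; and the toy data, function names, and the 0.5 threshold are all hypothetical. It illustrates the idea only and is not the experimental procedure of the paper.

```python
from collections import Counter

def overlap_distance(a, b):
    """Number of feature positions at which two symbolic instances differ."""
    return sum(x != y for x, y in zip(a, b))

def nearest_index(query, instances, skip=None):
    """Index of the stored instance closest to query (ties go to the lowest index)."""
    candidates = [i for i in range(len(instances)) if i != skip]
    return min(candidates, key=lambda i: overlap_distance(query, instances[i][0]))

def class_prediction_strength(instances):
    """Leave-one-out estimate (assumed definition): of the times an instance is
    the nearest neighbor of another training instance, the fraction of times
    the two share a class. None means the instance was never a nearest neighbor."""
    used, correct = Counter(), Counter()
    for i, (features, label) in enumerate(instances):
        j = nearest_index(features, instances, skip=i)
        used[j] += 1
        if instances[j][1] == label:
            correct[j] += 1
    return [correct[i] / used[i] if used[i] else None for i in range(len(instances))]

def edit_low_cps(instances, threshold):
    """Drop instances whose class prediction strength falls below threshold."""
    cps = class_prediction_strength(instances)
    return [inst for inst, s in zip(instances, cps) if s is None or s >= threshold]

def classify(query, instances):
    """1-nearest-neighbor classification over the stored instances."""
    return instances[nearest_index(query, instances)][1]

# Hypothetical toy memory: the last instance is an exception living among "X" items.
data = [
    (("a", "b", "c"), "X"),
    (("a", "b", "d"), "X"),
    (("a", "e", "e"), "X"),
    (("f", "g", "h"), "Y"),
    (("f", "g", "i"), "Y"),
    (("a", "e", "f"), "Y"),
]
edited = edit_low_cps(data, threshold=0.5)
query = ("a", "e", "f")
print(classify(query, data), classify(query, edited))
```

On this toy memory, editing removes both the exception and the regular instance next to it, so a query that matches the exception is classified differently once the exception has been forgotten.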
