Incorporating Knowledge in Natural Language Learning: A Case Study

Incorporating external information during a learning process is expected to improve its efficiency. We study a method for incorporating noun-class information in the context of learning to resolve Prepositional Phrase Attachment (PPA) ambiguity. This is done within a recently introduced architecture, SNOW, a sparse network of threshold gates utilizing the Winnow learning algorithm, which has already been demonstrated to perform remarkably well on a number of natural language learning tasks. The knowledge sources used were compiled from the WordNet database for general linguistic purposes, irrespective of the PPA problem, and are incorporated into the learning algorithm by enriching its feature space. We study two strategies for using the enriched features and the effects of using class information at different granularities, as well as randomly generated knowledge which serves as a control set. Incorporating external knowledge sources within SNOW yields a statistically significant performance improvement. In addition, we find an interesting relation between the granularity of the knowledge sources used and the magnitude of the improvement. The encouraging results with noun-class data provide a motivation for carrying out more work on generating better linguistic knowledge sources.

1 Introduction

A variety of inductive learning techniques have been used in recent years in natural language processing. Given a large training corpus as input, and relying on statistical properties of language usage, statistics-based and machine learning algorithms are used to induce a classifier which can then resolve a disambiguation task. Applications of this line of research include ambiguity resolution at different levels of sentence analysis: part-of-speech tagging, word-sense disambiguation, word selection in machine translation, context-sensitive spelling correction, word selection in speech recognition, and identification of discourse markers.

Many natural language inferences, however, seem to rely heavily on semantic and pragmatic knowledge about the world and the language that is not explicit in the training data. The ability to incorporate knowledge from other sources of information, be it knowledge acquired across modalities, prepared by a teacher, or supplied by an expert, is crucial for going beyond low-level natural language inferences. Within machine learning, the use of knowledge is often limited to constraining the hypothesis space (either before learning or by probabilistically biasing the search for the hypothesis) or to techniques such as EBL (DeJong, 1981; Mitchell et al., 1986; DeJong and Mooney, 1986), which rely on explicit domain knowledge that can be used to explain (usually, prove deductively) the observed examples. The knowledge needed to perform language-understanding tasks, however, does not exist in any explicit form that is amenable to techniques of this sort, and many believe that it will never be available in such explicit forms.

An enormous amount of useful "knowledge" may be available, though. Pieces of information that may prove valuable in language-understanding tasks include: the root form of a verb; a list of nouns that stand in some relation (e.g., are all countries) and can thus appear in similar contexts; a list of verbs that can be followed by a food item; a list of items you can see through; a list of things that are furniture; a list of dangerous things; and so on (a sketch of how such class lists might enter a feature space follows below).
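As a concrete, entirely hypothetical illustration of the abstract's strategy of "enriching its feature space", the Python sketch below encodes a PPA instance as the standard 4-tuple (verb, noun1, preposition, noun2) and adds noun-class features alongside the lexical ones. The NOUN_CLASSES table and the particular feature conjunctions are assumptions for illustration; the paper's actual WordNet-derived classes and feature encoding may differ.

```python
# A hypothetical sketch of feature-space enrichment for PPA. Each example
# is the standard 4-tuple (verb, noun1, preposition, noun2); NOUN_CLASSES
# stands in for noun classes compiled from WordNet. The class sets and
# the particular feature conjunctions are illustrative assumptions only.

NOUN_CLASSES = {
    "spoon": ["artifact", "instrument"],  # assumed class memberships
    "salad": ["food"],
}

def extract_features(verb, noun1, prep, noun2):
    """Return sparse Boolean features: lexical conjunctions plus
    class-based variants with nouns abstracted to their classes."""
    features = [
        f"v={verb}", f"n1={noun1}", f"p={prep}", f"n2={noun2}",
        f"v={verb}&p={prep}", f"p={prep}&n2={noun2}",
    ]
    # Enrichment: repeat the noun-bearing conjunctions at class level.
    for cls in NOUN_CLASSES.get(noun1, []):
        features.append(f"class(n1)={cls}&p={prep}")
    for cls in NOUN_CLASSES.get(noun2, []):
        features.append(f"p={prep}&class(n2)={cls}")
    return features

print(extract_features("ate", "salad", "with", "spoon"))
```

The point of the class-level features is generalization: "ate salad with a spoon" and "stirred soup with a fork" can then share a feature such as p=with&class(n2)=artifact even though the head nouns differ.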
This rich collection of information pieces does not form any domain theory to speak of, and it cannot be acquired from a single source of information. This knowledge is noisy, incomplete and ambiguous. While some of it may be acquired from text, a lot of it may only be acquired from other modalities, such as those used by humans. We believe that the integration of such knowledge is essential if NLP is to attain high-level natural-language inference. Contrary to this intuition, experiments in text retrieval and natural language processing have not shown much improvement when incorporating information of the kind humans seem to use (Krovetz and Croft, 1992; Kosmynin and Davidson, 1996; Karov and Edelman, 1996).
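Finally, a word on the learning machinery the abstract names: SNOW is a sparse network of threshold gates trained with Winnow (Littlestone, 1987), a mistake-driven multiplicative-update algorithm. The sketch below is a minimal single-gate version under assumed defaults (promotion factor 2, threshold equal to the feature count); it is not the paper's implementation, which links many such gates in a sparse network over the enriched feature space.

```python
# A minimal single-gate Winnow sketch (multiplicative promotion/demotion).
# alpha and theta are assumed illustrative defaults, not the paper's values.

def winnow_train(examples, n_features, alpha=2.0, theta=None):
    """Train one Winnow threshold gate on sparse Boolean examples.

    examples   -- iterable of (active_feature_indices, label) pairs
    n_features -- size of the (possibly enriched) feature space
    """
    theta = float(n_features) if theta is None else theta
    w = [1.0] * n_features            # Winnow starts all weights at 1

    for active, label in examples:
        score = sum(w[i] for i in active)
        predicted = 1 if score >= theta else 0
        if predicted == label:
            continue                   # mistake-driven: update on errors only
        factor = alpha if label == 1 else 1.0 / alpha
        for i in active:               # promote on misses, demote on false hits
            w[i] *= factor
    return w

# e.g. a toy run over a 4-feature space:
weights = winnow_train([({0, 2}, 1), ({1, 3}, 0)], n_features=4)
```

Because updates are multiplicative and only mistake-driven, Winnow's mistake bound scales well when most features are irrelevant, which is what makes it attractive for the very large, sparse feature spaces that class enrichment produces.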

References

[1] Beatrice Santorini et al. Building a Large Annotated Corpus of English: The Penn Treebank. CL, 1993.
[2] Michael Collins et al. Prepositional Phrase Attachment through a Backed-off Model. VLC@ACL, 1995.
[3] N. Littlestone. Learning Quickly When Irrelevant Attributes Abound: A New Linear-Threshold Algorithm. 28th Annual Symposium on Foundations of Computer Science, 1987.
[4] James Pustejovsky et al. CoreLex: Systematic Polysemy and Underspecification. 1998.
[5] Walter Daelemans et al. Resolving PP Attachment Ambiguities with Memory-Based Learning. CoNLL, 1997.
[6] Philip Resnik et al. WordNet and Distributional Analysis: A Class-based Approach to Lexical Discovery. AAAI, 1992.
[7] Shimon Edelman et al. Learning Similarity-based Word Sense Disambiguation from Sparse Data. VLC@COLING, 1996.
[8] Nick Littlestone et al. Redundant Noisy Attributes, Attribute Errors, and Linear-Threshold Learning Using Winnow. COLT, 1991.
[9] Ian Davidson et al. Using Background Contextual Knowledge for Document Representation. PODP, 1996.
[10] Dan Roth et al. Applying Winnow to Context-Sensitive Spelling Correction. ICML, 1996.
[11] Manfred K. Warmuth et al. Exponentiated Gradient versus Gradient Descent for Linear Predictors. Inf. Comput., 1997.
[12] Dan Roth et al. Learning to Reason. JACM, 1994.
[13] Philip Resnik et al. Disambiguating Noun Groupings with Respect to WordNet Senses. VLC@ACL, 1995.
[14] Leslie G. Valiant et al. Circuits of the Mind. 1994.
[15] Avrim Blum et al. Learning Boolean Functions in an Infinite Attribute Space. STOC, 1990.
[16] Ido Dagan et al. Mistake-Driven Learning in Text Categorization. EMNLP, 1997.
[17] Adwait Ratnaparkhi et al. A Maximum Entropy Model for Prepositional Phrase Attachment. HLT, 1994.
[18] George A. Miller et al. Introduction to WordNet: An On-line Lexical Database. 1990.
[19] Gerald DeJong et al. Generalizations Based on Explanations. IJCAI, 1981.
[20] Manfred K. Warmuth et al. The Weighted Majority Algorithm. Inf. Comput., 1994.
[21] Eric Brill et al. A Rule-Based Approach to Prepositional Phrase Attachment Disambiguation. COLING, 1994.
[22] W. Bruce Croft et al. Lexical Ambiguity and Information Retrieval. TOIS, 1992.
[23] Dan Roth et al. Part of Speech Tagging Using a Network of Linear Separators. ACL, 1998.