Learning text analysis rules for domain-specific natural language processing

An enormous amount of knowledge is needed to infer the meaning of unrestricted natural language. The problem can be reduced to a manageable size by restricting attention to a specific {\em domain}: a corpus of texts together with a predefined set of {\em concepts} that are of interest to that domain. Two widely different domains illustrate this domain-specific approach. The first is a collection of Wall Street Journal articles in which the target concept is management succession events: persons moving into or out of corporate management positions. The second is a collection of hospital discharge summaries in which the target concepts are various classes of diagnoses and symptoms.

The goal of an information extraction system is to identify references to the concept of interest for a particular domain. A key knowledge source for this purpose is a set of text analysis rules based on the vocabulary, semantic classes, and writing style peculiar to the domain.

This thesis presents CRYSTAL, an implemented system that automatically induces domain-specific text analysis rules from training examples. CRYSTAL learns rules that approach the performance of hand-coded rules, are robust in the face of noise and inadequate features, and require only a modest amount of training data. CRYSTAL belongs to the class of machine learning methods known as covering algorithms, and introduces a novel control strategy whose time and space complexities are independent of the number of features, allowing it to navigate efficiently through an extremely large space of possible rules.

CRYSTAL also demonstrates that an expressive rule representation is essential for high-performance, robust text analysis rules. While simple rules are adequate to capture the most salient regularities in the training data, high performance is achieved only when rules are expressive enough to reflect the subtlety and variability of unrestricted natural language.
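The covering-algorithm idea behind this kind of rule induction can be illustrated with a minimal sketch: start from a maximally specific rule built from one positive training instance, greedily relax constraints while the rule's error over the training data stays within tolerance, then remove the instances the accepted rule covers and repeat. This is only a toy illustration of the general technique, not CRYSTAL's actual rule representation or control strategy; the feature names (`subj`, `verb`, `obj`) and the helper functions are hypothetical.

```python
def covers(rule, inst):
    """A rule covers an instance if every constraint in the rule is satisfied."""
    return all(inst.get(k) == v for k, v in rule.items())

def error_rate(rule, data):
    """Fraction of instances covered by the rule that are negative examples."""
    covered = [label for feats, label in data if covers(rule, feats)]
    if not covered:
        return 1.0
    return covered.count(False) / len(covered)

def induce_rules(data, max_error=0.0):
    """Toy covering algorithm: data is a list of (feature-dict, is_positive)."""
    rules = []
    uncovered = [(f, l) for f, l in data if l]  # positives not yet covered
    while uncovered:
        seed, _ = uncovered[0]
        rule = dict(seed)  # maximally specific rule: all features as constraints
        relaxed = True
        while relaxed:
            relaxed = False
            # greedily drop one constraint at a time while error stays tolerable
            for k in list(rule):
                candidate = {kk: vv for kk, vv in rule.items() if kk != k}
                if candidate and error_rate(candidate, data) <= max_error:
                    rule = candidate
                    relaxed = True
                    break
        rules.append(rule)
        uncovered = [(f, l) for f, l in uncovered if not covers(rule, f)]
    return rules
```

On a toy management-succession dataset, the two positive instances below generalize to a single rule constraining only the object class, which still covers no negatives:

```python
data = [
    ({"subj": "PERSON", "verb": "named", "obj": "POSITION"}, True),
    ({"subj": "PERSON", "verb": "resigned", "obj": "POSITION"}, True),
    ({"subj": "COMPANY", "verb": "reported", "obj": "EARNINGS"}, False),
]
induce_rules(data)  # → [{"obj": "POSITION"}]
```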
