Learning text analysis rules for domain-specific natural language processing

An enormous amount of knowledge is needed to infer the meaning of unrestricted natural language. The problem can be reduced to a manageable size by restricting attention to a specific {\em domain}: a corpus of texts together with a predefined set of {\em concepts} that are of interest to that domain. Two widely different domains illustrate this domain-specific approach. The first is a collection of Wall Street Journal articles in which the target concept is management succession events: persons moving into or out of corporate management positions. The second is a collection of hospital discharge summaries in which the target concepts are various classes of diagnoses and symptoms.

The goal of an information extraction system is to identify references to the concept of interest for a particular domain. A key knowledge source for this purpose is a set of text analysis rules based on the vocabulary, semantic classes, and writing style peculiar to the domain.

This thesis presents CRYSTAL, an implemented system that automatically induces domain-specific text analysis rules from training examples. CRYSTAL learns rules that approach the performance of hand-coded rules, are robust in the face of noise and inadequate features, and require only a modest amount of training data. CRYSTAL belongs to the class of machine learning methods known as covering algorithms, and introduces a novel control strategy whose time and space complexities are independent of the number of features, allowing it to navigate efficiently through an extremely large space of possible rules.

CRYSTAL also demonstrates that an expressive rule representation is essential for high-performance, robust text analysis rules. While simple rules are adequate to capture the most salient regularities in the training data, high performance is achieved only when rules are expressive enough to reflect the subtlety and variability of unrestricted natural language.
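The covering-algorithm idea behind this kind of rule induction can be illustrated with a minimal sketch: start from a maximally specific rule built from one positive training instance, greedily relax constraints while the rule's error over the training data stays within tolerance, then remove the instances the accepted rule covers and repeat. This is only a toy illustration of the general technique, not CRYSTAL's actual rule representation or control strategy; the feature names (`subj`, `verb`, `obj`) and the helper functions are hypothetical.

```python
def covers(rule, inst):
    """A rule covers an instance if every constraint in the rule is satisfied."""
    return all(inst.get(k) == v for k, v in rule.items())

def error_rate(rule, data):
    """Fraction of instances covered by the rule that are negative examples."""
    covered = [label for feats, label in data if covers(rule, feats)]
    if not covered:
        return 1.0
    return covered.count(False) / len(covered)

def induce_rules(data, max_error=0.0):
    """Toy covering algorithm: data is a list of (feature-dict, is_positive)."""
    rules = []
    uncovered = [(f, l) for f, l in data if l]  # positives not yet covered
    while uncovered:
        seed, _ = uncovered[0]
        rule = dict(seed)  # maximally specific rule: all features as constraints
        relaxed = True
        while relaxed:
            relaxed = False
            # greedily drop one constraint at a time while error stays tolerable
            for k in list(rule):
                candidate = {kk: vv for kk, vv in rule.items() if kk != k}
                if candidate and error_rate(candidate, data) <= max_error:
                    rule = candidate
                    relaxed = True
                    break
        rules.append(rule)
        uncovered = [(f, l) for f, l in uncovered if not covers(rule, f)]
    return rules
```

On a toy management-succession dataset, the two positive instances below generalize to a single rule constraining only the object class, which still covers no negatives:

```python
data = [
    ({"subj": "PERSON", "verb": "named", "obj": "POSITION"}, True),
    ({"subj": "PERSON", "verb": "resigned", "obj": "POSITION"}, True),
    ({"subj": "COMPANY", "verb": "reported", "obj": "EARNINGS"}, False),
]
induce_rules(data)  # → [{"obj": "POSITION"}]
```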
