CRYSTAL: Learning Domain-specific Text Analysis Rules

An enormous amount of knowledge is needed to infer the meaning of unrestricted natural language. The problem can be reduced to a manageable size by restricting attention to a predefined set of concepts in a specific domain. Two widely different domains are used to illustrate this domain-specific approach. One domain is a collection of Wall Street Journal articles in which the target concept is management succession events: identifying persons moving into and out of corporate management positions. A second domain is a collection of hospital discharge summaries in which the target concepts are various classes of diagnoses and symptoms. The goal of an information extraction system is to identify references to the concept of interest for a particular domain. Each domain needs a set of text extraction rules based on the vocabulary, semantic classes, and writing style peculiar to the domain and the target concept. This paper presents CRYSTAL, an implemented system that automatically induces domain-specific text analysis rules from training examples. CRYSTAL learns rules that approach the performance of hand-coded rules, are robust in the face of noise and inadequate features, and require only a modest number of training examples. CRYSTAL belongs to the class of machine learning algorithms called covering algorithms and presents a novel control strategy whose time and space complexity are independent of the number of features. CRYSTAL navigates efficiently through an extremely large space of possible rules. CRYSTAL also demonstrates that an expressive rule representation is essential for high-performance, robust text extraction. While simple rules are adequate to capture the most salient regularities in the training data, the subtlety and variability of unrestricted natural language require rich expressiveness in the rules for high performance.
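To make the covering-algorithm framing concrete, the Python sketch below shows a generic bottom-up covering loop of the kind the abstract describes: each positive training instance seeds a maximally specific rule, and pairs of rules are repeatedly unified into their most specific common generalization, with a merge accepted only if the generalized rule's error rate on the training set stays within a tolerance. The flat attribute-value rule representation, the attribute names (subj_class, verb, obj_class), and the greedy merge order are illustrative assumptions only; CRYSTAL itself operates over richer concept-node definitions with syntactic and semantic constraints.

    # A minimal sketch of bottom-up covering-style rule induction.
    # Not CRYSTAL's implementation: representation and control are simplified.
    from itertools import combinations

    def unify(r1, r2):
        # Most specific common generalization: keep only the constraints
        # that both rules share with identical values.
        return {k: v for k, v in r1.items() if r2.get(k) == v}

    def covers(rule, inst):
        # A rule covers an instance if every constraint is satisfied.
        return all(inst.get(k) == v for k, v in rule.items())

    def error_rate(rule, positives, negatives):
        # Fraction of covered training instances that are negatives.
        fp = sum(covers(rule, n) for n in negatives)
        covered = fp + sum(covers(rule, p) for p in positives)
        return fp / covered if covered else 1.0

    def induce(positives, negatives, max_error=0.0):
        # Seed one maximally specific rule per positive training instance.
        rules = [dict(p) for p in positives]
        merged = True
        while merged:
            merged = False
            # Greedily unify pairs; accept the first low-error merge.
            for r1, r2 in combinations(rules, 2):
                g = unify(r1, r2)
                if g and error_rate(g, positives, negatives) <= max_error:
                    rules.remove(r1)
                    rules.remove(r2)
                    rules.append(g)
                    merged = True
                    break
        return rules

    if __name__ == "__main__":
        # Toy "management succession" instances: attributes of a parsed clause.
        pos = [
            {"subj_class": "PERSON", "verb": "resign", "obj_class": None},
            {"subj_class": "PERSON", "verb": "resign", "obj_class": "POSITION"},
        ]
        neg = [
            {"subj_class": "COMPANY", "verb": "resign", "obj_class": None},
        ]
        for rule in induce(pos, neg):
            print(rule)  # -> {'subj_class': 'PERSON', 'verb': 'resign'}

On this toy data the two seed rules merge into a single generalization that drops the conflicting object constraint while still excluding the negative instance, which is the essential dynamic of a bottom-up covering learner: generalize only as far as the training errors allow.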
