Adaptive Information Extraction from Text by Rule Induction and Generalisation

(LP)² is a covering algorithm for adaptive Information Extraction (IE) from text. It induces symbolic rules that insert SGML tags into texts, learning from examples in a user-defined tagged corpus. Training proceeds in two steps: first a set of tagging rules is learned; then additional rules are induced to correct mistakes and imprecision in the tagging. Rules are induced by bottom-up generalization of examples in the training corpus, and shallow Natural Language Processing (NLP) knowledge is used in the generalization process. The algorithm has proved successful both scientifically and in applications. Scientifically, experiments on two publicly available corpora report excellent results with respect to the current state of the art. On the application side, a successful industrial IE tool has been based on (LP)², real-world applications have been developed, and licenses have been released to external companies for building further applications. This paper presents (LP)², its experimental results and applications, and discusses the role of shallow NLP in rule induction.
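The first training step described above (a covering loop that generalises seed examples bottom-up into tag-insertion rules) can be illustrated with a toy sketch. This is not the published algorithm or its implementation: the window size, the three-way relaxation (word, coarse lexical class, wildcard), the `token_class` heuristic standing in for shallow NLP, the `stime` tag name, and every function name are assumptions made for illustration. The second step (inducing correction rules) is not shown.

```python
from itertools import product

def token_class(tok):
    """Hypothetical stand-in for shallow NLP: a coarse lexical class."""
    if tok.isdigit():
        return "DIGIT"
    return "CAP" if tok[:1].isupper() else "LOWER"

def window(tokens, i, w):
    """The w tokens before and w tokens after insertion point i (None = padding)."""
    return tuple(tokens[j] if 0 <= j < len(tokens) else None
                 for j in range(i - w, i + w))

def matches(rule, win):
    """Each condition is None (wildcard), ("word", x) or ("class", c)."""
    for cond, tok in zip(rule, win):
        if cond is None:
            continue
        if tok is None:
            return False
        kind, value = cond
        actual = tok if kind == "word" else token_class(tok)
        if actual != value:
            return False
    return True

def generalisations(seed):
    """Bottom-up relaxations of a seed window: each slot keeps its word,
    backs off to its lexical class, or becomes a wildcard."""
    options = [[None] if tok is None
               else [("word", tok), ("class", token_class(tok)), None]
               for tok in seed]
    return product(*options)

def induce(corpus, w=2, max_error=0.0):
    """Covering loop. corpus: list of (tokens, set of gold insertion points)."""
    instances = [(window(toks, i, w), i in gold)
                 for toks, gold in corpus for i in range(len(toks))]
    uncovered = [win for win, positive in instances if positive]
    rules = []
    while uncovered:
        seed, best, best_cov = uncovered[0], None, 0
        for rule in generalisations(seed):
            hits = [positive for win, positive in instances if matches(rule, win)]
            if hits.count(False) / len(hits) > max_error:
                continue  # too many wrong insertions on the training corpus
            cov = sum(matches(rule, u) for u in uncovered)
            if cov > best_cov:  # ties keep the earlier, more specific rule
                best, best_cov = rule, cov
        if best is None:        # no acceptable rule for this seed: skip it
            uncovered = uncovered[1:]
            continue
        rules.append(best)
        # Covering step: drop every positive example the new rule explains.
        uncovered = [u for u in uncovered if not matches(best, u)]
    return rules

def tag(tokens, rules, w=2, label="stime"):
    """Insert <label> before every position matched by some rule."""
    out = []
    for i, tok in enumerate(tokens):
        if any(matches(r, window(tokens, i, w)) for r in rules):
            out.append(f"<{label}>")
        out.append(tok)
    return out

# Tiny invented corpus: the tag goes before the start-time token.
CORPUS = [
    ("The seminar starts at 4 pm".split(), {4}),
    ("Talk at 11 am in room 5".split(), {2}),
    ("Meet at 3 pm today".split(), {2}),
]
RULES = induce(CORPUS)
print(" ".join(tag("Lecture at 2 pm".split(), RULES)))
# prints "Lecture at <stime> 2 pm"
```

On this toy corpus a single rule ("at", then a digit token, then a lower-case token) covers all three positive examples without matching the untagged "5" in "room 5", so the loop stops after one iteration. Enumerating all relaxations is exponential in the window size and is only workable here because the window is tiny.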
