Mining Information Extraction Rules from Datasheets Without Linguistic Parsing

In the context of the Pangea project at IBM, we needed to design an information extraction module to extract information from datasheets. Unlike many machine-learning-based information extraction systems, which require linguistic parsing of the documents, we propose a hybrid approach based on association rule mining and decision tree learning that does not require any linguistic processing. The system can be parameterized in various ways that influence the efficiency of the discovered information extraction rules. Experiments show that the system does not need a large training set to perform well.
