Learning Logic Wrappers for Information Extraction from the Web

This paper discusses a methodology for applying general-purpose first-order inductive learning to extract information from Web documents structured as unranked ordered trees. The methodology is applied to information extraction from real-world HTML page sets that represent product information sheets, an important task in product data integration. It addresses two problems: defining information extraction rules in the form of logic wrappers, and mapping the task of learning these rules to general-purpose first-order inductive learning.
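To make the idea of treating an HTML page as an unranked ordered tree described by logical facts more concrete, the following is a minimal sketch and not the paper's actual encoding. The predicate names `child`, `next`, `tag`, and `content`, the helper names `html_to_facts` and `extract_value`, and the sample product row are all assumptions introduced for illustration. The extraction rule is written by hand here; in the methodology, rules of this kind would be produced by an inductive learner. The sketch also assumes well-formed input in which every element has an explicit closing tag.

```python
from html.parser import HTMLParser


class FactBuilder(HTMLParser):
    """Emit ground facts about an HTML tree (hypothetical predicate names)."""

    def __init__(self):
        super().__init__()
        self.facts = []        # tuples of the form (predicate, arg1, arg2)
        self.stack = [0]       # node 0 is a virtual root
        self.last_child = {}   # parent id -> id of its most recent child
        self.counter = 0

    def _new_node(self, label):
        self.counter += 1
        node, parent = self.counter, self.stack[-1]
        self.facts.append(("child", parent, node))
        if parent in self.last_child:
            # sibling order is what makes the tree "ordered"
            self.facts.append(("next", self.last_child[parent], node))
        self.last_child[parent] = node
        self.facts.append(("tag", node, label))
        return node

    def handle_starttag(self, tag, attrs):
        self.stack.append(self._new_node(tag))

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            node = self._new_node("text")
            self.facts.append(("content", node, text))


def html_to_facts(html):
    builder = FactBuilder()
    builder.feed(html)
    builder.close()
    return builder.facts


def extract_value(facts, feature):
    """Hand-written rule: return the text of the cell that immediately
    follows a sibling cell whose text equals `feature`."""
    def rel(name):
        return [(a, b) for k, a, b in facts if k == name]

    child, nxt, content = rel("child"), rel("next"), dict(rel("content"))
    results = []
    for left, right in nxt:
        left_texts = [content[c] for p, c in child if p == left and c in content]
        if feature in left_texts:
            results += [content[c] for p, c in child if p == right and c in content]
    return results


facts = html_to_facts("<tr><td>Weight</td><td>2 kg</td></tr>")
print(extract_value(facts, "Weight"))  # ['2 kg']
```

In a logic programming setting, the same rule would be a single clause over these predicates, which is the kind of hypothesis a first-order inductive learner searches for given annotated example nodes.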
