Tuples Extraction from HTML Using Logic Wrappers and Inductive Logic Programming

This paper presents an approach for applying inductive logic programming to information extraction from HTML documents structured as unranked ordered trees. We consider information extraction from Web resources that are abstracted as providing sets of tuples. Our approach is based on defining a new class of wrappers as a special class of logic programs – logic wrappers. The approach is demonstrated with examples and experimental results in the area of collecting product information, highlighting the advantages and the limitations of the method.
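To make the setting concrete, the following is a minimal sketch of a hand-written wrapper that extracts tuples from an HTML document viewed as an unranked ordered tree. It is not the authors' implementation and involves no learning. The names `Node`, `TreeBuilder`, `parse`, `extract_tuples`, and the sample product table are all assumptions introduced for illustration. The extraction rule is written in Python but mirrors how a logic-program clause over tree relations (child, next sibling, tag) might read.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import Iterator, List, Optional, Tuple


@dataclass(eq=False, repr=False)  # identity comparison; avoids parent/child recursion
class Node:
    tag: str
    text: str = ""
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)


class TreeBuilder(HTMLParser):
    """Builds an unranked ordered tree: any number of children, order preserved.

    Assumes well-formed input in which every start tag has a matching end tag.
    """

    def __init__(self) -> None:
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, parent=self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def parse(html: str) -> Node:
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def descendants(node: Node) -> Iterator[Node]:
    """Yields all descendants of a node in document order."""
    for child in node.children:
        yield child
        yield from descendants(child)


def next_sibling(node: Node) -> Optional[Node]:
    siblings = node.parent.children
    i = siblings.index(node)
    return siblings[i + 1] if i + 1 < len(siblings) else None


def extract_tuples(root: Node) -> List[Tuple[str, str]]:
    """Hypothetical wrapper, readable as the clause:

    tuple(N, P) :- tag(R, tr), child(R, A), tag(A, td),
                   next(A, B), tag(B, td), text(A, N), text(B, P).
    """
    results = []
    for row in descendants(root):
        if row.tag != "tr":
            continue
        for a in row.children:
            b = next_sibling(a)
            if a.tag == "td" and b is not None and b.tag == "td":
                results.append((a.text, b.text))
    return results


# Illustrative product table (made-up data).
SAMPLE = (
    "<table>"
    "<tr><td>Laptop</td><td>999</td></tr>"
    "<tr><td>Mouse</td><td>25</td></tr>"
    "</table>"
)

if __name__ == "__main__":
    print(extract_tuples(parse(SAMPLE)))
```

In this sketch the rule is fixed by hand. In the approach described above, a rule of this kind would instead be induced from annotated example documents by inductive logic programming.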
