The Web is now a huge information repository with a rich semantic structure that, however, is addressed primarily to human understanding rather than to automated processing. Collecting product information from the Web and organizing it for automated machine processing is a primary task of software shopping agents and has received considerable attention in recent years. In this paper we assume that product information is represented as a set of feature-value pairs contained in an HTML product information sheet, usually formatted with HTML tables. We present a technique for learning rules that extract product information from such sheets. The technique exploits the fact that the Web pages presenting a given producer's product information are generated on the fly from the producer's database and therefore exhibit a uniform structure. Consequently, once a human user has manually extracted a few information items, a general-purpose inductive learner (we used FOIL in our experiments) can learn extraction rules that are then applied to the current sheet and to other product information sheets to extract further items automatically. The input to the learning algorithm is a relational description of the HTML document tree that defines the types of the tree nodes and the relationships between them. The approach is demonstrated with examples, experimental results, and software tools.
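
To make the idea of a relational description of the HTML document tree concrete, the following is a minimal sketch in Python. It is not the paper's implementation: the predicate names `tag`, `child`, `next`, and `text`, the sample sheet `SHEET`, and the hand-written rule in `extract_pairs` are all assumptions chosen for illustration, and the rule is written by hand rather than learned by FOIL. The sketch also assumes well-formed markup in which every element has an explicit closing tag.

```python
from html.parser import HTMLParser


class RelationalTreeBuilder(HTMLParser):
    """Turns an HTML tree into relational facts (hypothetical predicate names)."""

    def __init__(self):
        super().__init__()
        self.facts = []        # list of (predicate, arg1, arg2) tuples
        self.stack = []        # ids of the currently open elements
        self.last_child = {}   # parent id -> id of its most recent child
        self.counter = 0       # next node id to assign

    def handle_starttag(self, tag, attrs):
        node = self.counter
        self.counter += 1
        self.facts.append(("tag", node, tag))            # node type
        if self.stack:
            parent = self.stack[-1]
            self.facts.append(("child", parent, node))   # parent-child relation
            prev = self.last_child.get(parent)
            if prev is not None:
                self.facts.append(("next", prev, node))  # sibling order
            self.last_child[parent] = node
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Assumes every element is explicitly closed, in nesting order.
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            self.facts.append(("text", self.stack[-1], text))


def to_facts(html):
    """Parse an HTML string and return its relational description."""
    builder = RelationalTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.facts


def extract_pairs(facts):
    """Apply one hand-written rule: two adjacent td cells form a feature-value pair."""
    tags = {node: name for (pred, node, name) in facts if pred == "tag"}
    texts = {node: value for (pred, node, value) in facts if pred == "text"}
    pairs = []
    for (pred, left, right) in facts:
        if pred == "next" and tags.get(left) == "td" and tags.get(right) == "td":
            pairs.append((texts.get(left), texts.get(right)))
    return pairs


# Hypothetical product information sheet used only for this illustration.
SHEET = (
    "<table>"
    "<tr><td>Weight</td><td>2.1 kg</td></tr>"
    "<tr><td>Color</td><td>Black</td></tr>"
    "</table>"
)

facts = to_facts(SHEET)
pairs = extract_pairs(facts)
```

In a learning setting, facts of this kind would serve as background knowledge, and the few items extracted manually by the user would serve as positive examples from which the learner induces a rule comparable to the one hard-coded in `extract_pairs`.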