Interactive wrapper generation with minimal user effort

While much of the data on the web is unstructured in nature, there is also a significant amount of embedded structured data, such as product information on e-commerce sites or stock data on financial sites. A large amount of research has focused on the problem of generating wrappers, i.e., software tools that allow easy and robust extraction of structured data from text and HTML sources. In many applications, such as comparison shopping, data has to be extracted from many different sources, making manual coding of a wrapper for each source impractical. On the other hand, fully automatic approaches are often not reliable enough, resulting in low quality of the extracted data.We describe a complete system for semi-automatic wrapper generation that can be trained on different data sources in a simple interactive manner. Our goal is to minimize the amount of user effort for training reliable wrappers through design of a suitable training interface that is implemented based on a powerful underlying extraction language and a set of training and ranking algorithms. Our experiments show that our system achieves reliable extraction with a very small amount of user effort.

[1]  Georg Gottlob,et al.  Visual Web Information Extraction with Lixto , 2001, VLDB.

[2]  Erich J. Neuhold,et al.  Jedi: extracting and synthesizing information from the Web , 1998, Proceedings. 3rd IFCIS International Conference on Cooperative Information Systems (Cat. No.98EX122).

[3]  J. Rissanen,et al.  Modeling By Shortest Data Description* , 1978, Autom..

[4]  Valter Crescenzi,et al.  Grammars Have Exceptions , 1998, Inf. Syst..

[5]  Kyuseok Shim,et al.  XTRACT: a system for extracting document type descriptors from XML documents , 2000, SIGMOD '00.

[6]  Raymond J. Mooney,et al.  Relational Learning of Pattern-Match Rules for Information Extraction , 1999, CoNLL.

[7]  Hector Garcia-Molina,et al.  Template-based wrappers in the TSIMMIS system , 1997, SIGMOD '97.

[8]  Bing Liu,et al.  Web data extraction based on partial tree alignment , 2005, WWW '05.

[9]  Douglas H. Fisher,et al.  Knowledge Acquisition Via Incremental Conceptual Clustering , 1987, Machine Learning.

[10]  Line Eikvil,et al.  Information Extraction from World Wide Web - A Survey , 1999 .

[11]  Georg Gottlob,et al.  Declarative Information Extraction, Web Crawling, and Recursive Wrapping with Lixto , 2001, LPNMR.

[12]  Stephen Soderland,et al.  Learning Information Extraction Rules for Semi-Structured and Free Text , 1999, Machine Learning.

[13]  Arnaud Sahuguet,et al.  Building Light-Weight Wrappers for Legacy Web Data-Sources Using W4F , 1999, VLDB.

[14]  Alberto O. Mendelzon,et al.  WebOQL: restructuring documents, databases and Webs , 1998, Proceedings 14th International Conference on Data Engineering.

[15]  Chun-Nan Hsu,et al.  Generating Finite-State Transducers for Semi-Structured Data Extraction from the Web , 1998, Inf. Syst..

[16]  Calton Pu,et al.  XWRAP: an XML-enabled wrapper construction system for Web information sources , 2000, Proceedings of 16th International Conference on Data Engineering (Cat. No.00CB37073).

[17]  Chia-Hui Chang,et al.  IEPAD: information extraction based on pattern discovery , 2001, WWW '01.

[18]  Mark A. Gluck,et al.  Information, Uncertainty and the Utility of Categories , 1985 .

[19]  Hendrik Blockeel,et al.  Web mining research: a survey , 2000, SKDD.

[20]  Valter Crescenzi,et al.  RoadRunner: Towards Automatic Data Extraction from Large Web Sites , 2001, VLDB.

[21]  Craig A. Knoblock,et al.  Active Learning with Strong and Weak Views: A Case Study on Wrapper Induction , 2003, IJCAI.

[22]  William W. Cohen,et al.  A flexible learning system for wrapping tables and lists in HTML documents , 2002, WWW.

[23]  Bertram Ludäscher,et al.  Managing Semistructured Data with FLORID: A Deductive Object-Oriented Perspective , 1998, Inf. Syst..

[24]  Berthier A. Ribeiro-Neto,et al.  A brief survey of web data extraction tools , 2002, SGMD.

[25]  Brad Adelberg,et al.  NoDoSE—a tool for semi-automatically extracting structured and semistructured data from text documents , 1998, SIGMOD '98.

[26]  Nicholas Kushmerick,et al.  Wrapper verification , 2000, World Wide Web.

[27]  Alberto O. Mendelzon,et al.  WebOQL: restructuring documents, databases, and webs , 1999 .

[28]  Jorma Rissanen,et al.  Stochastic Complexity in Statistical Inquiry , 1989, World Scientific Series in Computer Science.

[29]  Nicholas Kushmerick,et al.  Wrapper Induction for Information Extraction , 1997, IJCAI.

[30]  K. Minton Extraction Patterns for Information Extraction Tasks : A Survey , 1999 .

[31]  Dayne Freitag,et al.  Information Extraction from HTML: Application of a General Machine Learning Approach , 1998, AAAI/IAAI.

[32]  Stefan Kuhlins,et al.  Toolkits for Generating Wrappers : A Survey of Software Toolkits for Automated Datat Extraction from Websites , 2003 .

[33]  Craig A. Knoblock,et al.  A hierarchical approach to wrapper induction , 1999, AGENTS '99.