Paper information - Interactive wrapper generation with minimal user effort

Interactive wrapper generation with minimal user effort

While much of the data on the web is unstructured in nature, there is also a significant amount of embedded structured data, such as product information on e-commerce sites or stock data on financial sites. A large amount of research has focused on the problem of generating wrappers, i.e., software tools that allow easy and robust extraction of structured data from text and HTML sources. In many applications, such as comparison shopping, data has to be extracted from many different sources, making manual coding of a wrapper for each source impractical. On the other hand, fully automatic approaches are often not reliable enough, resulting in low quality of the extracted data. We describe a complete system for semi-automatic wrapper generation that can be trained on different data sources in a simple interactive manner. Our goal is to minimize the amount of user effort for training reliable wrappers through design of a suitable training interface that is implemented based on a powerful underlying extraction language and a set of training and ranking algorithms. Our experiments show that our system achieves reliable extraction with a very small amount of user effort.

Torsten Suel | Utku Irmak
