Using Clustering and Edit Distance Techniques for Automatic Web Data Extraction

Many web sources provide access to an underlying database containing structured data. These data can usually be accessed only in HTML form, which makes it difficult for software programs to obtain them in structured form. Nevertheless, web sources usually encode data records using a consistent template or layout, and the implicit regularities in the template can be used to automatically infer the structure and extract the data. In this paper, we propose a set of novel techniques to address this problem. Several previous works have addressed the same problem, but most of them require multiple input pages, whereas our method requires only one. In addition, previous methods make assumptions about how data records are encoded into web pages that do not always hold in real websites. Finally, we have tested our techniques on a large number of real web sources and found them to be very effective.
