RoadRunner: Towards Automatic Data Extraction from Large Web Sites

The paper investigates techniques for extracting data from HTML sites through the use of automatically generated wrappers. To automate the wrapper generation and the data extraction process, the paper develops a novel technique to compare HTML pages and generate a wrapper based on their similarities and differences. Experimental results on real-life data-intensive Web sites confirm the feasibility of the approach.
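
To make the idea of deriving a wrapper from page similarities and differences concrete, the following is a minimal sketch, not the paper's algorithm. It assumes two pages that share one template, aligns their token sequences with Python's standard `difflib.SequenceMatcher`, keeps the tokens they share, and replaces each differing region with a data-field placeholder. The names `tokenize`, `infer_wrapper`, `extract`, and the `#PCDATA` marker are illustrative choices, and the sketch does not attempt to handle optional or repeated structures.

```python
import re
from difflib import SequenceMatcher

FIELD = "#PCDATA"  # hypothetical placeholder marking a data field
TOKEN_RE = re.compile(r"<[^>]+>|[^<]+")  # a tag, or a run of text


def tokenize(html):
    """Split HTML into tag tokens and stripped, non-empty text tokens."""
    tokens = []
    for match in TOKEN_RE.finditer(html):
        token = match.group().strip()
        if token:
            tokens.append(token)
    return tokens


def infer_wrapper(page_a, page_b):
    """Keep tokens common to both pages; turn differences into fields."""
    a, b = tokenize(page_a), tokenize(page_b)
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    wrapper = []
    for op, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if op == "equal":
            wrapper.extend(a[i1:i2])  # shared template tokens
        elif not wrapper or wrapper[-1] != FIELD:
            wrapper.append(FIELD)  # differing region becomes one field
    return wrapper


def extract(wrapper, page):
    """Read field values from a page.

    Simplifying assumption: every field corresponds to exactly one
    text token, so wrapper and page have the same number of tokens.
    """
    tokens = tokenize(page)
    if len(tokens) != len(wrapper):
        raise ValueError("page does not match the wrapper")
    values = []
    for expected, actual in zip(wrapper, tokens):
        if expected == FIELD:
            values.append(actual)
        elif expected != actual:
            raise ValueError("page does not match the wrapper")
    return values


page_1 = "<html><b>Title:</b> Database Primer <i>1998</i></html>"
page_2 = "<html><b>Title:</b> Computer Systems <i>1999</i></html>"

wrapper = infer_wrapper(page_1, page_2)
print(wrapper)
print(extract(wrapper, page_2))
```

Here the two sample pages differ only in the title text and the year, so the inferred wrapper keeps the shared tags and the "Title:" label and exposes two fields. A generic sequence alignment is used only to keep the sketch short; it is a stand-in for illustration, not a claim about how the paper's matching technique works.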
