Automatic repairing of Web wrappers by combining redundant views

We address the problem of automatically maintaining the Web wrappers that data integration systems use to encapsulate access to Web information providers. Wrapper maintenance is critical because providers often change the format and/or structure of their pages, making wrappers inoperable. The solution we propose extends the conventional wrapper architecture with a novel component for automatic maintenance and recovery. We treat automatic recovery as a special type of classification problem and use ensemble methods from machine learning to build alternative views of provider pages. We combine the extraction rules of conventional wrappers with content features of the extracted information to recover accurately from three types of format change, namely content, context, and structural changes. We report recovery performance for format changes at widely used Web providers.
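The idea of combining redundant views can be illustrated with a minimal sketch. It is not the paper's algorithm: the feature set, the function names (`content_features`, `train_content_view`, `recover`, `relearn_rules`), the tag-prefix rule format, and the sample data are all assumptions made for illustration, and a simple fallback stands in for the ensemble classifiers the abstract mentions. The sketch handles only a context change: when the tag-based extraction rule stops firing, a content view trained on previously extracted values labels the new tokens, and those labels are then used to re-derive the context rules.

```python
import re
from collections import Counter

def content_features(text):
    """Coarse content view of an extracted string (hypothetical feature set)."""
    return (
        "digits" if re.fullmatch(r"[\d.,]+", text) else "text",
        "short" if len(text) <= 8 else "long",
    )

def train_content_view(samples):
    """Count labels per content-feature tuple from (text, label) pairs
    that the wrapper extracted before the format change."""
    counts = {}
    for text, label in samples:
        counts.setdefault(content_features(text), Counter())[label] += 1
    return counts

def recover(tokens, rules, content_model):
    """Label (prefix, text) tokens. Use the context rule when it still fires,
    otherwise fall back to the majority label of the content view."""
    labelled = []
    for prefix, text in tokens:
        label = rules.get(prefix)
        if label is None:
            votes = content_model.get(content_features(text))
            label = votes.most_common(1)[0][0] if votes else None
        labelled.append((text, label))
    return labelled

def relearn_rules(tokens, labelled):
    """Re-derive context rules from recovered labels (majority label per prefix)."""
    by_prefix = {}
    for (prefix, _), (_, label) in zip(tokens, labelled):
        if label is not None:
            by_prefix.setdefault(prefix, Counter())[label] += 1
    return {p: c.most_common(1)[0][0] for p, c in by_prefix.items()}

# Values extracted while the old wrapper still worked (made-up data).
old_samples = [
    ("19.99", "price"), ("5.00", "price"),
    ("Learning Python", "title"), ("Web Data Mining", "title"),
]
old_rules = {"<b>": "price", "<i>": "title"}

# After a context change the provider uses different tags.
new_tokens = [("<em>", "Wrapper Induction"), ("<span>", "42.50")]

model = train_content_view(old_samples)
recovered = recover(new_tokens, old_rules, model)
new_rules = relearn_rules(new_tokens, recovered)
print(recovered)
print(new_rules)
```

The two views are redundant in the sense that either one alone can label a token; recovery is possible here only because the change affected one view and left the other intact. A change that altered both tags and content at once would defeat this simplified fallback.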
