Documentum ECI self-repairing wrappers: performance analysis

Documentum Enterprise Content Integration (ECI) services is content integration middleware that provides one-query access to intranet and Internet content resources. The ECI Adapter technology gives any application an interface for extracting data and metadata from unstructured Web pages. It offers a unique framework for wrapper production, automatic recovery, and maintenance, developed at Xerox Research Centre Europe and based on state-of-the-art algorithms from machine learning and grammatical inference. In this presentation we analyze the performance of ECI adapters deployed in current commercial installations. We draw on daily test reports for all commercially deployed ECI adapters, collected from June 2003 to September 2005. Using these reports, we analyze different aspects of the wrapper technology.
