MAVE: Multilevel wrApper Verification systEm

Wrappers are pieces of software that extract data from websites and structure them for further application processing. Unfortunately, websites evolve continuously, and structural changes happen without forewarning, which usually causes wrappers to work incorrectly. Wrapper maintenance is therefore necessary to detect whether a wrapper is extracting erroneous data. The usual solution is to use verification models that check whether the wrapper's current output is statistically similar to the output it produced when it was successfully invoked in the past. Current proposals have some weaknesses: they assume that the data used to build these models are homogeneous, independent, or representative enough, or that they follow a single predefined mathematical model. In this paper, we present MAVE, a novel multilevel wrapper verification system based on one-class classification techniques that overcomes these weaknesses. The experimental results show that our proposal outperforms current solutions in accuracy.
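To make the verification idea concrete, the following is a minimal sketch of a one-class verifier that learns only from outputs of past successful invocations and flags a new output that deviates too much from them. It is not the MAVE system itself: the three features, the z-score decision rule, the `threshold` and `min_std` parameters, and the names `features` and `OneClassVerifier` are all assumptions chosen for illustration, and the sketch does not model the multilevel aspect.

```python
from statistics import mean, pstdev


def features(values):
    """Map one wrapper output (a list of extracted strings) to numeric features.

    These three features are illustrative assumptions, not taken from the paper.
    """
    total_chars = sum(len(v) for v in values) or 1  # avoid division by zero
    return [
        float(len(values)),                                # number of extracted values
        mean(len(v) for v in values) if values else 0.0,   # average value length
        sum(c.isdigit() for v in values for c in v) / total_chars,  # share of digits
    ]


class OneClassVerifier:
    """Hypothetical one-class model: per-feature mean and deviation of valid outputs."""

    def __init__(self, threshold=3.0, min_std=0.05):
        self.threshold = threshold  # maximum accepted z-score (assumed value)
        self.min_std = min_std      # floor so near-constant features stay usable
        self.means = []
        self.stds = []

    def fit(self, valid_outputs):
        # Train on positive examples only: outputs known to be correct.
        rows = [features(output) for output in valid_outputs]
        columns = list(zip(*rows))
        self.means = [mean(col) for col in columns]
        self.stds = [max(pstdev(col), self.min_std) for col in columns]
        return self

    def is_valid(self, output):
        # Accept the output only if every feature lies close to the training data.
        return all(
            abs(x - m) / s <= self.threshold
            for x, m, s in zip(features(output), self.means, self.stds)
        )


# Outputs from past successful invocations of a hypothetical price wrapper.
past_outputs = [
    ["10.99", "5.49", "7.25"],
    ["3.50", "12.00", "8.75"],
    ["21.10", "4.99", "6.00", "9.95"],
]
verifier = OneClassVerifier().fit(past_outputs)
```

Calling `verifier.is_valid(...)` on a freshly extracted output then acts as the verification step. With so few training outputs the boundary is very tight, so a realistic verifier would need many more past outputs and richer features.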
