Applying ant colony hybrid metaheuristics to wrapper verification

First application of the Ant Colony metaheuristic to verify information extracted by web wrappers.
A new multilevel verification system that improves on the results achieved by current techniques.
Enumeration of the weaknesses of current techniques.
Reformulation of the wrapper verification problem as a combinatorial optimization problem.
Application of non-parametric testing techniques to ascertain the statistical significance of the results.

Wrappers are pieces of software used to extract data from websites and structure them for further application processing. Unfortunately, websites evolve continuously and structural changes happen without forewarning, which usually causes wrappers to work incorrectly. Wrapper maintenance is therefore necessary to detect whether a wrapper is extracting erroneous data. The solution consists of using verification models to detect whether the wrapper's output is statistically similar to the output the same wrapper produced when it was successfully invoked in the past. Current proposals have weaknesses: they assume either that the data used to build these models are homogeneous, or that the features of the data set can be mapped to an n-dimensional space of independent dimensions, even when the features are correlated. In this paper, a new verification system based on the Best-Worst Ant System (BWAS) is presented to overcome these weaknesses. The experimental results show an accuracy improvement of 7.5% over current solutions.
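
To make the general idea of a verification model concrete, the following is a minimal sketch and not the BWAS-based system described in this paper. It assumes three hypothetical numeric features per extracted string (length, digit count, token count) and flags a new value when any feature lies more than `max_z` standard deviations from what was observed in past, known-good extractions. The function names, the features, and the threshold are all illustrative assumptions. Note that this baseline checks each feature independently, which is exactly the kind of independence assumption the abstract identifies as a weakness of current proposals.

```python
from statistics import mean, stdev


def features(value: str) -> list[float]:
    """Hypothetical numeric features of one extracted string (an assumption)."""
    return [
        float(len(value)),                         # total characters
        float(sum(ch.isdigit() for ch in value)),  # digit characters
        float(len(value.split())),                 # whitespace-separated tokens
    ]


def fit_profile(past_values: list[str]) -> list[tuple[float, float]]:
    """Per-feature (mean, standard deviation) over known-good past outputs."""
    columns = zip(*(features(v) for v in past_values))
    return [(mean(col), stdev(col)) for col in columns]


def looks_valid(value: str, profile: list[tuple[float, float]],
                max_z: float = 3.0) -> bool:
    """True if every feature is within max_z standard deviations of the past."""
    for x, (mu, sigma) in zip(features(value), profile):
        if sigma == 0:
            # The feature never varied in the past: require an exact match.
            if x != mu:
                return False
        elif abs(x - mu) / sigma > max_z:
            return False
    return True


# Usage: prices extracted correctly in the past, then two new outputs.
past = ["$12.99", "$8.50", "$104.00", "$7.25"]
profile = fit_profile(past)
print(looks_valid("$15.75", profile))
print(looks_valid("Page not found", profile))
```

In this sketch a value shaped like the past prices passes, while an error message extracted after a site change fails on length and token count. A real verifier would need features and a decision rule that cope with heterogeneous data and correlated features.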
