Data Source Selection Based on an Improved Greedy Genetic Algorithm

The development of information technology has led to a sharp increase in data volume. This tremendous amount of data has become a strategic asset that allows businesses to derive superior market intelligence or improve existing operations. People expect to consolidate and utilize as much data as possible. However, integrating too much data incurs a large cost, such as the cost of purchasing and cleaning. Therefore, when resources are limited, the goal is to obtain as much integration value as possible. In addition, the uneven quality of data sources makes the multi-source selection task more difficult, and low-quality data sources can seriously degrade integration results without providing the desired quality gain. In this paper, we study how to balance data gain and cost in source selection, specifically, how to maximize the gain of data under a given budget. We propose an improved greedy genetic algorithm (IGGA) to solve the source selection problem, and carry out a wide range of experimental evaluations on real and synthetic datasets. The empirical results favor the proposed algorithm in terms of solution quality.
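To make the problem statement concrete, the following is a minimal sketch of budgeted source selection treated as a 0-1 knapsack problem and solved by a genetic algorithm with a greedy repair step. It is not the IGGA described in this paper, whose operators are not given in the abstract. It assumes that each source has a fixed, additive gain and a positive cost, which ignores overlap between sources. All names (`greedy_select`, `repair`, `ga_select`) and all parameter values are hypothetical and chosen only for illustration.

```python
import random


def total(bits, values):
    """Sum the values of the selected sources (bits[i] == 1)."""
    return sum(v for b, v in zip(bits, values) if b)


def greedy_select(gains, costs, budget):
    """Baseline: take sources by descending gain/cost ratio while they fit."""
    order = sorted(range(len(gains)), key=lambda i: gains[i] / costs[i], reverse=True)
    bits, spent = [0] * len(gains), 0
    for i in order:
        if spent + costs[i] <= budget:
            bits[i] = 1
            spent += costs[i]
    return bits


def repair(bits, gains, costs, budget):
    """Greedy repair: drop low-ratio sources until feasible, then fill with high-ratio ones."""
    bits = bits[:]
    ascending = sorted(range(len(bits)), key=lambda i: gains[i] / costs[i])
    spent = total(bits, costs)
    for i in ascending:  # remove the worst ratios first
        if spent <= budget:
            break
        if bits[i]:
            bits[i] = 0
            spent -= costs[i]
    for i in reversed(ascending):  # add the best ratios that still fit
        if not bits[i] and spent + costs[i] <= budget:
            bits[i] = 1
            spent += costs[i]
    return bits


def ga_select(gains, costs, budget, pop_size=30, generations=100, p_mut=0.05, seed=0):
    """Genetic search over selection vectors; assumes at least two sources."""
    rng = random.Random(seed)
    n = len(gains)

    def fitness(bits):
        return total(bits, gains)

    # Seed the population with the greedy solution plus repaired random vectors.
    pop = [greedy_select(gains, costs, budget)]
    while len(pop) < pop_size:
        candidate = [rng.randint(0, 1) for _ in range(n)]
        pop.append(repair(candidate, gains, costs, budget))

    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        next_pop = pop[:2]  # elitism: the two best individuals survive unchanged
        while len(next_pop) < pop_size:
            a = max(rng.sample(pop, 3), key=fitness)  # tournament selection
            b = max(rng.sample(pop, 3), key=fitness)
            cut = rng.randint(1, n - 1)  # one-point crossover
            child = a[:cut] + b[cut:]
            child = [1 - g if rng.random() < p_mut else g for g in child]  # bit-flip mutation
            next_pop.append(repair(child, gains, costs, budget))
        pop = next_pop
    return max(pop, key=fitness)


# Illustrative data only: three sources with made-up gains and costs.
gains, costs, budget = [60, 100, 120], [10, 20, 30], 50
best = ga_select(gains, costs, budget)
print(best, total(best, gains), total(best, costs))
```

Because the greedy solution is placed in the initial population and the best individuals are carried over each generation, the returned selection is never worse than the greedy baseline under this additive gain model. Every individual passes through `repair`, so the budget constraint holds by construction rather than through a penalty term.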
