Distributed genetic algorithm to big data clustering

Clustering algorithms have emerged as a powerful learning tool for accurately analyzing the massive amounts of data generated by current applications and smart technologies. Their main objective is to categorize data into clusters such that objects are placed in the same cluster when they are similar according to specific metrics. There is a wide and diverse body of knowledge on clustering, and there have been attempts to apply these algorithms and scale them to today's data. However, one major challenge is scalability: the algorithms must cope with the computational cost of clustering big data. In this paper, we describe a mapping between the graph clustering problem and data clustering. Using genetic algorithms and multi-objective optimization together with distributed graph stores, the proposed algorithm (1) transforms big data into distributed RDF graphs, (2) applies a novel distributed encoding technique, (3) scales to big RDF graphs, and (4) produces clusters by maximizing graph modularity as its main objective. Results on LUBM-generated big data show that the algorithm (5) handles the challenges posed by such data and (6) produces results comparable to those of peer clustering algorithms.
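
The abstract names graph modularity as the main objective but does not give its formulation. As a rough illustration only, the sketch below computes the standard Newman modularity of a partition of an undirected, unweighted graph, Q = sum over communities c of (L_c / m) - (d_c / 2m)^2, where m is the number of edges, L_c the number of edges inside c, and d_c the total degree of the nodes in c. The function name `modularity`, the edge-list input, and the toy graph are assumptions made for this example; they are not taken from the paper, whose distributed RDF encoding is not reproduced here.

```python
from collections import defaultdict


def modularity(edges, community):
    """Newman modularity Q for an undirected, unweighted graph.

    edges: iterable of (u, v) pairs, each undirected edge listed once.
    community: dict mapping every node to its community label.
    (Both argument shapes are assumptions for this sketch.)
    """
    m = 0                       # total number of edges
    degree = defaultdict(int)   # degree of each node
    intra = defaultdict(int)    # edges with both endpoints in one community
    for u, v in edges:
        m += 1
        degree[u] += 1
        degree[v] += 1
        if community[u] == community[v]:
            intra[community[u]] += 1
    if m == 0:
        return 0.0

    # Total degree of the nodes in each community.
    degree_sum = defaultdict(int)
    for node, d in degree.items():
        degree_sum[community[node]] += d

    # Fraction of edges inside each community minus the fraction
    # expected if edges were placed at random with the same degrees.
    return sum(intra[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)


# Toy graph: two triangles joined by a single bridge edge.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_clusters = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_cluster = {n: "a" for n in range(6)}

q_split = modularity(toy_edges, two_clusters)
q_merged = modularity(toy_edges, one_cluster)
```

In a genetic algorithm of the kind described, a function like this could serve as a fitness score: each candidate partition is scored and higher-modularity candidates are favored. On the toy graph, splitting along the bridge scores higher than putting every node in one cluster.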
