GANY: A genetic spectral-based clustering algorithm for Large Data Analysis

Data analysis is currently one of the fastest-growing fields, and the sheer volume of data makes its analysis a challenging area. The most relevant techniques fall into two sub-domains: Classification and Clustering. Although Classification continues to grow and evolve, Clustering is one of the more promising techniques for Large Data Analysis, because Classification needs human supervision, which makes the analysis more expensive. Clustering is a blind (unsupervised) process that groups data by similarity. Currently, the most relevant methods are those based on manifold identification. The main idea behind these techniques is to group data according to the shape they define in the space. Several techniques based on Spectral Analysis address this problem. However, these techniques are not suitable for Large Data, because they require a large amount of memory to determine the groups. In addition, they suffer from convergence to local minima, a problem common in statistical methodologies. This work focuses on combining Genetic Algorithms with spectral-based methodologies to deal with the Large Data Analysis problem. We combine the Nyström method with the Spectrum to generate an approximation of the problem that provides an accurate summary of the search space. A genetic algorithm is then used to reduce the local-minimum convergence problem in the new search space. The performance of this methodology has been evaluated in terms of accuracy on both synthetic and real-world datasets extracted from the literature.
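To make the memory argument concrete, the following is a minimal sketch of a Nyström approximation of a similarity matrix. It illustrates the general technique only and is not the GANY implementation: the RBF kernel, the parameter `sigma`, the cutoff `eps`, and the function names `rbf_kernel` and `nystrom_features` are all assumptions introduced here. Instead of storing the full n x n similarity matrix K, the sketch evaluates the kernel only against m landmark points and returns an n x m feature matrix Z such that Z Zᵀ approximates K.

```python
import numpy as np


def rbf_kernel(a, b, sigma=1.0):
    """RBF similarity between the rows of a and the rows of b (assumed kernel)."""
    sq = (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2.0 * a @ b.T
    # Clip tiny negative values caused by floating-point rounding.
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))


def nystrom_features(x, landmark_idx, sigma=1.0, eps=1e-10):
    """Return Z (n x r, with r <= m) such that Z @ Z.T approximates the full kernel.

    Only an n x m block and an m x m block of the kernel are ever formed,
    where m = len(landmark_idx); the n x n matrix is never built.
    """
    landmarks = x[landmark_idx]
    c = rbf_kernel(x, landmarks, sigma)          # n x m block
    w = rbf_kernel(landmarks, landmarks, sigma)  # m x m block
    w = (w + w.T) / 2.0                          # enforce exact symmetry
    vals, vecs = np.linalg.eigh(w)
    keep = vals > eps                            # drop numerically null directions
    # Z Z^T = C U diag(1/vals) U^T C^T, i.e. C times the pseudo-inverse of W times C^T.
    return c @ vecs[:, keep] / np.sqrt(vals[keep])
```

With n points and m landmarks, the storage cost drops from O(n²) to O(nm). A spectral clustering step, or a genetic search over cluster assignments, could then operate on the rows of `Z` rather than on the full matrix; how the landmarks are chosen (for example, uniformly at random) is a separate design decision not fixed by this sketch.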
