Big Data Clustering using Data Streams Approach

In this paper, we propose processing big data with a data stream approach. The data set is divided into subsets, and each subset is treated as a time window of a data stream. Our approach relies on neighborhood-based clustering. Instead of processing new elements one by one, we process each group of new elements at once. Each new group is clustered using neighborhood graphs, and the resulting clusters are then used to incrementally build a representative graph of the data. This graph is visualized in real time with dedicated visualizations that reflect the processing algorithm. To validate the approach, we apply it to several data streams and compare it with known data stream clustering approaches.
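
The pipeline described above (split into windows, cluster each window with a neighborhood graph, fold the clusters into a representative graph) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not say which neighborhood graph is used or how clusters are attached to the representative graph, so the sketch assumes a fixed-radius neighborhood graph, takes connected components as clusters, and represents each cluster by its centroid. All names (`windows`, `cluster_window`, `update_graph`, `stream_cluster`) and the parameters `window_size` and `radius` are hypothetical.

```python
import math
from itertools import combinations


def windows(data, size):
    """Yield consecutive chunks of `data`; each chunk plays the role of a time window."""
    for start in range(0, len(data), size):
        yield data[start:start + size]


def cluster_window(points, radius):
    """Cluster one window (assumed scheme): link points at distance <= radius
    and return the connected components of that neighborhood graph."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(points)), 2):
        if math.dist(points[i], points[j]) <= radius:
            parent[find(i)] = find(j)

    groups = {}
    for i, p in enumerate(points):
        groups.setdefault(find(i), []).append(p)
    return list(groups.values())


def centroid(cluster):
    """Mean point of a cluster, used here as its representative."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def update_graph(nodes, edges, clusters, radius):
    """Add one representative node per cluster and link it to every existing
    node whose representative lies within `radius` (assumed linking rule)."""
    for cluster in clusters:
        rep = centroid(cluster)
        new_id = len(nodes)
        for old_id, (old_rep, _size) in enumerate(nodes):
            if math.dist(rep, old_rep) <= radius:
                edges.add((old_id, new_id))
        nodes.append((rep, len(cluster)))


def stream_cluster(data, window_size, radius):
    """Process the data window by window and return the representative graph."""
    nodes, edges = [], set()
    for window in windows(data, window_size):
        update_graph(nodes, edges, cluster_window(window, radius), radius)
    return nodes, edges


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (10, 10), (10, 11), (0, 0.5), (10, 10.5)]
    nodes, edges = stream_cluster(data, window_size=4, radius=1.5)
    print(nodes)
    print(sorted(edges))
```

Each window is clustered as a whole rather than point by point, and only the cluster representatives are kept, so the representative graph grows with the number of clusters rather than with the number of raw points.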
