Incremental density-based ensemble clustering over evolving data streams

Abstract: Recent advances in smart meter technology have enabled the real-time collection of information about customer power consumption. The measurements are generated continuously, and in some cases, e.g. in industrial smart metering, the data exchange rates fluctuate strongly. Storing, querying, and mining such smart meter streaming data, which contain a large number of missing and sparse values, are computationally challenging tasks. To address these problems, we propose a new method called incremental density-based ensemble clustering (IDEStream) for incremental segmentation of various kinds of factories based on their electricity consumption data. It exploits a gamma mixture model to suppress the influence of sparse data units in the data streams that arrive sequentially within a time window and then generates a clustering from the processed data of that window. IDEStream uses a unique incremental ensemble approach to aggregate the clusterings of subsequent time windows. Experimental results on data streams collected by smart meters from manufacturing factories in Guangdong province, China, show that the proposed algorithm outperforms several state-of-the-art data stream clustering algorithms. The obtained segmentation has numerous applications, one example being the flexible definition of customer rates.
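The abstract does not specify how the per-window clusterings are aggregated, so the following is only a minimal sketch of one generic way to do incremental ensemble aggregation: a running co-association (evidence accumulation) matrix. It is not the IDEStream procedure. The class name `IncrementalCoAssociation`, the `threshold` parameter, and the assumption that the same fixed set of factories is labelled in every window are all hypothetical choices made for illustration.

```python
from typing import List


class IncrementalCoAssociation:
    """Running co-association counts over per-window clusterings.

    Illustrative stand-in only; not the aggregation scheme of IDEStream.
    Assumes every window labels the same n_items entities (e.g. factories).
    """

    def __init__(self, n_items: int) -> None:
        self.n_items = n_items
        self.n_windows = 0
        # counts[i][j] = number of windows in which items i and j shared a cluster
        self.counts = [[0] * n_items for _ in range(n_items)]

    def update(self, labels: List[int]) -> None:
        """Fold the clustering of one time window into the running counts."""
        if len(labels) != self.n_items:
            raise ValueError("one label per item is required")
        for i in range(self.n_items):
            for j in range(self.n_items):
                if labels[i] == labels[j]:
                    self.counts[i][j] += 1
        self.n_windows += 1

    def consensus(self, threshold: float = 0.5) -> List[int]:
        """Link items co-clustered in more than `threshold` of the windows
        and return the connected components as consensus labels."""
        if self.n_windows == 0:
            raise ValueError("no windows seen yet")
        result = [-1] * self.n_items
        next_label = 0
        for start in range(self.n_items):
            if result[start] != -1:
                continue
            result[start] = next_label
            stack = [start]
            while stack:
                i = stack.pop()
                for j in range(self.n_items):
                    linked = self.counts[i][j] / self.n_windows > threshold
                    if result[j] == -1 and linked:
                        result[j] = next_label
                        stack.append(j)
            next_label += 1
        return result


# Three hypothetical windows over four factories; label values are arbitrary per window.
ens = IncrementalCoAssociation(4)
for window_labels in ([0, 0, 1, 1], [1, 1, 0, 0], [0, 0, 0, 1]):
    ens.update(window_labels)
labels = ens.consensus()
```

Because only pairwise co-membership is accumulated, the sketch does not need cluster labels to match across windows, and each new window costs one pass over the item pairs without revisiting earlier data.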
