Clustering distributed data streams in peer-to-peer environments

This paper describes a technique for clustering homogeneously distributed data in a peer-to-peer environment such as a sensor network. The proposed technique is based on the principles of the K-Means algorithm. It works in a localized, asynchronous manner: each node communicates with its neighboring nodes. The paper offers an extensive theoretical analysis of the algorithm that bounds the error of the distributed clustering process relative to the centralized approach, which requires downloading all the observed data to a single site. Experimental results show that the communication cost of the proposed approach is significantly smaller than that of transmitting all the data to a central location and applying a conventional clustering algorithm there. Communication cost is an important consideration in sensor networks, whose nodes typically have limited battery power. At the same time, the accuracy of the obtained centroids is high and the number of incorrectly labeled samples is small.
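
The abstract does not spell out the update rule, so the following is only a minimal sketch of one way a neighbor-based K-Means could look, not the paper's algorithm. It assumes synchronous rounds (the paper's method is asynchronous), a fixed neighbor graph, and a count-weighted merge of per-cluster statistics; the names `local_kmeans_step`, `p2p_kmeans`, `node_data`, and `neighbors` are hypothetical.

```python
import numpy as np


def local_kmeans_step(data, centroids):
    """Assign local points to their nearest centroid.

    Returns per-cluster coordinate sums and point counts, which are the
    only statistics a node would need to share with its neighbors.
    """
    k, d = centroids.shape
    # Pairwise distances between every local point and every centroid.
    dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    sums = np.zeros((k, d))
    counts = np.zeros(k)
    for j in range(k):
        members = data[labels == j]
        sums[j] = members.sum(axis=0)
        counts[j] = len(members)
    return sums, counts


def p2p_kmeans(node_data, neighbors, init_centroids, rounds=20):
    """Neighbor-only K-Means sketch (assumed synchronous rounds).

    node_data: list of (n_i, d) arrays, one per node.
    neighbors: dict mapping a node index to a list of neighbor indices.
    """
    n = len(node_data)
    centroids = [init_centroids.copy() for _ in range(n)]
    for _ in range(rounds):
        # Each node summarizes its own data against its own centroids.
        stats = [local_kmeans_step(node_data[i], centroids[i]) for i in range(n)]
        new_centroids = []
        for i in range(n):
            sums = stats[i][0].copy()
            counts = stats[i][1].copy()
            # Merge statistics received from immediate neighbors only.
            for j in neighbors[i]:
                sums += stats[j][0]
                counts += stats[j][1]
            updated = centroids[i].copy()
            nonempty = counts > 0  # keep the old centroid for empty clusters
            updated[nonempty] = sums[nonempty] / counts[nonempty][:, None]
            new_centroids.append(updated)
        centroids = new_centroids
    return centroids
```

In this sketch a node sends only k sums and k counts per round instead of its raw samples, which illustrates where the communication saving over centralizing all data would come from.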
