Efficient clustering of uncertain data streams

Clustering uncertain data streams has recently become one of the most challenging tasks in data management, both because tuples arriving at high speed must be processed under strict space and time constraints and because uncertain data are difficult to handle. Prior work on clustering data streams focuses on devising complicated synopsis data structures that summarize a stream into a small number of micro-clusters from which important statistics can be computed conveniently, such as the Clustering Feature (CF) (Zhang et al. in Proceedings of ACM SIGMOD, pp 103–114, 1996) for deterministic data and the Error-based Clustering Feature (ECF) (Aggarwal and Yu in Proceedings of ICDE, 2008) for uncertain data. However, ECF handles only attribute-level uncertainty; existential uncertainty, the other kind of uncertainty, has not yet been addressed. In this paper, we propose a novel data structure, the Uncertain Feature (UF), to summarize data streams with both kinds of uncertainty: UF is space-efficient, has additive and subtractive properties, and makes complicated statistics easy to compute. Our first attempt enhances previous streaming approaches, including CluStream (Aggarwal et al. in Proceedings of VLDB, 2003) and UMicro (Aggarwal and Yu in Proceedings of ICDE, 2008), to handle the sliding-window model by replacing their original synopses with UF. We show that such methods cannot achieve high efficiency. Our second attempt devises a novel algorithm, cluUS, that handles the sliding-window model using the UF structure. Detailed analysis and thorough experiments on synthetic and real data sets confirm the advantages of our proposed method.
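To make the additive and subtractive properties concrete, the sketch below illustrates them on a classic deterministic Clustering Feature, stored as a count, a per-dimension linear sum, and a per-dimension sum of squares. It is not the UF structure proposed in the paper, whose definition is not given in this abstract; the class name `ClusteringFeature`, its fields, and the per-dimension form of the squared sum are assumptions made only for illustration.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class ClusteringFeature:
    """Hypothetical CF-style synopsis: (count, linear sum, sum of squares)."""
    n: int
    ls: Tuple[float, ...]  # per-dimension linear sum
    ss: Tuple[float, ...]  # per-dimension sum of squares (an assumed layout)

    @classmethod
    def from_point(cls, x: Sequence[float]) -> "ClusteringFeature":
        # A single point is summarized by itself and its squares.
        return cls(1, tuple(x), tuple(v * v for v in x))

    def __add__(self, other: "ClusteringFeature") -> "ClusteringFeature":
        # Additive property: summaries of disjoint point sets merge component-wise.
        return ClusteringFeature(
            self.n + other.n,
            tuple(a + b for a, b in zip(self.ls, other.ls)),
            tuple(a + b for a, b in zip(self.ss, other.ss)),
        )

    def __sub__(self, other: "ClusteringFeature") -> "ClusteringFeature":
        # Subtractive property: remove the summary of points that left the window.
        return ClusteringFeature(
            self.n - other.n,
            tuple(a - b for a, b in zip(self.ls, other.ls)),
            tuple(a - b for a, b in zip(self.ss, other.ss)),
        )

    def centroid(self) -> Tuple[float, ...]:
        return tuple(s / self.n for s in self.ls)

    def variance(self) -> Tuple[float, ...]:
        # Per-dimension variance, E[x^2] - E[x]^2, computed from the summary alone.
        return tuple(q / self.n - (s / self.n) ** 2
                     for s, q in zip(self.ls, self.ss))


# Three points arrive; the oldest one later expires from the sliding window.
points = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]
summaries = [ClusteringFeature.from_point(p) for p in points]
total = summaries[0] + summaries[1] + summaries[2]
window = total - summaries[0]  # drop the expired point without rescanning the rest
```

In a sliding-window setting, subtraction is what allows expired tuples to be discarded from a summary in constant time per dimension, which is why the abstract highlights it alongside additivity.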
