Metric and trigonometric pruning for clustering of uncertain data in 2D geometric space

We study the problem of clustering data objects with location uncertainty. In our model, a data object is represented by an uncertainty region over which a probability density function (pdf) is defined. One method to cluster such uncertain objects is to apply the UK-means algorithm [1], an extension of the traditional K-means algorithm, which assigns each object to the cluster whose representative has the smallest expected distance from it. For arbitrary pdf, calculating the expected distance between an object and a cluster representative requires expensive integration of the pdf. We study two pruning methods: pre-computation (PC) and cluster shift (CS) that can significantly reduce the number of integrations computed. Both pruning methods rely on good bounding techniques. We propose and evaluate two such techniques that are based on metric properties (Met) and trigonometry (Tri). Our experimental results show that Tri offers a very high pruning power. In some cases, more than 99.9% of the expected distance calculations are pruned. This results in a very efficient clustering algorithm.

[1]  Wendi Heinzelman,et al.  Proceedings of the 33rd Hawaii International Conference on System Sciences- 2000 Energy-Efficient Communication Protocol for Wireless Microsensor Networks , 2022 .

[2]  Sunil Prabhakar,et al.  Querying imprecise data in moving object environments , 2003, IEEE Transactions on Knowledge and Data Engineering.

[3]  Reynold Cheng,et al.  Efficient Clustering of Uncertain Data , 2006, Sixth International Conference on Data Mining (ICDM'06).

[4]  Zhiwen Yu,et al.  Mining Uncertain Data in Low-dimensional Subspace , 2006, 18th International Conference on Pattern Recognition (ICPR'06).

[5]  Edward J. Coyle,et al.  An energy efficient hierarchical clustering algorithm for wireless sensor networks , 2003, IEEE INFOCOM 2003. Twenty-second Annual Joint Conference of the IEEE Computer and Communications Societies (IEEE Cat. No.03CH37428).

[6]  Walid G. Aref,et al.  Casper*: Query processing for location services without compromising privacy , 2006, TODS.

[7]  Sunil Prabhakar,et al.  Evaluating probabilistic queries over imprecise data , 2003, SIGMOD '03.

[8]  J. C. Dunn,et al.  A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters , 1973 .

[9]  Manabu Ichino,et al.  Generalized Minkowski metrics for mixed feature-type data analysis , 1994, IEEE Trans. Syst. Man Cybern..

[10]  Hans-Peter Kriegel,et al.  OPTICS: ordering points to identify the clustering structure , 1999, SIGMOD '99.

[11]  Charles Elkan,et al.  Using the Triangle Inequality to Accelerate k-Means , 2003, ICML.

[12]  D.M. Mount,et al.  An Efficient k-Means Clustering Algorithm: Analysis and Implementation , 2002, IEEE Trans. Pattern Anal. Mach. Intell..

[13]  Yu Zhang,et al.  Preserving User Location Privacy in Mobile Data Management Infrastructures , 2006, Privacy Enhancing Technologies.

[14]  Thanasis Hadzilacos,et al.  Advances in Spatial and Temporal Databases , 2015, Lecture Notes in Computer Science.

[15]  Hans-Peter Kriegel,et al.  A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise , 1996, KDD.

[16]  A T de CarvalhoFrancisco de,et al.  Clustering of interval data based on city-block distances , 2004 .

[17]  Reynold Cheng,et al.  Efficient Evaluation of Imprecise Location-Dependent Queries , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[18]  Ouri Wolfson,et al.  Accuracy and Resource Concumption in Tracking and Location Prediction , 2003, SSTD.

[19]  Reynold Cheng,et al.  Reducing UK-Means to K-Means , 2007 .

[20]  Hans-Peter Kriegel,et al.  Probabilistic Similarity Join on Uncertain Data , 2006, DASFAA.

[21]  Julius T. Tou,et al.  Information Systems , 1973, GI Jahrestagung.

[22]  Hans-Peter Kriegel,et al.  Hierarchical density-based clustering of uncertain data , 2005, Fifth IEEE International Conference on Data Mining (ICDM'05).

[23]  Dieter Pfoser,et al.  Capturing the Uncertainty of Moving-Object Representations , 1999, SSD.

[24]  M. Tabakov,et al.  A Fuzzy Clustering Technique for Medical Image Segmentation , 2006, 2006 International Symposium on Evolving Fuzzy Systems.

[25]  Sudha Ram,et al.  Proceedings of the 1997 ACM SIGMOD international conference on Management of data , 1997, ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems.

[26]  Mirco Nanni,et al.  Speeding-Up Hierarchical Agglomerative Clustering in Presence of Expensive Metrics , 2005, PAKDD.

[27]  James C. Bezdek,et al.  Pattern Recognition with Fuzzy Objective Function Algorithms , 1981, Advanced Applications in Pattern Recognition.

[28]  Arbee L. P. Chen,et al.  Proceedings of the Sixth International Conference on Database Systems for Advanced Applications , 1999 .

[29]  Aristides Gionis,et al.  Dimension induced clustering , 2005, KDD '05.

[30]  Reynold Cheng,et al.  Uncertain Data Mining: An Example in Clustering Location Data , 2006, PAKDD.

[31]  J. MacQueen Some methods for classification and analysis of multivariate observations , 1967 .

[32]  Dimitrios Gunopulos,et al.  Automatic subspace clustering of high dimensional data for data mining applications , 1998, SIGMOD '98.

[33]  A. Prasad Sistla,et al.  Updating and Querying Databases that Track Mobile Units , 1999, Distributed and Parallel Databases.

[34]  Christos Faloutsos,et al.  Proceedings of the 1999 ACM SIGMOD international conference on Management of data , 1999, SIGMOD 1999.

[35]  Hector Garcia-Molina,et al.  The Management of Probabilistic Data , 1992, IEEE Trans. Knowl. Data Eng..

[36]  Per K. Enge,et al.  Global positioning system: signals, measurements, and performance [Book Review] , 2002, IEEE Aerospace and Electronic Systems Magazine.

[37]  Pourang Irani,et al.  2008 Eighth IEEE International Conference on Data Mining , 2008 .

[38]  Lakhmi C. Jain,et al.  Fuzzy clustering models and applications , 1997, Studies in Fuzziness and Soft Computing.

[39]  Max J. Egenhofer,et al.  Proceedings of the 4th International Symposium on Advances in Spatial Databases , 1995 .

[40]  Hans-Peter Kriegel,et al.  Boosting spatial pruning: on optimal pruning of MBRs , 2010, SIGMOD Conference.

[41]  H. L. Le Roy,et al.  Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability; Vol. IV , 1969 .

[42]  Edward Hung,et al.  Mining Frequent Itemsets from Uncertain Data , 2007, PAKDD.

[43]  Michael Spann,et al.  A new approach to clustering , 1990, Pattern Recognit..

[44]  Charu C. Aggarwal,et al.  On Density Based Transforms for Uncertain Data Mining , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[45]  Hans-Peter Kriegel,et al.  Density-based clustering of uncertain data , 2005, KDD '05.

[46]  Yufei Tao,et al.  Indexing Multi-Dimensional Uncertain Data with Arbitrary Probability Density Functions , 2005, VLDB.

[47]  Dan Suciu,et al.  Efficient query evaluation on probabilistic databases , 2004, The VLDB Journal.

[48]  Francisco de A. T. de Carvalho,et al.  Clustering of interval data based on city-block distances , 2004, Pattern Recognit. Lett..

[49]  David Wai-Lok Cheung,et al.  Clustering Uncertain Data Using Voronoi Diagrams , 2008, 2008 Eighth IEEE International Conference on Data Mining.