AdaUK-Means: An Ensemble Boosting Clustering Algorithm on Uncertain Objects

This paper considers the problem of clustering uncertain objects whose locations are described by probability density functions (pdfs). Although K-means has been extended to UK-means to handle uncertain data, most existing works focus only on improving the efficiency of UK-means and rarely consider its clustering quality. Existing works also assume that all objects carry the same weight. However, objects that are far from their cluster representatives should not be weighted the same as objects that are close to them. We therefore propose AdaUK-means, which clusters uncertain objects while taking object weights into account. In AdaUK-means, the weights of objects are adjusted with AdaBoost based on the correlation between objects: if two objects form a must-link pair but are grouped into different clusters, their weights are increased. In our ensemble model, AdaUK-means is run several times, and the objects are then assigned to clusters by a voting process. Finally, extensive experiments on both synthetic and real data sets show that AdaUK-means performs better than UK-means.
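The two steps the abstract describes, raising the weights of violated must-link pairs and voting across runs, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `boost_weights` and `vote`, the fixed multiplicative `factor` (standing in for an AdaBoost-style update), and the assumption that cluster ids are already aligned across runs are all hypothetical choices made for illustration.

```python
from collections import Counter


def boost_weights(weights, labels, must_link, factor=2.0):
    """Raise the weights of objects in violated must-link pairs, then renormalize.

    `factor` is an assumed constant; the abstract does not give the update rule.
    """
    boosted = list(weights)
    for i, j in must_link:
        if labels[i] != labels[j]:  # pair should share a cluster but does not
            boosted[i] *= factor
            boosted[j] *= factor
    total = sum(boosted)
    return [w / total for w in boosted]


def vote(label_runs):
    """Majority vote per object across runs.

    Assumes cluster ids mean the same thing in every run (labels pre-aligned).
    """
    return [Counter(obj_labels).most_common(1)[0][0]
            for obj_labels in zip(*label_runs)]


# Four objects with uniform weights; objects 1 and 2 are must-link but split.
weights = boost_weights([0.25] * 4, labels=[0, 0, 1, 1], must_link=[(1, 2)])

# Three clustering runs over the same four objects, combined by voting.
final_labels = vote([[0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 1, 0]])
```

In this toy run, the two objects of the violated pair end up with twice the weight of the others, so a subsequent weighted clustering round would give them more influence on the cluster representatives.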
