Nearest Neighbor-Based Classification of Uncertain Data

This work addresses the problem of classifying uncertain data. To this end, we introduce the Uncertain Nearest Neighbor (UNN) rule, which generalizes the deterministic nearest neighbor rule to the case in which uncertain objects are available. The UNN rule relies on the concept of nearest neighbor class rather than on that of nearest neighbor object. The nearest neighbor class of a test object is the class that maximizes the probability of providing its nearest neighbor. The evidence shows that the nearest neighbor class is a much more powerful concept than the nearest neighbor object in the presence of uncertainty, in that it correctly captures the semantics of the nearest neighbor decision rule when applied to the uncertain scenario. An effective and efficient algorithm for uncertain nearest neighbor classification of a generic (un)certain test object is designed, based on properties that greatly reduce the time cost of computing nearest neighbor class probabilities. Experimental results show that the UNN rule is effective and efficient in classifying uncertain data.
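
The definition of the nearest neighbor class can be illustrated with a brute-force Monte Carlo estimate. The sketch below is not the algorithm designed in this work, which relies on properties that avoid this kind of exhaustive computation; it only makes the definition concrete. It assumes that each uncertain object can be sampled through a function returning one possible realization, and the names `nearest_neighbor_class` and `gaussian`, as well as the Gaussian uncertainty model, are hypothetical choices made for illustration.

```python
import math
import random
from collections import Counter


def gaussian(center, sigma):
    """Hypothetical uncertain object: an isotropic Gaussian around `center`.

    Returns a sampler that draws one realization of the object.
    With sigma == 0 the object is certain (always equal to `center`).
    """
    def sample(rng):
        return tuple(rng.gauss(c, sigma) for c in center)
    return sample


def nearest_neighbor_class(test_sampler, training, trials=2000, seed=0):
    """Estimate the nearest neighbor class of an uncertain test object.

    `training` is a list of (sampler, label) pairs. In each trial one
    realization is drawn for the test object and for every training
    object, and the class of the closest training realization scores
    one win. The win frequencies estimate, for each class, the
    probability of providing the nearest neighbor.
    """
    rng = random.Random(seed)
    wins = Counter()
    for _ in range(trials):
        q = test_sampler(rng)
        best_label, best_dist = None, math.inf
        for sampler, label in training:
            d = math.dist(q, sampler(rng))
            if d < best_dist:
                best_label, best_dist = label, d
        wins[best_label] += 1
    probs = {label: count / trials for label, count in wins.items()}
    # The nearest neighbor class maximizes the estimated probability.
    return max(probs, key=probs.get), probs


if __name__ == "__main__":
    # One tightly concentrated "red" object and several widely spread
    # "blue" objects; the test object is certain and sits at the origin.
    training = [(gaussian((1.0, 0.0), 0.05), "red")]
    training += [(gaussian((1.5, 0.0), 1.0), "blue") for _ in range(5)]
    label, probs = nearest_neighbor_class(gaussian((0.0, 0.0), 0.0), training)
    print(label, probs)
```

In this sketch the decision depends on the probability mass contributed by all objects of a class together, not on the single object that looks closest on average, which is the distinction between nearest neighbor class and nearest neighbor object drawn above.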
