Outlier detection on uncertain data: Objects, instances, and inferences

This paper studies the problem of outlier detection on uncertain data. We start with a comprehensive model that considers both uncertain objects and their instances. An uncertain object has some inherent attributes and consists of a set of instances, which are modeled by a probability density function. We detect outliers at both the instance level and the object level. Detecting outlier instances first requires knowing what normal instances look like. Assuming that uncertain objects with similar properties tend to have similar instances, we learn the normal instances of each uncertain object from the instances of objects with similar properties. Outlier instances can then be detected by comparing them against the normal ones. Furthermore, we detect outlier objects, that is, objects most of whose instances are outliers. Technically, we use a Bayesian inference algorithm to solve the problem, and we develop an approximation algorithm and a filtering algorithm to speed up the computation. An extensive empirical study on both real and synthetic data verifies the effectiveness and efficiency of our algorithms.
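The two-level idea described above can be illustrated with a minimal sketch. This is not the paper's Bayesian inference algorithm: as a simplifying assumption, it pools the instances of the k objects with the closest attributes, estimates their density with a one-dimensional Gaussian kernel, flags an instance as an outlier when its density falls below a threshold, and flags an object when more than half of its instances are outliers. The function names, the dictionary layout, the toy data, and all parameter values (`k`, `bandwidth`, `density_threshold`, `object_fraction`) are hypothetical.

```python
import math

def kernel_density(x, samples, bandwidth):
    """Average of 1-D Gaussian kernels centred on `samples`, evaluated at x."""
    norm = bandwidth * math.sqrt(2.0 * math.pi)
    total = sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
    return total / (len(samples) * norm)

def similar_objects(target, objects, k):
    """Return the k other objects whose attribute vectors are closest to target's."""
    others = [o for o in objects if o is not target]
    others.sort(key=lambda o: math.dist(o["attrs"], target["attrs"]))
    return others[:k]

def detect_outliers(objects, k=2, bandwidth=1.0,
                    density_threshold=0.01, object_fraction=0.5):
    """Flag outlier instances, then outlier objects (assumed thresholds)."""
    results = {}
    for obj in objects:
        # "Normal" instances are learned from objects with similar attributes.
        pooled = [x for o in similar_objects(obj, objects, k)
                  for x in o["instances"]]
        flags = [kernel_density(x, pooled, bandwidth) < density_threshold
                 for x in obj["instances"]]
        results[obj["name"]] = {
            "outlier_instances": [x for x, f in zip(obj["instances"], flags) if f],
            # An object is an outlier if most of its instances are outliers.
            "is_outlier_object": sum(flags) / len(flags) > object_fraction,
        }
    return results

# Toy data (hypothetical): four objects with similar attributes.
objects = [
    {"name": "a", "attrs": (1.0,),  "instances": [10.0, 10.5, 9.8, 10.2]},
    {"name": "b", "attrs": (1.2,),  "instances": [10.1, 9.9, 10.3, 25.0]},
    {"name": "c", "attrs": (0.9,),  "instances": [9.7, 10.4, 10.0, 10.1]},
    {"name": "d", "attrs": (1.08,), "instances": [30.0, 31.0, 29.5, 10.0]},
]
results = detect_outliers(objects)
```

In this toy run, object `b` has a single outlier instance (25.0) but is not itself an outlier object, while object `d` is flagged at the object level because three of its four instances fall in low-density regions of its neighbours' pooled instances. The fixed threshold and plain neighbour pooling are choices made for brevity only.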
