Ranking Outliers Using Symmetric Neighborhood Relationship

Mining outliers in database is to find exceptional objects that deviate from the rest of the data set. Besides classical outlier analysis algorithms, recent studies have focused on mining local outliers, i.e., the outliers that have density distribution significantly different from their neighborhood. The estimation of density distribution at the location of an object has so far been based on the density distribution of its k-nearest neighbors [2,11]. However, when outliers are in the location where the density distributions in the neighborhood are significantly different, for example, in the case of objects from a sparse cluster close to a denser cluster, this may result in wrong estimation. To avoid this problem, here we propose a simple but effective measure on local outliers based on a symmetric neighborhood relationship. The proposed measure considers both neighbors and reverse neighbors of an object when estimating its density distribution. As a result, outliers so discovered are more meaningful. To compute such local outliers efficiently, several mining algorithms are developed that detects top-n outliers based on our definition. A comprehensive performance evaluation and analysis shows that our methods are not only efficient in the computation but also more effective in ranking outliers.

[1]  Nick Roussopoulos,et al.  Nearest neighbor queries , 1995, SIGMOD '95.

[2]  Hans-Peter Kriegel,et al.  A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise , 1996, KDD.

[3]  Sridhar Ramaswamy,et al.  Efficient algorithms for mining outliers from large data sets , 2000, SIGMOD '00.

[4]  Shashi Shekhar,et al.  Detecting graph-based spatial outliers , 2002, Intell. Data Anal..

[5]  Jiawei Han,et al.  Data Mining: Concepts and Techniques , 2000 .

[6]  Dimitris Papadias,et al.  Aggregate nearest neighbor queries in road networks , 2005, IEEE Transactions on Knowledge and Data Engineering.

[7]  Sudipto Guha,et al.  CURE: an efficient clustering algorithm for large databases , 1998, SIGMOD '98.

[8]  Pasi Fränti,et al.  Outlier Detection Using k-Nearest Neighbour Graph , 2004, ICPR.

[9]  Christos Faloutsos,et al.  Cross-Outlier Detection , 2003, SSTD.

[10]  Ada Wai-Chee Fu,et al.  Enhancements on local outlier detection , 2003, Seventh International Database Engineering and Applications Symposium, 2003. Proceedings..

[11]  Pasi Fränti,et al.  Outlier detection using k-nearest neighbour graph , 2004, ICPR 2004.

[12]  Deepayan Chakrabarti,et al.  AutoPart: Parameter-Free Graph Partitioning and Outlier Detection , 2004, PKDD.

[13]  S. Muthukrishnan,et al.  Mining Deviants in a Time Series Database , 1999, VLDB.

[14]  Raymond T. Ng,et al.  Finding Intensional Knowledge of Distance-Based Outliers , 1999, VLDB.

[15]  Christos Faloutsos,et al.  LOCI: fast outlier detection using the local correlation integral , 2003, Proceedings 19th International Conference on Data Engineering (Cat. No.03CH37405).

[16]  Sudipto Guha,et al.  CURE: an efficient clustering algorithm for large databases , 1998, SIGMOD '98.

[17]  Jian Tang,et al.  On Complementarity of Cluster and Outlier Detection Schemes , 2003, DaWaK.

[18]  Andrew W. Moore,et al.  Rule-based anomaly pattern detection for detecting disease outbreaks , 2002, AAAI/IAAI.

[19]  Anthony K. H. Tung,et al.  Mining top-n local outliers in large databases , 2001, KDD '01.

[20]  Jiawei Han,et al.  Efficient and Effective Clustering Methods for Spatial Data Mining , 1994, VLDB.

[21]  Jeffrey Scott Vitter,et al.  Mining deviants in time series data streams , 2004, Proceedings. 16th International Conference on Scientific and Statistical Database Management, 2004..

[22]  Man Lung Yiu,et al.  Clustering objects on a spatial network , 2004, SIGMOD '04.

[23]  Jian Tang,et al.  Enhancing Effectiveness of Outlier Detections for Low Density Patterns , 2002, PAKDD.

[24]  Philip S. Yu,et al.  Outlier detection for high dimensional data , 2001, SIGMOD '01.

[25]  Raymond T. Ng,et al.  Algorithms for Mining Distance-Based Outliers in Large Datasets , 1998, VLDB.

[26]  Hans-Peter Kriegel,et al.  LOF: identifying density-based local outliers , 2000, SIGMOD '00.

[27]  Shashi Shekhar,et al.  Detecting graph-based spatial outliers: algorithms and applications (a summary of results) , 2001, KDD '01.

[28]  S. Muthukrishnan,et al.  Influence sets based on reverse nearest neighbor queries , 2000, SIGMOD '00.

[29]  Tian Zhang,et al.  BIRCH: an efficient data clustering method for very large databases , 1996, SIGMOD '96.