Novel Clustering-Based Approach for Local Outlier Detection

With the rapid expansion of data scale, big data mining and analysis have attracted increasing attention. Outlier detection, an important task in data mining, is widely used in many applications. However, conventional outlier detection methods have difficulty handling large-scale datasets. In addition, most of them can identify only global outliers and are overly sensitive to parameter variation. In this paper, we propose a novel method for robust local outlier detection with statistical parameters, which incorporates clustering-based ideas for dealing with big data. First, the method finds density peaks of the dataset using the 3σ criterion. Second, each remaining data object is assigned to the same cluster as its nearest neighbor of higher density. Finally, we use Chebyshev's inequality and density peak reachability to identify the local outliers of each cluster. The experimental results demonstrate the efficiency and accuracy of the proposed method in identifying both global and local outliers. Moreover, the method is also shown to be more stable than typical outlier detection methods such as LOF (Local Outlier Factor) and DBSCAN (Density-Based Spatial Clustering of Applications with Noise).
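The three steps can be illustrated with a minimal Python sketch. It is not the paper's implementation, and several details are assumptions made only for illustration: the Gaussian-kernel density with a hypothetical cutoff `d_c`, the peak criterion (a point is a peak when its distance to the nearest denser point exceeds the mean plus three standard deviations of all such distances), and the outlier statistic (distance to the cluster's peak, cut at `k` standard deviations above the cluster mean). The density peak reachability test is not specified in the abstract and is omitted here.

```python
import math
import statistics


def local_outliers(points, d_c, k=3.0):
    """Illustrative sketch: 3-sigma density peaks, nearest-higher-density
    assignment, then a Chebyshev-style cut inside each cluster."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # Local density: Gaussian-kernel sum over all other points (assumed form).
    rho = [sum(math.exp(-(dist[i][j] / d_c) ** 2) for j in range(n) if j != i)
           for i in range(n)]

    # delta[i]: distance to the nearest point of higher density,
    # parent[i]: index of that point. The densest point has no parent
    # and receives its largest distance to any other point.
    order = sorted(range(n), key=lambda i: -rho[i])
    delta = [0.0] * n
    parent = [-1] * n
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = max(dist[i])
            continue
        j = min(order[:rank], key=lambda h: dist[i][h])
        delta[i], parent[i] = dist[i][j], j

    # Step 1 (assumed criterion): peaks are points whose delta exceeds
    # mean + 3 * std of all delta values; the densest point is always a peak.
    cut = statistics.mean(delta) + 3 * statistics.pstdev(delta)
    peaks = [i for i in range(n) if delta[i] > cut or parent[i] == -1]

    # Step 2: in decreasing-density order, each remaining point inherits
    # the label of its nearest neighbor of higher density.
    label = [-1] * n
    for c, i in enumerate(peaks):
        label[i] = c
    for i in order:
        if label[i] == -1:
            label[i] = label[parent[i]]

    # Step 3 (assumed statistic): within each cluster, flag points whose
    # distance to the cluster peak is more than k std above the mean.
    # By Chebyshev's inequality at most 1 / k**2 of a cluster can be flagged.
    outliers = []
    for c, peak in enumerate(peaks):
        members = [i for i in range(n) if label[i] == c]
        d = [dist[i][peak] for i in members]
        mu, sigma = statistics.mean(d), statistics.pstdev(d)
        outliers += [i for i, x in zip(members, d) if x > mu + k * sigma]
    return label, sorted(outliers)
```

Calling `local_outliers(points, d_c=0.5)` on a list of coordinate tuples returns one cluster label per point and the sorted indices of the flagged points. Because the 3σ criterion compares each point against the spread of all distances, a peak can only stand out when the dataset has enough points, so this sketch is not meaningful on very small inputs.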
