Outlier Mining Methods Based on Graph Structure Analysis

Outlier detection in high-dimensional datasets is a fundamental and challenging problem across disciplines, and it also has practical implications: removing outliers from the training set improves the performance of machine learning algorithms. While many outlier mining algorithms have been proposed in the literature, they tend to be valid or efficient only for specific types of datasets (time series, images, videos, etc.). Here we propose two methods that can be applied to generic datasets, as long as there is a meaningful measure of distance between pairs of elements of the dataset. Both methods start by defining a graph in which the nodes are the elements of the dataset and each link is weighted by the distance between the two nodes it connects. The first method assigns an outlier score based on the percolation (i.e., the fragmentation) of the graph. The second method uses the popular IsoMap nonlinear dimensionality reduction algorithm and assigns an outlier score by comparing the geodesic distances with the distances in the reduced space. We test these algorithms on real and synthetic datasets and show that they either outperform or perform on par with other popular outlier detection methods. A main advantage of the percolation method is that it is parameter-free and therefore does not require any training; the IsoMap method, on the other hand, has two integer parameters, and when they are appropriately selected, it performs similarly to or better than all the other methods tested.
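
To make the percolation idea concrete, the following is a minimal sketch, not the implementation evaluated here. It assumes one plausible scoring rule that the abstract does not spell out: links of the complete distance graph are removed one at a time, heaviest first, and a node's outlier score is the weight of the link whose removal first leaves that node outside the largest connected component. The names `components` and `percolation_scores` are hypothetical, and the brute-force recomputation of components is only suitable for small datasets.

```python
import math
from itertools import combinations


def components(n, edges):
    """Return a component label for each of n nodes, given undirected edges."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in edges:
        parent[find(i)] = find(j)
    return [find(i) for i in range(n)]


def percolation_scores(points, dist=math.dist):
    """Assumed rule: score = weight at which a node leaves the giant component."""
    n = len(points)
    # Complete weighted graph: one link per pair, weight = distance.
    # Links are sorted so that the heaviest link is removed first.
    links = sorted(
        ((dist(points[i], points[j]), i, j) for i, j in combinations(range(n), 2)),
        reverse=True,
    )
    scores = [0.0] * n
    scored = [False] * n
    for k, (w, _, _) in enumerate(links):
        # Links still present after removing the k + 1 heaviest ones.
        remaining = [(i, j) for _, i, j in links[k + 1:]]
        labels = components(n, remaining)
        giant = max(set(labels), key=labels.count)  # largest component
        for node in range(n):
            if labels[node] != giant and not scored[node]:
                scores[node] = w  # node detached at this weight
                scored[node] = True
    return scores


# Usage: five points near the unit square plus one distant point.
pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5), (10.0, 10.0)]
print(percolation_scores(pts))
```

Under this rule no parameter has to be chosen: the only input besides the data is the distance function `dist`, which can be replaced by any meaningful pairwise distance for the dataset at hand. The one node that is never detached keeps a score of zero.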
