Outlier Detection on Mixed-Type Data: An Energy-Based Approach

Outlier detection amounts to finding data points that differ significantly from the norm. Classic outlier detection methods are largely designed for a single data type, either continuous or discrete. However, real-world data is increasingly heterogeneous: a single data point can have both discrete and continuous attributes. Handling mixed-type data in a disciplined way remains a great challenge. In this paper, we propose a new unsupervised outlier detection method for mixed-type data based on the Mixed-variate Restricted Boltzmann Machine (Mv.RBM). The Mv.RBM is a principled probabilistic method that models data density. We propose to use the free energy derived from the Mv.RBM as the outlier score, so that outliers are detected as the data points lying in low-density regions. The method is fast to learn and compute, and it scales to massive datasets. At the same time, the outlier score equals the negative log-density of the data up to an additive constant. We evaluate the proposed method on synthetic and real-world datasets and demonstrate that (a) proper handling of mixed types is necessary in outlier detection, and (b) the free energy of the Mv.RBM is a powerful and efficient outlier scoring method that is highly competitive with state-of-the-art methods.
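
To make the scoring idea concrete, the following is a minimal sketch of a free-energy outlier score for an RBM with binary hidden units and two kinds of visible units. It is not the paper's implementation: it assumes unit-variance Gaussian units for continuous attributes and Bernoulli units for binary attributes, omits the other types an Mv.RBM may support, and takes already-learned parameters as inputs. The names `free_energy`, `a_cont`, `a_bin`, `b`, `W_cont`, and `W_bin` are hypothetical. Under these assumptions the free energy is F(v) = sum_i (v_i^2 / 2 - a_i v_i) over continuous units, minus sum_j a_j v_j over binary units, minus sum_k log(1 + exp(b_k + sum_i W_ik v_i)) over hidden units, and a higher F(v) means lower density.

```python
import math


def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def free_energy(x_cont, x_bin, a_cont, a_bin, b, W_cont, W_bin):
    """Free energy of one mixed-type data point (higher = more outlying).

    Assumptions (not taken from the paper): continuous attributes are
    unit-variance Gaussian visible units, binary attributes are Bernoulli
    visible units, and hidden units are binary.
    W_cont[i][k] and W_bin[j][k] connect visible unit i (or j) to hidden unit k.
    """
    # Visible-only terms: quadratic plus bias for Gaussian units,
    # bias only for binary units.
    visible = sum(0.5 * v * v - a * v for v, a in zip(x_cont, a_cont))
    visible -= sum(a * v for v, a in zip(x_bin, a_bin))

    # Hidden units are summed out analytically, one softplus per hidden unit.
    hidden = 0.0
    for k, b_k in enumerate(b):
        z = b_k
        z += sum(W_cont[i][k] * v for i, v in enumerate(x_cont))
        z += sum(W_bin[j][k] * v for j, v in enumerate(x_bin))
        hidden += softplus(z)

    return visible - hidden
```

Because the partition function is never computed, the score costs one pass over the weights per data point; points can then be ranked by score and the highest-scoring ones flagged, with the threshold left as a separate design choice.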
