Discovering Anomalies on Mixed-Type Data Using a Generalized Student-t Based Approach

Anomaly detection in mixed-type data is an important problem that has not been well addressed in the machine learning field. Existing approaches focus on computational efficiency, and their modeling of correlations between mixed-type attributes is heuristically driven and lacks a statistical foundation. In this paper, we propose MIxed-Type Robust dEtection (MITRE), a robust error buffering approach for anomaly detection in mixed-type datasets. Because the model is non-Gaussian by design, inference is analytically intractable. Two novel Bayesian inference approaches are used to approximate the intractable inference: Integrated Nested Laplace Approximation (INLA), and Expectation Propagation (EP) with variational Expectation-Maximization (EM). A set of algorithmic optimizations is implemented to improve computational efficiency. A comprehensive suite of experiments was conducted on both synthetic and real-world data to test the effectiveness and efficiency of MITRE.
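The following minimal sketch is not the MITRE model. It only illustrates the general property that makes a Student-t error term useful for robust detection: its heavy tails assign far more probability to a gross error than a Gaussian does, so a single anomalous value has limited pull on the fit. The function names, the degrees of freedom `nu = 3`, the fixed scale `sigma = 1`, and the toy data are all assumptions chosen for illustration.

```python
import math


def gaussian_logpdf(x, mu=0.0, sigma=1.0):
    """Log-density of a Gaussian with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def student_t_logpdf(x, mu=0.0, sigma=1.0, nu=3.0):
    """Log-density of a location-scale Student-t with nu degrees of freedom."""
    z = (x - mu) / sigma
    return (
        math.lgamma((nu + 1.0) / 2.0)
        - math.lgamma(nu / 2.0)
        - 0.5 * math.log(nu * math.pi)
        - math.log(sigma)
        - 0.5 * (nu + 1.0) * math.log1p(z * z / nu)
    )


def robust_location(xs, nu=3.0, sigma=1.0, iters=50):
    """Estimate a location under Student-t noise with fixed scale.

    Uses the standard EM reweighting: each point gets weight
    (nu + 1) / (nu + z^2), so points far from the current estimate
    are down-weighted instead of dominating the average.
    """
    mu = sum(xs) / len(xs)  # start from the ordinary (non-robust) mean
    for _ in range(iters):
        weights = [(nu + 1.0) / (nu + ((x - mu) / sigma) ** 2) for x in xs]
        mu = sum(w * x for w, x in zip(weights, xs)) / sum(weights)
    return mu


if __name__ == "__main__":
    data = [0.1, -0.2, 0.05, 0.0, 0.15, 10.0]  # last value is a planted anomaly
    print("Gaussian log-density at 10: ", gaussian_logpdf(10.0))
    print("Student-t log-density at 10:", student_t_logpdf(10.0))
    print("plain mean:     ", sum(data) / len(data))
    print("robust location:", robust_location(data))
```

On this toy input the plain mean is dragged toward the planted anomaly, while the reweighted estimate stays close to the bulk of the data. The anomaly can then be flagged by its large residual under the robust fit.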
