Anomaly detection through quasi-functional dependency analysis

Anomaly detection has been investigated in several research areas, such as databases, machine learning, knowledge discovery, and logic programming. The main goal is to identify objects of a given population whose behavior is anomalous with respect to a set of commonly accepted rules that are part of the knowledge base. In this paper we focus on anomaly detection in databases. We propose a method, based on data mining algorithms, that infers the "normal behavior" of objects by extracting frequent "rules" from a given dataset. These rules are expressed as quasi-functional dependencies and are mined from the dataset by means of association rules. Anomalies can then be analyzed with respect to the inferred dependencies: given a quasi-functional dependency, the related anomalies can be discovered by querying either the original database or the previously stored association rules. By further investigating the nature of such anomalies, we can either detect erroneous data or highlight novel information that represents significant exceptions to frequent rules. Our method is independent of the database under consideration and infers rules directly from the data. The applicability of the proposed approach is validated through a set of experiments on XML databases, whose results are reported here.
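The idea can be illustrated with a minimal Python sketch. It is not the method of the paper, only an illustration under stated assumptions: records are flat dictionaries rather than XML, the dependency degree is taken to be the fraction of records that agree with the most frequent right-hand value for their left-hand value, and the names `quasi_fd_anomalies`, `min_degree`, and the sample attributes are hypothetical.

```python
from collections import Counter, defaultdict


def quasi_fd_anomalies(rows, lhs, rhs, min_degree=0.8):
    """Return (degree, anomalies) for the candidate dependency lhs -> rhs.

    The degree used here is an assumed, simplified measure, not the
    definition from the paper.
    """
    # Count how often each rhs value co-occurs with each lhs value.
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[lhs]][row[rhs]] += 1

    # For each lhs value, the most frequent rhs value plays the role of a
    # high-confidence association rule (lhs = x) -> (rhs = y).
    dominant = {x: c.most_common(1)[0][0] for x, c in counts.items()}

    # Fraction of rows that agree with the dominant rule for their lhs value.
    agreeing = sum(c[dominant[x]] for x, c in counts.items())
    degree = agreeing / len(rows) if rows else 0.0

    # Report anomalies only when the dependency almost holds.
    if degree < min_degree:
        return degree, []
    anomalies = [row for row in rows if row[rhs] != dominant[row[lhs]]]
    return degree, anomalies


# Hypothetical sample data: city almost determines country.
rows = (
    [{"city": "Turin", "country": "Italy"}] * 4
    + [{"city": "Turin", "country": "France"}]
    + [{"city": "Paris", "country": "France"}] * 5
)
degree, anomalies = quasi_fd_anomalies(rows, "city", "country")
```

On this sample, nine of the ten records follow the dominant mapping, so the dependency city -> country almost holds, and the single record pairing Turin with France is returned as an anomaly. Whether such a record is an error or a meaningful exception still has to be decided by inspecting it.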
