Identifying and characterising anomalies in data

Increased storage and processing capacity has made the collection and storage of data much easier. Acquiring knowledge from that data, or exploiting it in an application, still requires a whole process, and data mining is one of its cornerstones. In this domain, algorithms are designed to summarise the data in models and to discover unexpected relationships, or patterns, in the data. To regain a grip on the ever-growing amounts of data, these models and patterns need to be both useful and understandable to the data owner. In this dissertation we develop data mining techniques that, starting from the available data and with limited human effort, build models aimed at accurately identifying anomalies (observations deviating from the expected norm) in (new) data and at characterising them in an understandable way. Since in practice we are confronted with a variety of potential applications and different types of data, it is unrealistic to develop a comprehensive approach in which all requirements are simultaneously met, optimised and validated at each step of the process. Throughout this thesis we therefore focus on some aspects of this problem. Moreover, we assume that, for building the model of normal behaviour from the data, we mainly have examples of the expected situation. After a brief general introduction in Chapter 1, we explore two specific real-world problems in Chapters 2 and 3. In Chapter 2, we present an algorithm to detect anomalies during the monitoring of production processes in a chemical plant. In Chapter 3, we take first steps towards the data-driven identification of vandalism in Wikipedia. Identifying anomalies alone is not enough, however: we also want a description that explains why an observation is regarded as unexpected. In Chapter 4, we provide this insight by explicitly using a limited number of patterns that describe the normal expectations and by highlighting the differences with the current observation. Collecting such compact descriptions directly and efficiently from the data is the subject of Chapter 5.
