Patterns that matter

Pattern mining is one of the best-known concepts in Data Mining. A major problem in pattern mining is that enormous numbers of patterns can be mined even from small datasets. This makes it hard for domain experts to discover knowledge using pattern mining, for example in the field of Bioinformatics. In this thesis we address the pattern explosion using compression. We argue that the best pattern set is the set of patterns that compresses the data best. Based on an analysis from an MDL (Minimum Description Length) perspective, we introduce a heuristic algorithm, called Krimp, which approximates the best set of patterns. High compression ratios and good classification scores confirm that Krimp selects patterns that are very characteristic of the data. We then turn to a series of well-known problems in Knowledge Discovery, each of which we unravel with our compression approach. We propose a database dissimilarity measure and show how compression can be used to characterise differences between databases. We present an algorithm that generates synthetic data that is virtually indiscernible from the original data, yet can also be used to preserve privacy. Changes in data streams are detected by using a Krimp compressor to check whether the data distribution has changed. Finally, compression is used to identify the components of a database and to find interesting groups in a database. In each chapter, we provide an extensive experimental evaluation to show that the proposed methods perform well on a large variety of datasets. In the end, we conclude that having fewer, but more characteristic, patterns is key to successful Knowledge Discovery and that compression is very useful in this respect. Not as a goal in itself, but as a means to an end: compression picks the patterns that matter.
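
The idea that the best pattern set is the one that compresses the data best can be illustrated with a small two-part MDL sketch. The code below is not the Krimp algorithm itself: the function names (`cover`, `total_size`), the toy database and the simplified encoding of the model are assumptions made for illustration only. It compares the total encoded size (model plus data) of a code table that holds only single items with one that also holds the pattern {a, b}.

```python
from collections import Counter
from math import log2


def cover(transaction, code_table):
    """Greedily cover a transaction with non-overlapping itemsets,
    trying the itemsets in code table order (illustrative assumption)."""
    remaining = set(transaction)
    used = []
    for itemset in code_table:
        if itemset <= remaining:
            used.append(itemset)
            remaining -= itemset
    return used


def total_size(database, code_table):
    """Two-part description length in bits: model cost plus data cost.
    Code lengths are Shannon-optimal with respect to usage counts."""
    usage = Counter()
    for transaction in database:
        usage.update(cover(transaction, code_table))
    total_usage = sum(usage.values())
    code_len = {s: -log2(u / total_usage) for s, u in usage.items()}

    # Cost of the data given the code table.
    data_bits = sum(usage[s] * code_len[s] for s in usage)

    # Simplified model cost (assumption): each used itemset is spelled out
    # with item codes based on plain item frequencies, plus its own code.
    item_counts = Counter(i for t in database for i in t)
    n_items = sum(item_counts.values())
    item_len = {i: -log2(c / n_items) for i, c in item_counts.items()}
    model_bits = sum(
        sum(item_len[i] for i in s) + code_len[s] for s in usage
    )
    return data_bits + model_bits


# Hypothetical toy database: 'a' and 'b' always occur together.
db = [frozenset("abc")] * 4 + [frozenset("ab")] * 2 + [frozenset("c")] * 2
singletons = [frozenset("a"), frozenset("b"), frozenset("c")]
with_pattern = [frozenset("ab")] + singletons

print(total_size(db, singletons))
print(total_size(db, with_pattern))
```

In this toy setting the code table containing {a, b} yields the smaller total size, so under the compression criterion it would be the preferred pattern set.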
