Discovery of frequent patterns in large data collections

Data mining, or knowledge discovery in databases, aims at finding useful regularities in large data sets. Interest in the field is motivated by the growth of computerized data collections and by the high potential value of patterns discovered in those collections. For instance, bar code readers at supermarkets produce extensive amounts of data about purchases. An analysis of this data can reveal useful information about the shopping behavior of the customers. Association rules, for instance, are a class of patterns that tell which products tend to be purchased together.

The general data mining task we consider is the following: given a class of patterns that possibly have occurrences in a given data collection, determine which patterns occur frequently and are thus probably the most useful ones. It is characteristic for data mining applications to deal with high volumes of both data and patterns.

We address the algorithmic problems of determining efficiently which patterns are frequent in the given data. Our contributions are new algorithms, analyses of problems, and pattern classes for data mining. We also present extensive experimental results. We start by giving an efficient method for the discovery of all frequent association rules, a well-known data mining problem. We then introduce the problem of discovering frequent patterns in general, and show how the association rule algorithm can be extended to cover this problem. We analyze the problem complexity and derive a lower bound for the number of queries in a simple but realistic model. We then show how sampling can be used in the discovery of exact association rules, and we give algorithms that are efficient especially in terms of the amount of database processing. We also show that association rules with negation
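As an illustration of the general task stated in the abstract (given a pattern class, find the patterns that occur frequently), the following is a minimal sketch of a generic level-wise search for frequent item sets over market-basket data. It is not the algorithm of this work; the function names `frequent_itemsets` and `rule_confidence`, the absolute threshold `min_support`, and the toy baskets are all assumptions made for the example.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: count} for every itemset contained in at least
    min_support transactions (min_support is an absolute count here)."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1 candidates: every single item seen in the data.
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while candidates:
        # One pass over the data counts all candidates of the current size.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Next level: join frequent k-sets into (k+1)-sets and keep only
        # those whose every k-subset is itself frequent.
        k += 1
        unions = {a | b for a in level for b in level if len(a | b) == k}
        candidates = [
            c for c in unions
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent


def rule_confidence(frequent, lhs, rhs):
    """Confidence of the rule lhs => rhs, computed from frequent-set counts."""
    lhs, rhs = frozenset(lhs), frozenset(rhs)
    return frequent[lhs | rhs] / frequent[lhs]


# Hypothetical toy data: four supermarket baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "beer"},
    {"bread", "milk", "beer"},
    {"milk"},
]
result = frequent_itemsets(baskets, min_support=2)
```

With these baskets and a threshold of two, the pair {milk, beer} occurs only once, so the triple {bread, milk, beer} is pruned before it is ever counted against the data. That pruning step is what keeps the number of counted candidates small when both the data and the pattern space are large.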
