Mining interesting subgraphs by output space sampling

Lack of scalability of the mining process and the enormous size of the output set are two significant bottlenecks of Frequent Subgraph Mining (FSM). The first restricts the applicability of FSM to large datasets. The second makes it difficult for the user to analyze the frequent patterns for subsequent usage in typical knowledge discovery tasks, such as classification, clustering, outlier detection, etc. In this thesis, I propose output space sampling (OSS) to alleviate the above two problems. In this paradigm, the objective is to sample frequent patterns instead of complete enumeration. The sampling process automatically performs the interestingness based selection by embedding the interestingness score of the patterns in the desired target distribution. This obviates a two-step mechanism since the sampling automatically prefers the patters that are interesting. Another important point to note is that OSS is a generic method that applies for any kind of patterns such as a set, a sequence, a tree and of-course a graph. OSS is based on Markov Chain Monte Carlo (MCMC) sampling. It performs a random walk on the candidate subgraph partial order and returns subgraph samples when the walk converges to a desired stationary distribution. The transition probability matrix of the random walk is computed locally to avoid a complete enumeration of the candidate frequent patterns, which makes the sampling paradigm scalable to large real life graph datasets. Output space sampling is an entire paradigm shift in frequent pattern mining (FPM) that holds enormous promise. While traditional FPM strives for completeness, OSS targets to obtain a few interesting samples. The definition of interestingness can be very generic, so user can sample patterns from different target distributions by choosing different interestingness functions. This is very beneficial as mined patterns are subject to subsequent use in various knowledge discovery tasks. Output space sampling is a general idea that has various applications. In this thesis, we utilize this general idea to solve specific problems. We consider two different problems: (1) frequent pattern summarization (2) sampling discriminatory patterns. For both the above problems, we assume that the direct mining task is infeasible, so that the user wants to adopt sampling to find few interesting patterns in a reasonable amount of time. We also use the concept of OSS to find representative patterns that are very different from each other. OSS naturally supports this requirement as it obtains random samples which are very different from each other. In this thesis, two different algorithms are proposed for representative pattern mining, which are introduced in the next two paragraphs. The first algorithm for the representative pattern mining that is proposed in this thesis is called MUSK. It is based on a uniform sampling of the output space. It obtains representative patterns by sampling uniformly from the pool of all frequent maximal patterns; uniformity is achieved by a variant of Markov Chain Monte Carlo (MCMC) algorithm. MUSK follows the concept of OSS by sampling from a target distribution where the maximal patterns have uniform value for the interestingness score. The second algorithm that we propose for this task is ORIGAMI . It defines the representative pattern-set (R) in a novel manner that attempts to reduce structural similarities among patterns in R while extending the coverage of frequent pattern space as much as possible. Intuitively, two patterns are α-orthogonal if their similarity is bounded above by α. Each α-orthogonal pattern is also a representative for those patterns that are at least β similar to it. Given user defined α, β ∈ [0, 1], the goal of O RIGAMI is to mine an a-orthogonal, β-representative set that minimizes the set of unrepresented patterns. Similar to OSS paradigm, O RIGAMI uses a randomized algorithm to randomly traverse the pattern space, seeking previously unexplored regions, to return a set of maximal patterns. But, uncharacteristic to OSS, ORIGAMI employs a second-step to extract an α-orthogonal, β-representative set from the mined maximal patterns using a local optimal algorithm. The second step is essential to provide the α-orthogonal, β-representative guarantee. (Abstract shortened by UMI.)

[1]  Wei Wang,et al.  Mining protein family specific residue packing patterns from protein structure graphs , 2004, RECOMB.

[2]  Yi Pan,et al.  Mining protein sequence motifs representing common 3D structures , 2005, 2005 IEEE Computational Systems Bioinformatics Conference - Workshops (CSBW'05).

[3]  Mark Jerrum,et al.  The Markov chain Monte Carlo method: an approach to approximate counting and integration , 1996 .

[4]  Lawrence B. Holder,et al.  Substructure Discovery Using Minimum Description Length and Background Knowledge , 1993, J. Artif. Intell. Res..

[5]  Hannu Toivonen,et al.  Sampling Large Databases for Association Rules , 1996, VLDB.

[6]  Dimitrios Gunopulos,et al.  Discovering All Most Specific Sentences by Randomized Algorithms , 1997, ICDT.

[7]  Shirish Tatikonda,et al.  Toward terabyte pattern mining: an architecture-conscious solution , 2007, PPoPP.

[8]  George Karypis,et al.  Frequent Substructure-Based Approaches for Classifying Chemical Compounds , 2005, IEEE Trans. Knowl. Data Eng..

[9]  George Karypis,et al.  Frequent subgraph discovery , 2001, Proceedings 2001 IEEE International Conference on Data Mining.

[10]  Philip S. Yu,et al.  Mining Asynchronous Periodic Patterns in Time Series Data , 2003, IEEE Trans. Knowl. Data Eng..

[11]  Kunle Olukotun,et al.  Map-Reduce for Machine Learning on Multicore , 2006, NIPS.

[12]  Bin Chen,et al.  A new two-phase sampling based algorithm for discovering association rules , 2002, KDD.

[13]  Rajeev Motwani,et al.  Randomized Algorithms , 1995, SIGA.

[14]  Hans-Peter Kriegel,et al.  Metropolis Algorithms for Representative Subgraph Sampling , 2008, 2008 Eighth IEEE International Conference on Data Mining.

[15]  Mohammad Al Hasan,et al.  An integrated, generic approach to pattern mining: data mining template library , 2008, Data Mining and Knowledge Discovery.

[16]  Philip S. Yu,et al.  Mining significant graph patterns by leap search , 2008, SIGMOD Conference.

[17]  Wei Wang,et al.  Efficient mining of frequent subgraphs in the presence of isomorphism , 2003, Third IEEE International Conference on Data Mining.

[18]  Hannu Toivonen,et al.  Finding Frequent Substructures in Chemical Compounds , 1998, KDD.

[19]  Nicolas Pasquier,et al.  Discovering Frequent Closed Itemsets for Association Rules , 1999, ICDT.

[20]  Nello Cristianini,et al.  Kernel Methods for Pattern Analysis , 2004 .

[21]  Yun Chi,et al.  Indexing and mining free trees , 2003, Third IEEE International Conference on Data Mining.

[22]  Aristides Gionis,et al.  Mining Large Networks with Subgraph Counting , 2008, 2008 Eighth IEEE International Conference on Data Mining.

[23]  Jiawei Han,et al.  Mining Compressed Frequent-Pattern Sets , 2005, VLDB.

[24]  Jian Pei,et al.  CLOSET+: searching for the best strategies for mining frequent closed itemsets , 2003, KDD '03.

[25]  J. Hammersley SIMULATION AND THE MONTE CARLO METHOD , 1982 .

[26]  S. V. N. Vishwanathan,et al.  Fast Computation of Graph Kernels , 2006, NIPS.

[27]  Srinivasan Parthasarathy,et al.  Summarizing itemset patterns using probabilistic models , 2006, KDD '06.

[28]  Aristides Gionis,et al.  Approximating a collection of frequent sets , 2004, KDD.

[29]  Roberto J. Bayardo,et al.  Efficiently mining long patterns from databases , 1998, SIGMOD '98.

[30]  Yun Chi,et al.  Mining Closed and Maximal Frequent Subtrees from Databases of Labeled Rooted Trees , 2005, IEEE Trans. Knowl. Data Eng..

[31]  Rajeev Motwani,et al.  Dynamic itemset counting and implication rules for market basket data , 1997, SIGMOD '97.

[32]  Jianyong Wang,et al.  HARMONY: Efficiently Mining the Best Rules for Classification , 2005, SDM.

[33]  B. Yener,et al.  Cell-Graph Mining for Breast Tissue Modeling and Classification , 2007, 2007 29th Annual International Conference of the IEEE Engineering in Medicine and Biology Society.

[34]  Wei Li,et al.  New parallel algorithms for fast discovery of associ-ation rules , 1997 .

[35]  Jiawei Han,et al.  On compressing frequent patterns , 2007, Data Knowl. Eng..

[36]  Mohammed J. Zaki Efficiently mining frequent trees in a forest: algorithms and applications , 2005, IEEE Transactions on Knowledge and Data Engineering.

[37]  Mohammad Al Hasan,et al.  MUSK: Uniform Sampling of k Maximal Patterns , 2009, SDM.

[38]  Shinichi Morishita,et al.  Transversing itemset lattices with statistical metric pruning , 2000, PODS '00.

[39]  Horst Bunke,et al.  A graph distance metric based on the maximal common subgraph , 1998, Pattern Recognit. Lett..

[40]  Joost N. Kok,et al.  A quickstart in frequent structure mining can make a difference , 2004, KDD.

[41]  Jaideep Srivastava,et al.  Selecting the right interestingness measure for association patterns , 2002, KDD.

[42]  Christos Faloutsos,et al.  Cut-and-stitch: efficient parallel learning of linear dynamical systems on smps , 2008, KDD.

[43]  Peter J. Cameron,et al.  Spectral graph theory , 2004 .

[44]  Srinivasan Parthasarathy,et al.  Out-of-core frequent pattern mining on a commodity PC , 2006, KDD '06.

[45]  Heikki Mannila,et al.  Discovering Generalized Episodes Using Minimal Occurrences , 1996, KDD.

[46]  Xiaoming Jin,et al.  Indexing and Mining of the Local Patterns in Sequence Database , 2002, IDEAL.

[47]  Ramakrishnan Srikant,et al.  Mining sequential patterns , 1995, Proceedings of the Eleventh International Conference on Data Engineering.

[48]  Ramakrishnan Srikant,et al.  Mining Sequential Patterns: Generalizations and Performance Improvements , 1996, EDBT.

[49]  Ramakrishnan Srikant,et al.  Fast Algorithms for Mining Association Rules in Large Databases , 1994, VLDB.

[50]  Mohammad Al Hasan,et al.  Output Space Sampling for Graph Patterns , 2009, Proc. VLDB Endow..

[51]  Alexandre Termier,et al.  Dryade: a new approach for discovering closed frequent trees in heterogeneous tree databases , 2004, Fourth IEEE International Conference on Data Mining (ICDM'04).

[52]  Jong-Seok. Kim Mining association rules using formal concept analysis. , 2002 .

[53]  Yun Chi,et al.  HybridTreeMiner: an efficient algorithm for mining frequent rooted trees and free trees using canonical forms , 2004, Proceedings. 16th International Conference on Scientific and Statistical Database Management, 2004..

[54]  Heikki Mannila,et al.  Discovering Frequent Episodes in Sequences , 1995, KDD.

[55]  Heikki Mannila,et al.  Fast Discovery of Association Rules , 1996, Advances in Knowledge Discovery and Data Mining.

[56]  Yogish Sabharwal,et al.  Analysis of sampling techniques for association rule mining , 2009, ICDT '09.

[57]  Alistair Sinclair,et al.  Algorithms for Random Generation and Counting: A Markov Chain Approach , 1993, Progress in Theoretical Computer Science.

[58]  Mohammed J. Zaki,et al.  SPADE: An Efficient Algorithm for Mining Frequent Sequences , 2004, Machine Learning.

[59]  Dimitrios Gunopulos,et al.  Data mining, hypergraph transversals, and machine learning (extended abstract) , 1997, PODS.

[60]  Guizhen Yang,et al.  The complexity of mining maximal frequent itemsets and maximal frequent patterns , 2004, KDD.

[61]  Takashi Washio,et al.  Complete Mining of Frequent Patterns from Graphs: Mining Graph Data , 2003, Machine Learning.

[62]  Mohammad Al Hasan,et al.  ORIGAMI: Mining Representative Orthogonal Graph Patterns , 2007, Seventh IEEE International Conference on Data Mining (ICDM 2007).

[63]  Philip S. Yu,et al.  Near-optimal Supervised Feature Selection among Frequent Subgraphs , 2009, SDM.

[64]  Alexandre Termier,et al.  TreeFinder: a first step towards XML data mining , 2002, 2002 IEEE International Conference on Data Mining, 2002. Proceedings..

[65]  Zhigang Li,et al.  Efficient data mining for maximal frequent subtrees , 2003, Third IEEE International Conference on Data Mining.

[66]  Mohammed J. Zaki Generating non-redundant association rules , 2000, KDD '00.

[67]  Jiong Yang,et al.  SPIN: mining maximal frequent subgraphs from graph databases , 2004, KDD.

[68]  Philip S. Yu,et al.  Direct mining of discriminative and essential frequent patterns via model-based search tree , 2008, KDD.

[69]  Julian R. Ullmann,et al.  An Algorithm for Subgraph Isomorphism , 1976, J. ACM.

[70]  Jiawei Han,et al.  Extracting redundancy-aware top-k patterns , 2006, KDD '06.

[71]  Jiawei Han,et al.  Summarizing itemset patterns: a profile-based approach , 2005, KDD '05.

[72]  Takashi Washio,et al.  An Apriori-Based Algorithm for Mining Frequent Substructures from Graph Data , 2000, PKDD.

[73]  Philip S. Yu,et al.  Graph indexing: a frequent structure-based approach , 2004, SIGMOD '04.

[74]  Szymon Jaroszewicz,et al.  Interestingness of frequent itemsets using Bayesian networks as background knowledge , 2004, KDD.

[75]  Sen Zhang,et al.  Unordered tree mining with applications to phylogeny , 2004, Proceedings. 20th International Conference on Data Engineering.

[76]  Mohammed J. Zaki Sequence mining in categorical domains: incorporating constraints , 2000, CIKM '00.

[77]  Dirk P. Kroese,et al.  Simulation and the Monte Carlo method , 1981, Wiley series in probability and mathematical statistics.

[78]  Jian Pei,et al.  Mining frequent patterns without candidate generation , 2000, SIGMOD '00.

[79]  Chen Wang,et al.  Efficient Pattern-Growth Methods for Frequent Tree Pattern Mining , 2004, PAKDD.

[80]  J. Gentle Simulation and the Monte Carlo Method, 2nd edition by RUBINSTEIN, R. Y. and KROESE, D. P. , 2008 .

[81]  Mohammad Al Hasan,et al.  DMTL : A Generic Data Mining Template Library , 2005 .

[82]  Yang Xiang,et al.  Effective and efficient itemset pattern summarization: regression-based approaches , 2008, KDD.

[83]  Jiawei Han,et al.  Mining top-k frequent closed patterns without minimum support , 2002, 2002 IEEE International Conference on Data Mining, 2002. Proceedings..

[84]  Jiawei Han,et al.  BIDE: efficient mining of frequent closed sequences , 2004, Proceedings. 20th International Conference on Data Engineering.

[85]  Joost N. Kok,et al.  Efficient discovery of frequent unordered trees , 2003 .

[86]  Mohammad Al Hasan,et al.  ORIGAMI: A Novel and Effective Approach for Mining Representative Orthogonal Graph Patterns , 2008 .

[87]  Lisa Singh,et al.  Pruning social networks using structural properties and descriptive attributes , 2005, Fifth IEEE International Conference on Data Mining (ICDM'05).

[88]  Shamkant B. Navathe,et al.  An Efficient Algorithm for Mining Association Rules in Large Databases , 1995, VLDB.

[89]  Hiroki Arimura,et al.  Efficient Substructure Discovery from Large Semi-Structured Data , 2001, IEICE Trans. Inf. Syst..

[90]  Xindong Wu,et al.  Computing the minimum-support for mining frequent patterns , 2008, Knowledge and Information Systems.

[91]  Mohammed J. Zaki,et al.  Fast vertical mining using diffsets , 2003, KDD '03.

[92]  Heikki Mannila,et al.  Verkamo: Fast Discovery of Association Rules , 1996, KDD 1996.

[93]  Toon Calders,et al.  Non-derivable itemset mining , 2007, Data Mining and Knowledge Discovery.

[94]  Soon Myoung Chung,et al.  A scalable algorithm for mining maximal frequent sequences using a sample , 2008, Knowledge and Information Systems.

[95]  Jiawei Han,et al.  CloseGraph: mining closed frequent graph patterns , 2003, KDD '03.

[96]  Kyuseok Shim,et al.  SPIRIT: Sequential Pattern Mining with Regular Expression Constraints , 1999, VLDB.

[97]  Bart Goethals,et al.  Advances in frequent itemset mining implementations: report on FIMI'03 , 2004, SKDD.

[98]  Jiawei Han,et al.  Frequent Closed Sequence Mining without Candidate Maintenance , 2007, IEEE Transactions on Knowledge and Data Engineering.

[99]  Jiawei Han,et al.  gSpan: graph-based substructure pattern mining , 2002, 2002 IEEE International Conference on Data Mining, 2002. Proceedings..

[100]  Philip S. Yu,et al.  Mining Colossal Frequent Patterns by Core Pattern Fusion , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[101]  Johannes Gehrke,et al.  MAFIA: a maximal frequent itemset algorithm for transactional databases , 2001, Proceedings 17th International Conference on Data Engineering.

[102]  Luc De Raedt,et al.  Molecular feature mining in HIV data , 2001, KDD '01.

[103]  Kamalakar Karlapalem,et al.  MARGIN: Maximal Frequent Subgraph Mining , 2006, Sixth International Conference on Data Mining (ICDM'06).

[104]  Wynne Hsu,et al.  Integrating Classification and Association Rule Mining , 1998, KDD.

[105]  Gemma C. Garriga,et al.  On Horn Axiomatizations for Sequential Data , 2005, ICDT.

[106]  Mohammed J. Zaki,et al.  GenMax: An Efficient Algorithm for Mining Maximal Frequent Itemsets , 2005, Data Mining and Knowledge Discovery.

[107]  Henrik Grosskreutz,et al.  A Randomized Approach for Approximating the Number of Frequent Sets , 2008, 2008 Eighth IEEE International Conference on Data Mining.

[108]  Qiming Chen,et al.  PrefixSpan,: mining sequential patterns efficiently by prefix-projected pattern growth , 2001, Proceedings 17th International Conference on Data Engineering.

[109]  Mohammed J. Zaki,et al.  Efficient algorithms for mining closed itemsets and their lattice structure , 2005, IEEE Transactions on Knowledge and Data Engineering.

[110]  Luc De Raedt,et al.  Don't Be Afraid of Simpler Patterns , 2006, PKDD.