Mining, Indexing and Similarity Search in Large Graph Data Sets

Scalable analytical algorithms and tools for large graph data sets are in great demand across domains ranging from software engineering to computational biology, because the high complexity of graphs makes it very difficult, if not impossible, for people to analyze any reasonably large collection of them manually. In this dissertation, we investigate two long-standing fundamental problems: given a graph data set, what are the hidden structural patterns and how can we find them; and how can we index graphs and perform similarity search in large graph data sets? Graph pattern mining is computationally expensive because subgraph isomorphism is NP-complete. Previous solutions incur unavoidable overhead because they rely on joining two graphs to form larger candidates. We develop a graph canonical labeling system, gSpan, and show both theoretically and empirically that this kind of join operation is unnecessary. Graph indexing, the second problem addressed in this dissertation, may incur an exponential number of index entries if all of the substructures in a graph database are used for indexing. Our solution, gIndex, proposes a novel frequent and discriminative graph mining approach that leads to a compact but effective graph index structure, one that is orders of magnitude smaller in size and an order of magnitude faster in performance than traditional approaches. Besides graph mining and search, this dissertation provides a thorough investigation of pattern summarization, pattern-based classification, constraint pattern mining, and graph similarity searching, all of which could make use of graph patterns. It also explores several critical applications in bioinformatics, computer systems, and software engineering, including gene relevance network analysis for functional annotation and program flow analysis for automated software bug isolation.
The developed concepts, theories, and systems may significantly deepen the understanding of data mining principles in structural pattern discovery, interpretation, and search. The formulation of a general graph information system through this study could provide fundamental support to graph-intensive applications in multiple domains.
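
To make the idea of canonical labeling concrete, the following is a minimal sketch and not the gSpan procedure itself: it assigns each small labeled graph a code that is identical for all isomorphic copies, so that patterns can be grouped and counted by code instead of being compared pairwise. The brute-force minimization over vertex orderings, the function names `make_graph` and `canonical_code`, and the toy graphs are all assumptions introduced for illustration; the approach is only practical for very small graphs.

```python
from collections import Counter
from itertools import permutations


def make_graph(vertex_labels, edge_list):
    """Build a tiny undirected labeled graph (hypothetical helper).

    vertex_labels: list of string labels, indexed by vertex id.
    edge_list: iterable of (u, v, edge_label) triples.
    """
    edges = {frozenset((u, v)): label for u, v, label in edge_list}
    return list(vertex_labels), edges


def canonical_code(vertex_labels, edges):
    """Return the smallest (labels, adjacency) encoding over all vertex orders.

    Isomorphic graphs produce the same code. This is exhaustive (n! orderings)
    and is meant only to illustrate the concept on toy inputs.
    """
    n = len(vertex_labels)
    best = None
    for perm in permutations(range(n)):
        # perm[i] is the original vertex placed at position i.
        labels = tuple(vertex_labels[v] for v in perm)
        # Upper triangle of the relabeled adjacency matrix; "" marks no edge.
        adjacency = tuple(
            edges.get(frozenset((perm[i], perm[j])), "")
            for i in range(n)
            for j in range(i + 1, n)
        )
        code = (labels, adjacency)
        if best is None or code < best:
            best = code
    return best


# Toy data: g1 and g2 are the same structure with different vertex numbering;
# g3 differs in its edges.
g1 = make_graph(["C", "C", "O"], [(0, 1, "s"), (1, 2, "d")])
g2 = make_graph(["O", "C", "C"], [(0, 1, "d"), (1, 2, "s")])
g3 = make_graph(["C", "C", "O"], [(0, 1, "s"), (0, 2, "s")])

# Group graphs by canonical code to count how often each pattern occurs.
support = Counter(canonical_code(*g) for g in (g1, g2, g3))
```

Here `support` maps each distinct pattern to the number of graphs that share it, which is the kind of bookkeeping a pattern miner needs when it must recognize that two candidates are the same graph.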
