Extending graph homomorphism and simulation for real life graph matching

Graph matching is a vital problem in a variety of emerging applications: determine whether two graphs are similar and, if so, find all valid matches in one graph for the other, based on specified metrics. Traditional graph matching approaches are mostly based on graph homomorphism and isomorphism, and fall short of capturing both structural and semantic similarity in real-life applications. Moreover, finding all matches with high accuracy over complex graphs is desirable but difficult. Worse still, graph structures in real-life applications are constantly modified. In response to these challenges, this thesis presents a series of approaches for efficiently solving graph matching problems over both static and dynamic real-life graphs.

Firstly, the thesis extends graph homomorphism and subgraph isomorphism, respectively, by mapping edges in one graph to paths in another and by measuring the semantic similarity of nodes. Graph similarity is then measured by metrics based on these extensions. Several optimization problems for graph matching under the new metrics are studied, and approximation algorithms with provable guarantees on match quality are developed.

Secondly, even with the above extensions, graph matching is still defined in terms of functions, which cannot capture more meaningful matches and are usually hard to compute. In response, the thesis proposes a class of graph patterns in which an edge denotes connectivity in a data graph within a predefined number of hops. In addition, the thesis defines graph pattern matching based on a notion of bounded simulation, an extension of graph simulation. With this revision, graph pattern matching can be solved in cubic time, as shown by an algorithm the thesis provides, rather than being intractable.

Thirdly, real-life graphs often bear multiple edge types. In response, the thesis further extends and generalizes the proposed revisions of graph simulation to a more powerful setting: a novel set of reachability queries and graph pattern queries constrained by a subclass of regular path expressions. Several fundamental problems for these queries are studied: containment, equivalence and minimization. The thesis shows, by giving corresponding algorithms, that the enriched reachability queries do not increase the complexity of these problems. Moreover, graph pattern queries can be evaluated in cubic time, and two such algorithms are proposed.

Finally, real-life graphs are frequently updated with small changes. The thesis investigates incremental algorithms for graph pattern matching defined in terms of graph simulation, bounded simulation and subgraph isomorphism. Besides establishing complexity bounds, the thesis provides an experimental study, using real-life and synthetic data, verifying that these incremental algorithms significantly outperform their batch counterparts in response to small changes.
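
To make the notion of a pattern edge that denotes connectivity within a predefined number of hops more concrete, the following is a minimal sketch of a bounded-simulation-style match computed by naive fixpoint refinement. It is not the algorithm from the thesis. The function names (`path_lengths`, `bounded_simulation`), the dictionary-based graph encoding, and the use of plain label equality as the node-matching condition are all assumptions made for illustration.

```python
from collections import deque


def path_lengths(graph, source):
    """Length of the shortest NONEMPTY path from source to each reachable node."""
    dist = {}
    queue = deque((nbr, 1) for nbr in graph.get(source, ()))
    while queue:
        node, d = queue.popleft()
        if node in dist:
            continue  # already reached by a path that is no longer
        dist[node] = d
        for nbr in graph.get(node, ()):
            if nbr not in dist:
                queue.append((nbr, d + 1))
    return dist


def bounded_simulation(pattern_labels, pattern_edges, graph, labels):
    """Return {pattern node: set of matching data nodes}, or {} if no match.

    pattern_labels: {pattern node: label}            (hypothetical encoding)
    pattern_edges:  {(u, u2): hop bound, or None for "any number of hops"}
    graph:          {data node: iterable of successor data nodes}
    labels:         {data node: label}, one entry for every data node
    """
    dist = {v: path_lengths(graph, v) for v in labels}
    # Start from all label-compatible candidates (simplifying assumption).
    sim = {
        u: {v for v, lab in labels.items() if lab == wanted}
        for u, wanted in pattern_labels.items()
    }
    changed = True
    while changed:
        changed = False
        for (u, u2), bound in pattern_edges.items():
            for v in list(sim[u]):
                # v stays only if some candidate of u2 lies within the bound.
                ok = any(
                    v2 in dist[v] and (bound is None or dist[v][v2] <= bound)
                    for v2 in sim[u2]
                )
                if not ok:
                    sim[u].discard(v)
                    changed = True
    # Every pattern node needs at least one match.
    if any(not matches for matches in sim.values()):
        return {}
    return sim


# Tiny example: pattern "an A node reaches a B node within 2 hops".
data_graph = {"a1": ["x"], "x": ["b1"], "a2": ["x2"], "x2": ["y"], "y": ["b2"]}
data_labels = {"a1": "A", "a2": "A", "x": "X", "x2": "X", "y": "X",
               "b1": "B", "b2": "B"}
pattern = {"pa": "A", "pb": "B"}
result = bounded_simulation(pattern, {("pa", "pb"): 2}, data_graph, data_labels)
```

In this example `a1` reaches `b1` in two hops and stays a match for `pa`, while `a2` needs three hops to reach `b2` and is removed. Because the result is a relation rather than a function, a pattern node may map to several data nodes. The sketch precomputes all-pairs distances for simplicity and makes no attempt to meet the cubic-time bound stated in the abstract.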
