Feature-based similarity search in graph structures

Similarity search of complex structures is an important operation in graph-related applications since exact matching is often too restrictive. In this article, we investigate the issues of substructure similarity search using indexed features in graph databases. By transforming the edge relaxation ratio of a query graph into the maximum allowed feature misses, our structural filtering algorithm can filter graphs without performing pairwise similarity computation. It is further shown that using either too few or too many features can result in poor filtering performance. Thus the challenge is to design an effective feature set selection strategy that could maximize the filtering capability. We prove that the complexity of optimal feature set selection is Ω(2m) in the worst case, where m is the number of features for selection. In practice, we identify several criteria to build effective feature sets for filtering, and demonstrate that combining features with similar size and selectivity can improve the filtering and search performance significantly within a multifilter composition framework. The proposed feature-based filtering concept can be generalized and applied to searching approximate nonconsecutive sequences, trees, and other structured data as well.

[1]  Nils J. Nilsson,et al.  Principles of Artificial Intelligence , 1980, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2]  M. Padberg Linear Optimization and Extensions , 1995 .

[3]  Horst Bunke,et al.  A New Algorithm for Error-Tolerant Subgraph Isomorphism Detection , 1998, IEEE Trans. Pattern Anal. Mach. Intell..

[4]  Thomas R. Hagadone,et al.  Molecular substructure similarity searching: efficient retrieval in two-dimensional structure databases , 1992, J. Chem. Inf. Comput. Sci..

[5]  Hiroyuki Ogata,et al.  KEGG: Kyoto Encyclopedia of Genes and Genomes , 1999, Nucleic Acids Res..

[6]  Dorit S. Hochba,et al.  Approximation Algorithms for NP-Hard Problems , 1997, SIGA.

[7]  Lawrence B. Holder,et al.  Substucture Discovery in the SUBDUE System , 1994, KDD Workshop.

[8]  Alberto Del Bimbo,et al.  Efficient Matching and Indexing of Graph Models in Content-Based Retrieval , 2001, IEEE Trans. Pattern Anal. Mach. Intell..

[9]  Srinath Srinivasa,et al.  A Platform Based on the Multi-dimensional Data Model for Analysis of Bio-Molecular Structures , 2003, VLDB.

[10]  Euripides G. M. Petrakis,et al.  Similarity Searching in Medical Image Databases , 1997, IEEE Trans. Knowl. Data Eng..

[11]  Philip S. Yu,et al.  Substructure similarity search in graph databases , 2005, SIGMOD '05.

[12]  Luis Gravano,et al.  Using q-grams in a DBMS for Approximate String Processing , 2001, IEEE Data Eng. Bull..

[13]  Gonzalo Navarro,et al.  A guided tour to approximate string matching , 2001, CSUR.

[14]  Kaizhong Zhang,et al.  A System for Approximate Tree Matching , 1994, IEEE Trans. Knowl. Data Eng..

[15]  Esko Ukkonen,et al.  Approximate String Matching with q-grams and Maximal Matches , 1992, Theor. Comput. Sci..

[16]  Horst Bunke,et al.  A graph distance metric based on the maximal common subgraph , 1998, Pattern Recognit. Lett..

[17]  David S. Johnson,et al.  Computers and Intractability: A Guide to the Theory of NP-Completeness , 1978 .

[18]  John M. Barnard,et al.  Chemical Similarity Searching , 1998, J. Chem. Inf. Comput. Sci..

[19]  Dennis Shasha,et al.  GraphGrep: A fast and universal method for querying graphs , 2002, Object recognition supported by user interaction for service robots.

[20]  Philip S. Yu,et al.  Graph indexing: a frequent structure-based approach , 2004, SIGMOD '04.

[21]  T. N. Bhat,et al.  The Protein Data Bank , 2000, Nucleic Acids Res..

[22]  Del BimboAlberto,et al.  Efficient Matching and Indexing of Graph Models in Content-Based Retrieval , 2001 .

[23]  Peter Willett,et al.  RASCAL: Calculation of Graph Similarity using Maximum Common Edge Subgraphs , 2002, Comput. J..

[24]  Hans-Peter Kriegel,et al.  Efficient Similarity Search for Hierarchical Data in Large Databases , 2004, EDBT.

[25]  Julian R. Ullmann,et al.  A Binary n-Gram Technique for Automatic Correction of Substitution, Deletion, Insertion and Reversal Errors in Words , 1977, Comput. J..

[26]  Dennis Shasha,et al.  Algorithmics and applications of tree and graph searching , 2002, PODS.

[27]  R. Seidel,et al.  Lower bounds for fundamental geometric problems , 1996 .

[28]  George Karypis,et al.  Frequent subgraph discovery , 2001, Proceedings 2001 IEEE International Conference on Data Mining.