Independent informative subgraph mining for graph information retrieval

To enable scalable querying of graph databases, intelligent selection of the subgraphs to index is essential. A better index can significantly reduce response times for graph queries. For a given subgraph query, candidate graphs that may contain the subgraph are retrieved using the graph index, and subgraph isomorphism tests are then performed to prune the candidates that do not actually contain it. However, since the space of all possible subgraphs of the whole set of graphs is prohibitively large, feature selection is required to identify a good subset of subgraph features for indexing. Thus, one of the key issues is: given the set of all possible subgraphs of the graph set, which subset of features is optimal, in the sense that the algorithm retrieves the smallest set of candidate graphs and so reduces the number of subgraph isomorphism tests? We introduce a graph search method for subgraph queries based on subgraph frequencies. We then propose several novel feature selection criteria based on mutual information: Max-Precision, Max-Irredundant-Information, and Max-Information-Min-Redundancy. Finally, we show theoretically and empirically that our proposed methods retrieve a smaller candidate set than previous methods. For example, using the same number of features, our method improves the precision of the query candidate set by 4%-13% in comparison to previous methods. As a result, the response time of subgraph queries is improved correspondingly.
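
The filtering step and the precision measure mentioned above can be illustrated with a minimal sketch. It assumes that the occurrences of each indexed subgraph feature have already been counted for every graph and for the query; the function names, the toy index, and the feature labels are hypothetical and are not taken from the paper. A graph is kept as a candidate only if it contains every indexed query feature at least as often as the query does, and the subgraph isomorphism verification that would follow is omitted.

```python
from collections import Counter


def candidate_graphs(query_counts, index):
    """Return ids of graphs that pass the frequency filter.

    A graph can contain the query only if, for every indexed feature
    of the query, it has at least as many occurrences as the query.
    """
    return {
        gid
        for gid, counts in index.items()
        if all(counts.get(f, 0) >= n for f, n in query_counts.items())
    }


def precision(candidates, answers):
    """Fraction of candidates that truly contain the query.

    Convention assumed here: an empty candidate set has precision 1.0,
    since no isomorphism test is wasted.
    """
    if not candidates:
        return 1.0
    return len(candidates & answers) / len(candidates)


# Toy index (hypothetical): graph id -> occurrences of each indexed feature.
index = {
    "g1": Counter({"C-C": 3, "C-O": 1}),
    "g2": Counter({"C-C": 1, "C-O": 2}),
    "g3": Counter({"C-C": 2}),
}

# Query containing the feature "C-C" twice.
query = Counter({"C-C": 2})

cands = candidate_graphs(query, index)
# Suppose verification would confirm only g1 as a true answer.
print(sorted(cands), precision(cands, {"g1"}))
```

In this toy case g2 is filtered out without any isomorphism test, while g3 survives the filter and would be rejected only at verification. A more discriminative feature set would raise precision by removing such false candidates earlier.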
