Indexing and mining topological patterns for drug discovery

The increased availability of large repositories of chemical compounds has created new challenges and opportunities for applying data-mining and indexing techniques to problems in chemical informatics. The primary goal in analyzing molecular databases is to identify structural patterns that can predict biological activity. Two of the most popular representations of molecular topology are graphs and 3D geometries. As a result, the problem of indexing and mining structural patterns maps to indexing and mining patterns in graph and 3D geometric databases. In this tutorial, we first introduce the problem of drug discovery and the critical role that computer science plays in that process. We then introduce the problem of performing subgraph and similarity searches on large graph databases. Because these problems are NP-hard, a number of heuristics have been designed in recent years, and the tutorial presents an overview of those techniques. Next, we introduce the problem of mining frequent subgraph patterns, along with the limitations of frequent patterns that sparked interest in mining statistically significant subgraph patterns. After an in-depth survey of techniques for mining significant subgraph patterns, the tutorial turns to the problem of analyzing the 3D geometric structures of molecules. Finally, we conclude by presenting two open computer science problems whose solution could have a significant impact on the field of drug discovery.
