Parallel algorithms for mining frequent structural motifs in scientific data

Discovering important substructures in molecules is a significant data mining problem. The basic motivation is that the structure of a molecule plays a role in its biochemical function. There is interest in finding important, often recurrent, substructures both within a single molecule and across a class of molecules. Recently, we developed a general-purpose suite of algorithms, the MotifMiner Toolkit, that can mine for structural motifs in a wide range of biomolecular datasets. While the algorithms have proven extremely useful in identifying novel substructures, they are quite time-consuming. There are two reasons for this: (i) the algorithms inherently suffer from the curse of subgraph isomorphism; and (ii) handling noise effects (e.g., protein structure data) results in a significant slowdown. To address this problem, in this paper we propose parallelization strategies for these algorithms in a cluster environment. We identify key optimizations that handle load imbalance, scheduling, and communication overheads. Results show that the optimizations are quite effective and that we obtain good speedup on moderate-sized clusters.
