Efficient Mining of Discriminative Molecular Fragments

Frequent pattern discovery in structured data is receiving increasing attention in many scientific application areas. However, the computational complexity and the large amount of data to be explored often make sequential algorithms unsuitable. In this context, high-performance distributed computing becomes a very interesting and promising approach. In this paper we present a parallel formulation of the frequent subgraph mining problem to discover interesting patterns in molecular compounds. The application is characterized by a highly irregular tree-structured computation. No estimate of task workloads is available, and the workloads follow a power-law distribution over a wide range. The proposed approach allows dynamic resource aggregation and provides fault and latency tolerance. These features make the distributed application suitable for multi-domain heterogeneous environments, such as computational Grids. The distributed application has been evaluated on the well-known National Cancer Institute's HIV-screening dataset.
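The point about unpredictable, power-law task workloads can be illustrated with a toy simulation. The sketch below is not the paper's algorithm: it assumes independent tasks with Pareto-distributed costs (the shape value 1.2, the function names, and the worker count are all hypothetical choices) and compares a fixed up-front assignment with a scheme in which each task goes to whichever worker becomes idle first.

```python
import heapq
import random


def static_makespan(costs, workers):
    """Assign task i to worker i % workers up front; return the finishing time."""
    loads = [0.0] * workers
    for i, cost in enumerate(costs):
        loads[i % workers] += cost
    return max(loads)


def dynamic_makespan(costs, workers):
    """Hand each task, in order, to whichever worker becomes idle first."""
    idle_at = [0.0] * workers  # min-heap of times at which workers become idle
    heapq.heapify(idle_at)
    for cost in costs:
        start = heapq.heappop(idle_at)
        heapq.heappush(idle_at, start + cost)
    return max(idle_at)


def sample_costs(n, alpha=1.2, seed=0):
    """Draw n heavy-tailed task costs (Pareto with shape alpha; an assumed value)."""
    rng = random.Random(seed)
    return [rng.paretovariate(alpha) for _ in range(n)]


if __name__ == "__main__":
    costs = sample_costs(2000)
    for name, fn in (("static", static_makespan), ("dynamic", dynamic_makespan)):
        print(name, round(fn(costs, 8), 1))
```

With the dynamic scheme the finishing time never exceeds the average load per worker plus the cost of the single largest task, whereas a fixed assignment has no such guarantee when a few tasks dominate the total work. A real tree-structured search would additionally have to split tasks that are discovered at run time, which this sketch does not model.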
