Dynamic Load Balancing for the Distributed Mining of Molecular Structures

In molecular biology, it is often desirable to find common properties in large numbers of drug candidates. One family of methods stems from the data mining community, where algorithms to find frequent graphs have received increasing attention over the past years. However, the computational complexity of the underlying problem and the large amount of data to be explored essentially render sequential algorithms useless. In this paper, we present a distributed approach to the frequent subgraph mining problem to discover interesting patterns in molecular compounds. This problem is characterized by a highly irregular search tree, whereby no reliable workload prediction is available. We describe the three main aspects of the proposed distributed algorithm, namely, a dynamic partitioning of the search space, a distribution process based on a peer-to-peer communication framework, and a novel receiver-initiated load balancing algorithm. The effectiveness of the distributed method has been evaluated on the well-known National Cancer Institute's HIV-screening data set, where we were able to show close-to linear speedup in a network of workstations. The proposed approach also allows for dynamic resource aggregation in a nondedicated computational environment. These features make it suitable for large-scale, multidomain, heterogeneous environments, such as computational grids

[1]  Srinivasan Parthasarathy,et al.  New Algorithms for Fast Discovery of Association Rules , 1997, KDD.

[2]  Giuseppe Di Fatta,et al.  Efficient Mining of Discriminative Molecular Fragments , 2005, IASTED PDCS.

[3]  Richard M. Karp,et al.  A randomized parallel branch-and-bound procedure , 1988, STOC '88.

[4]  George Karypis,et al.  Frequent Substructure-Based Approaches for Classifying Chemical Compounds , 2005, IEEE Trans. Knowl. Data Eng..

[5]  Vipin Kumar,et al.  Scalable Load Balancing Techniques for Parallel Computers , 1994, J. Parallel Distributed Comput..

[6]  Eugene M. Luks,et al.  Isomorphism of graphs of bounded valence can be tested in polynomial time , 1980, 21st Annual Symposium on Foundations of Computer Science (sfcs 1980).

[7]  Raj Jain,et al.  A Quantitative Measure Of Fairness And Discrimination For Resource Allocation In Shared Computer Systems , 1998, ArXiv.

[8]  George Karypis,et al.  Automated Approaches for Classifying Structures , 2002, BIOKDD.

[9]  Udi Manber,et al.  DIB—a distributed implementation of backtracking , 1987, TOPL.

[10]  Rizos Sakellariou,et al.  Compile-time minimisation of load imbalance in loop nests , 1997, ICS '97.

[11]  Yongwha Chung,et al.  An Asynchronous Algorithm for Balancing Unpredictable Workload on Distributed-Memory Machines , 1998 .

[12]  Jiawei Han,et al.  gSpan: graph-based substructure pattern mining , 2002, 2002 IEEE International Conference on Data Mining, 2002. Proceedings..

[13]  Luc De Raedt,et al.  Molecular feature mining in HIV data , 2001, KDD '01.

[14]  Tomasz Imielinski,et al.  Mining association rules between sets of items in large databases , 1993, SIGMOD Conference.

[15]  David S. Johnson,et al.  Computers and Intractability: A Guide to the Theory of NP-Completeness , 1978 .

[16]  Rakesh Agrawal,et al.  Parallel Mining of Association Rules , 1996, IEEE Trans. Knowl. Data Eng..

[17]  Giuseppe Di Fatta,et al.  Distributed Mining of Molecular Fragments , 2004 .

[18]  Falk Schreiber,et al.  Towards Motif Detection in Networks: Frequency Concepts and Flexible Search , 2004 .

[19]  Srinivasan Parthasarathy,et al.  Parallel algorithms for mining frequent structural motifs in scientific data , 2004, ICS '04.

[20]  Christian Borgelt,et al.  Mining molecular fragments: finding relevant substructures of molecules , 2002, 2002 IEEE International Conference on Data Mining, 2002. Proceedings..

[21]  G. Karypis,et al.  Frequent sub-structure-based approaches for classifying chemical compounds , 2005, Third IEEE International Conference on Data Mining.

[22]  Mohammed J. Zaki Parallel and distributed association mining: a survey , 1999, IEEE Concurr..

[23]  Giuseppe Di Fatta,et al.  High Performance Subgraph Mining in Molecular Compounds , 2005, HPCC.

[24]  Katherine Yelick,et al.  Randomized load balancing for tree-structured computation , 1994, Proceedings of IEEE Scalable High Performance Computing Conference.

[25]  M. Boyd,et al.  New soluble-formazan assay for HIV-1 cytopathic effects: application to high-flux screening of synthetic and natural products for AIDS-antiviral activity. , 1989, Journal of the National Cancer Institute.

[26]  Takashi Washio,et al.  State of the art of graph-based data mining , 2003, SKDD.

[27]  Vipin Kumar,et al.  Scalable parallel data mining for association rules , 1997, SIGMOD '97.