Mining Molecular Datasets on Symmetric Multiprocessor Systems

Although about a dozen sophisticated algorithms for mining frequent fragments in molecular databases have been proposed in the last few years, searching large databases of 100,000 compounds or more is still time-consuming. Even the fastest current algorithms, such as gSpan, FFSM, Gaston, or MoFa, require hours to complete their tasks. This paper presents thread-based parallel versions of MoFa [5] and gSpan [26] that achieve speedups of up to 11 on a shared-memory SMP system using 12 processors. We discuss the design space of the parallelization, the results, and the obstacles caused by the irregular search space and by the current state of Java technology.

[1] Jong-Deok Choi, et al. Escape analysis for Java, 1999, OOPSLA '99.

[2] Christian Borgelt, et al. Canonical Forms for Frequent Graph Mining, 2006, GfKl.

[3] Ruoming Jin, et al. Shared Memory Parallelization of Data Mining Algorithms: Techniques, Programming Interface, and Performance, 2002.

[4] Wei Wang, et al. Efficient mining of frequent subgraphs in the presence of isomorphism, 2003, Third IEEE International Conference on Data Mining.

[5] Rakesh Agrawal, et al. Mining association rules between sets of items in large databases, 1993.

[6] Jiawei Han, et al. gSpan: graph-based substructure pattern mining, 2002, 2002 IEEE International Conference on Data Mining, 2002. Proceedings.

[7] Tomasz Imielinski, et al. Mining association rules between sets of items in large databases, 1993, SIGMOD Conference.

[8] Rajiv Arora, et al. Java server performance: A case study of building efficient, scalable Jvms, 2000, IBM Syst. J.

[9] Vipin Kumar, et al. Parallel depth first search. Part II. Analysis, 1987, International Journal of Parallel Programming.

[10] Giuseppe Di Fatta, et al. Distributed Mining of Molecular Fragments, 2004.

[11] Erez Petrank, et al. Thread-local heaps for Java, 2002, ISMM '02.

[12] Joost N. Kok, et al. A quickstart in frequent structure mining can make a difference, 2004, KDD.

[13] George Karypis, et al. An efficient algorithm for discovering frequent subgraphs, 2004, IEEE Transactions on Knowledge and Data Engineering.

[14] Srinivasan Parthasarathy, et al. Parallel Data Mining for Association Rules on Shared-memory Systems, 1998.

[15] Lawrence B. Holder, et al. Approaches to Parallel Graph-Based Knowledge Discovery, 2001, J. Parallel Distributed Comput.

[16] Ruoming Jin, et al. Shared memory parallelization of data mining algorithms: techniques, programming interface, and performance, 2005, IEEE Transactions on Knowledge and Data Engineering.

[17] Thorsten Meinl, et al. A Quantitative Comparison of the Subgraph Miners MoFa, gSpan, FFSM, and Gaston, 2005, PKDD.

[18] G. Amdahl, et al. Validity of the single processor approach to achieving large scale computing capabilities, 1967, AFIPS '67 (Spring).

[19] Srinivasan Parthasarathy, et al. New Algorithms for Fast Discovery of Association Rules, 1997, KDD.

[20] Christian Borgelt, et al. Mining molecular fragments: finding relevant substructures of molecules, 2002, 2002 IEEE International Conference on Data Mining, 2002. Proceedings.