Exploiting Parallelism in a Structural Scientific Discovery System to Improve Scalability

The large amount of data collected today is quickly overwhelming researchers' ability to interpret it and discover interesting patterns. Knowledge discovery and data mining approaches hold the potential to automate the interpretation process, but they frequently rely on computationally expensive algorithms. In particular, scientific discovery systems focus on using richer data representations, sometimes without regard for scalability. This research investigates approaches for scaling a particular knowledge discovery in databases (KDD) system, SUBDUE, using parallel and distributed resources. SUBDUE has been used to discover interesting and repetitive concepts in graph-based databases from a variety of domains, but it requires a substantial amount of processing time. Experiments on CAD circuit databases and artificially generated databases demonstrate the scalability of parallel versions of SUBDUE, and potential achievements and obstacles are discussed.
