Parallel Data Mining on a Beowulf Cluster

This paper presents a parallel data mining application for predictive modelling that runs on a Beowulf-style Linux cluster. Data mining, or Knowledge Discovery in Databases (KDD), is the process of analysing large and complex data sets to extract useful and previously unknown knowledge. Predictive modelling is the task of predicting one attribute from a model built on one or more other attributes in a data collection. We describe two methods for predictive modelling of high-dimensional data sets: ADDFIT, which implements additive models, and HISURF, which uses wavelets for high-dimensional surface smoothing. We then present a parallel implementation on a distributed-memory cluster architecture that uses the scripting language Python as a flexible front end to facilitate user interaction, control the parallel application, and generate graphical output.
