The National Scalable Cluster Project: three lessons about high performance data mining and data intensive computing

We discuss three principles learned from experience with the National Scalable Cluster Project. First, storing, managing, and mining massive data sets requires systems that exploit parallelism, which can be achieved with shared-nothing clusters and careful attention to I/O paths. Second, exploiting data parallelism at the file and record level maps data-intensive problems efficiently onto clusters and is particularly well suited to data mining. Finally, the repetitive nature of data mining demands that special attention be given to data layout on the hardware and to software access patterns, while maintaining a storage schema that is easily derived from the legacy form of the data.
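As an illustration of the second principle, the following is a minimal sketch of file-level and record-level data parallelism, not the project's actual software. All names (`assign_files`, `mine_shard`, `mine_cluster`, the `part-*.csv` keys) and the toy label-counting task are assumptions made for the example, and threads in one process stand in for the nodes of a shared-nothing cluster, where each node would in practice read only the files on its own local disks.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def assign_files(file_names, num_nodes):
    """File-level parallelism: deal whole files out to nodes round-robin."""
    shards = [[] for _ in range(num_nodes)]
    for i, name in enumerate(sorted(file_names)):
        shards[i % num_nodes].append(name)
    return shards


def mine_shard(shard, storage):
    """Record-level parallelism: every record is processed independently.

    The 'mining' step is a toy one: count the label stored in the last
    field of each comma-separated record.
    """
    counts = Counter()
    for name in shard:
        for record in storage[name]:
            label = record.rsplit(",", 1)[-1].strip()
            counts[label] += 1
    return counts


def mine_cluster(storage, num_nodes):
    """Run one worker per simulated node, then merge the partial results."""
    shards = assign_files(storage.keys(), num_nodes)
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partials = list(pool.map(lambda s: mine_shard(s, storage), shards))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Hypothetical data: three 'files' of comma-separated records.
storage = {
    "part-0.csv": ["1.0,2.0,yes", "0.5,1.5,no"],
    "part-1.csv": ["3.0,0.1,yes"],
    "part-2.csv": ["2.2,2.9,no", "0.0,0.0,no"],
}
result = mine_cluster(storage, num_nodes=2)
```

Because each worker touches only its own shard and the partial counts are merged once at the end, the workers need no communication while scanning, which is the property that makes this pattern a natural fit for shared-nothing clusters.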
