Parallel Data Mining on ATM-Connected PC Cluster and Optimization of Its Execution Environments

In this paper, we have constructed a large-scale ATM-connected PC cluster consisting of 100 PCs, implemented a data mining application on it, and optimized its execution environment. The default parameters of the TCP retransmission mechanism cannot provide good performance for the data mining application, since many collisions occur during all-to-all multicasting in a large-scale PC cluster. Using TCP retransmission parameters set according to the proposed parameter optimization, a reasonably good performance improvement is achieved for parallel data mining on 100 PCs. Association rule mining, one of the best-known problems in data mining, differs from conventional scientific calculations in its usage of main memory. We have investigated the feasibility of using available memory on remote nodes as a swap area when working nodes need to swap out their real memory contents. According to the experimental results on our PC cluster, the proposed method is expected to be considerably better than using hard disks as a swapping device.
