A Heuristic Data Distribution Scheme for Data Mining Applications on Grid Environments

Effective data distribution techniques can significantly reduce the total execution time of a program in grid computing environments, especially for data mining applications. In this paper, we describe a linear programming formulation of the data distribution problem on grids. We also propose a heuristic method, named HDDS (heuristic data distribution scheme), to solve this problem. We implement a parallel association rule mining method and conduct experiments on our grid testbed. The experimental results show that data mining programs that use HDDS to distribute data execute more efficiently than those that use traditional schemes.
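To make the data distribution problem concrete, the following minimal sketch splits a dataset across heterogeneous nodes so that all nodes finish at roughly the same time. It is not the linear programming formulation or the HDDS heuristic of this paper, neither of which is specified in the abstract. It rests on a simplifying assumption: each node has a single known cost per record (transfer plus processing time), and the nodes work independently of one another. The function name `split_records` and the parameter `cost_per_record` are hypothetical.

```python
from fractions import Fraction


def split_records(total, cost_per_record):
    """Split `total` records so that per-node finish times are nearly equal.

    Assumption (not from the paper): node i needs cost_per_record[i] seconds
    per record, covering both transfer and processing, and nodes do not
    interfere with each other. Under that model, the makespan is minimized
    when every node receives a share proportional to 1 / cost.
    """
    if total < 0 or not cost_per_record:
        raise ValueError("need a non-negative total and at least one node")
    if any(c <= 0 for c in cost_per_record):
        raise ValueError("per-record costs must be positive")

    # Processing rate of each node in records per second (exact arithmetic).
    rates = [1 / Fraction(c) for c in cost_per_record]
    rate_sum = sum(rates)

    # Ideal (possibly fractional) share for each node.
    exact = [total * r / rate_sum for r in rates]

    # Round down, then give the leftover records to the nodes whose ideal
    # share had the largest fractional part (largest-remainder rounding).
    shares = [int(x) for x in exact]
    leftover = total - sum(shares)
    by_remainder = sorted(
        range(len(exact)), key=lambda i: exact[i] - shares[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        shares[i] += 1
    return shares


if __name__ == "__main__":
    # Two slow nodes (2 s per record) and one fast node (1 s per record).
    print(split_records(1000, [2, 2, 1]))
```

In this toy model the fast node receives twice as many records as each slow node. A realistic grid setting would also have to account for factors such as shared network links and sequential transfers from the data source, which this sketch ignores.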
