Performance-based data distribution for data mining applications on grid computing environments

Effective data distribution techniques can significantly reduce the total execution time of a program in grid computing environments, especially for data mining applications. In this paper, we describe a linear programming formulation of the data distribution problem on grids. We also propose a heuristic method, the Heuristic Data Distribution Scheme (HDDS), to solve this problem. We implement two types of data mining applications, Association Rule Mining and Decision Tree Construction, and conduct experiments on grid testbeds. Experimental results show that data mining programs that use HDDS to distribute data execute more efficiently than those that use traditional schemes.
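As a rough illustration of what performance-based data distribution means in general, the sketch below splits a dataset across nodes in proportion to a per-node performance score. It is not the linear programming formulation or HDDS, neither of which is specified in this abstract; the function name `proportional_split` and the `node_speeds` scores are hypothetical, and communication cost is ignored.

```python
def proportional_split(total_records, node_speeds):
    """Assign record counts to nodes in proportion to their speed scores.

    Hypothetical helper for illustration only: `node_speeds` is an assumed
    list of positive relative performance scores, one per grid node.
    """
    total_speed = sum(node_speeds)
    # Floor of each node's proportional share.
    shares = [int(total_records * s / total_speed) for s in node_speeds]
    # Records left over after flooring go to the fastest nodes, one each.
    remainder = total_records - sum(shares)
    fastest_first = sorted(
        range(len(node_speeds)), key=lambda i: node_speeds[i], reverse=True
    )
    for i in fastest_first[:remainder]:
        shares[i] += 1
    return shares


if __name__ == "__main__":
    # Four nodes; the first is assumed twice as fast as the second, and so on.
    print(proportional_split(1000, [4, 2, 1, 1]))
```

A faster node receives a larger slice, so all nodes would finish at roughly the same time if processing time scaled linearly with the number of records and transfer time were negligible. Both are simplifying assumptions of this sketch.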
