Distributed data mining in grid computing environments

Distributed Data Mining (DDM), the computation-intensive mining of data that is inherently distributed across the Internet, calls for the support of a powerful Grid with an effective scheduling framework. DDM often follows a computing paradigm of local processing followed by global synthesis. It involves every phase of the Data Mining (DM) process, which makes the DDM workflow very complex: it can be modelled only by a Directed Acyclic Graph (DAG) with multiple data entries. Motivated by the need for a practical solution to the Grid scheduling problem for the DDM workflow, this paper proposes a novel two-phase scheduling framework, comprising External Scheduling and Internal Scheduling, on a two-level Grid architecture (InterGrid, IntraGrid). So far, a DM IntraGrid named DMGCE (Data Mining Grid Computing Environment) has been developed with a dynamic scheduling framework for competitive DAGs in a heterogeneous computing environment. The system is implemented in an established Multi-Agent System (MAS) environment, in which existing DM algorithms are reused by encapsulating them in agents. Practical classification problems from oil well logging analysis are used to measure system performance. The detailed experimental procedure and an analysis of the results are also discussed in this paper.
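As a rough illustration of the workflow shape described above, the following sketch models a DDM job as a DAG with several data entries (local mining tasks) feeding one global synthesis task, and places the tasks on heterogeneous resources with a simple greedy list scheduler. This is not the DMGCE scheduler or the External/Internal Scheduling framework proposed in the paper. The task names, costs, resource speeds, and the `schedule` function are all hypothetical, and communication costs are ignored.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each task maps to the set of tasks it depends on.
# The three local mining tasks are the data entries; one task synthesizes them.
workflow = {
    "mine_site_a": set(),
    "mine_site_b": set(),
    "mine_site_c": set(),
    "synthesize": {"mine_site_a", "mine_site_b", "mine_site_c"},
}

# Hypothetical cost of each task, in abstract work units.
work = {"mine_site_a": 4.0, "mine_site_b": 6.0, "mine_site_c": 2.0, "synthesize": 3.0}

# Hypothetical heterogeneous resources: work units processed per time unit.
speed = {"fast_node": 2.0, "slow_node": 1.0}


def schedule(workflow, work, speed):
    """Greedy list scheduling (an assumed stand-in heuristic).

    Tasks are visited in topological order; each is placed on the resource
    that gives it the earliest finish time. Returns a dict mapping
    task -> (resource, start_time, end_time).
    """
    free_at = {resource: 0.0 for resource in speed}  # when each resource is idle
    finish = {}                                      # finish time of each placed task
    placement = {}
    for task in TopologicalSorter(workflow).static_order():
        # A task is ready once all of its predecessors have finished.
        ready = max((finish[dep] for dep in workflow[task]), default=0.0)
        best = None
        for resource, rate in speed.items():
            start = max(ready, free_at[resource])
            end = start + work[task] / rate
            if best is None or end < best[2]:
                best = (resource, start, end)
        resource, start, end = best
        free_at[resource] = end
        finish[task] = end
        placement[task] = best
    return placement


if __name__ == "__main__":
    for task, (resource, start, end) in schedule(workflow, work, speed).items():
        print(f"{task}: {resource} from {start} to {end}")
```

The sketch handles a single DAG with a static heuristic. Scheduling several competing DAGs dynamically, as the abstract describes, would require more than this, for example arbitration between workflows and reaction to changing resource availability.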