A Service-Oriented Framework for Executing Data Mining Workflows on Grids

Workflow environments are widely used in data mining systems to manage the data and execution flows associated with complex applications. Weka, one of the most widely used open-source data mining systems, includes the KnowledgeFlow environment, which provides a drag-and-drop interface for composing and executing data mining workflows. However, the Weka KnowledgeFlow can execute a workflow only on a single computer, even though most data mining workflows include several independent branches that could run in parallel on a set of distributed machines to reduce the overall execution time. We implemented distributed workflow execution in Weka4WS, a framework that extends Weka and its KnowledgeFlow environment to exploit distributed resources available in a Grid using Web Service technologies. In this paper we describe the Weka4WS architecture and the functionality provided by its service-oriented KnowledgeFlow component, and show how it can be used to compose and execute simple parallel data mining workflows. We also present ongoing work aimed at supporting data-parallel workflows on a Grid.
