Prediction-based auto-scaling of scientific workflows

In this paper, we propose a novel method for auto-scaling data-centric workflow tasks. Scaling is driven by a prediction mechanism in which the input data load on each task within a workflow is used to estimate that task's execution time. Using these load predictions, the framework can make informed decisions about scaling multiple workflow tasks independently, improving overall throughput and reducing workflow bottlenecks. We implemented this method in the WS-VLAM workflow system and, using an image analysis workflow, show that the technique achieves faster data processing rates and reduces overall workflow makespan.
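The prediction-then-scale idea can be illustrated with a minimal sketch. It is not the WS-VLAM implementation: the linear cost model (queued items times mean seconds per item), the names `TaskState`, `predicted_time`, `instances_needed`, and `plan_scaling`, and the parameters `target_secs` and `max_instances` are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class TaskState:
    """Hypothetical per-task bookkeeping; all fields are assumptions."""
    name: str
    queued_items: int      # input data load currently waiting on this task
    secs_per_item: float   # observed mean processing time per item
    instances: int = 1     # number of running copies of the task


def predicted_time(task: TaskState) -> float:
    """Estimate time to drain the queue, assuming work splits evenly."""
    return task.queued_items * task.secs_per_item / task.instances


def instances_needed(task: TaskState, target_secs: float, max_instances: int) -> int:
    """Smallest instance count whose predicted time meets the target, capped."""
    needed = math.ceil(task.queued_items * task.secs_per_item / target_secs)
    return max(1, min(max_instances, needed))


def plan_scaling(tasks, target_secs=60.0, max_instances=8):
    """Decide an instance count for each task independently of the others."""
    return {t.name: instances_needed(t, target_secs, max_instances) for t in tasks}


if __name__ == "__main__":
    workflow = [
        TaskState("filter", queued_items=100, secs_per_item=0.5),
        TaskState("segment", queued_items=600, secs_per_item=1.0),
        TaskState("classify", queued_items=240, secs_per_item=0.5),
    ]
    print(plan_scaling(workflow))
```

In this sketch, each task is sized from its own load only, so a heavily loaded task can be scaled out without touching the rest of the workflow. A real system would also have to account for resource acquisition delays and non-linear task costs, which are ignored here.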
