Programming Visual and Script-based Big Data Analytics Workflows on Clouds

Data analysis applications often involve large datasets and complex software systems in which multiple data processing tools are executed in a coordinated way. Data analysis workflows are effective in expressing task coordination, and they can be designed through visual and script-based programming paradigms. The Data Mining Cloud Framework (DMCF) supports the design and scalable execution of data analysis applications on Cloud platforms. A workflow in DMCF can be developed using a visual or a script-based language. The visual language, called VL4Cloud, follows a design approach aimed at high-level users, e.g., domain expert analysts with limited knowledge of programming paradigms. The script-based language, JS4Cloud, is provided as a flexible programming paradigm for skilled users who prefer to code their workflows as scripts. Both languages implement data-driven task parallelism, which spawns ready-to-run tasks to Cloud resources. In addition, they exploit implicit parallelism, which frees users from duties such as workload partitioning, synchronization and communication. In this chapter, we present the DMCF framework and discuss how its workflow paradigm has been integrated with the MapReduce model. In particular, we describe how VL4Cloud/JS4Cloud workflows can include MapReduce tools, and how these workflows are executed in parallel on DMCF, enabling scalable data processing on Clouds.
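The data-driven task parallelism mentioned above can be illustrated with a minimal Python sketch. It is not DMCF code and does not use VL4Cloud or JS4Cloud syntax: the function `run_workflow`, the task tuple layout, and the toy workflow are all hypothetical, and a local thread pool stands in for Cloud resources. The only idea it aims to show is that a task is spawned as soon as every data item it reads is available, with no explicit synchronization written by the user.

```python
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait


def run_workflow(tasks, initial_data, max_workers=4):
    """Run each task as soon as all of its input data items exist.

    tasks: dict mapping task name -> (function, input names, output name)
    initial_data: dict of data items available before execution starts
    (Hypothetical interface, for illustration only.)
    """
    data = dict(initial_data)
    pending = dict(tasks)
    running = {}  # future -> (task name, output name)
    order = []    # order in which task results were collected

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while pending or running:
            # Spawn every pending task whose inputs are all available.
            ready = [name for name, (_, inputs, _) in pending.items()
                     if all(item in data for item in inputs)]
            for name in ready:
                func, inputs, output = pending.pop(name)
                future = pool.submit(func, *(data[item] for item in inputs))
                running[future] = (name, output)
            if not running:
                # Nothing is running and nothing can start.
                raise RuntimeError(
                    f"tasks with unsatisfiable inputs: {sorted(pending)}")
            # Block until at least one task finishes, then publish its output.
            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for future in done:
                name, output = running.pop(future)
                data[output] = future.result()
                order.append(name)
    return data, order


# Toy workflow: split a dataset, process both halves independently, merge.
# Declaration order is irrelevant; data availability drives execution.
tasks = {
    "merge": (lambda a, b: a + b, ["sum_a", "sum_b"], "total"),
    "sum_a": (sum, ["part_a"], "sum_a"),
    "sum_b": (sum, ["part_b"], "sum_b"),
    "split_a": (lambda d: d[:len(d) // 2], ["dataset"], "part_a"),
    "split_b": (lambda d: d[len(d) // 2:], ["dataset"], "part_b"),
}
data, order = run_workflow(tasks, {"dataset": list(range(10))})
print(data["total"], order[-1])
```

In this sketch the two `split_*` tasks can run concurrently, each `sum_*` task starts once its partition exists, and `merge` runs last because it depends on both sums. The user only declares inputs and outputs; the partitioning of work across workers and the synchronization are left to the executor.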
