Exploiting in-memory storage for improving workflow executions in cloud platforms

The Data Mining Cloud Framework (DMCF) is an environment for designing and executing data analysis workflows on cloud platforms. Currently, DMCF relies on the default storage of the public cloud provider for all I/O operations, so its I/O performance is limited by the performance of that storage. In this work, we propose using the Hercules system within DMCF as an ad hoc storage system for the temporary data produced inside workflow-based applications. Hercules is a highly scalable, easy-to-deploy distributed in-memory storage system. The proposed solution takes advantage of the scalability of Hercules to avoid the bandwidth limits of the default storage. To demonstrate the viability of the proposed solution, we first compared the performance of Hercules with that of Microsoft Azure Storage using synthetic benchmarks. We then evaluated the integration of Hercules and DMCF on a real application consisting of a workflow that accesses temporary data using either Azure Storage or Hercules. In this real-life scenario, Hercules reduced the I/O overhead by 36% with respect to Azure Storage, leading to a 13% reduction in total execution time. This confirms that our in-memory approach is effective in improving the performance of data-intensive workflow executions on cloud-based platforms.
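The two reported percentages can be related with a short back-of-the-envelope calculation. The sketch below is not taken from the paper: it assumes that only the I/O time changed between the Azure Storage and Hercules runs and that all other execution time stayed constant, and the function name `io_share_of_baseline` is purely illustrative. Under that assumption, the share of the baseline execution time spent on I/O is the total reduction divided by the I/O reduction.

```python
def io_share_of_baseline(io_reduction: float, total_reduction: float) -> float:
    """Estimate the fraction of baseline run time spent on I/O.

    Assumption (not stated in the source): only I/O time changed,
    so total_reduction = io_share * io_reduction.
    """
    if not 0 < io_reduction <= 1:
        raise ValueError("io_reduction must be in (0, 1]")
    if not 0 <= total_reduction <= io_reduction:
        raise ValueError("total_reduction must be in [0, io_reduction]")
    return total_reduction / io_reduction


# Figures reported in the abstract: 36% less I/O overhead, 13% less total time.
share = io_share_of_baseline(io_reduction=0.36, total_reduction=0.13)
print(f"Implied I/O share of baseline execution time: {share:.0%}")
```

Under this simplifying assumption, I/O would account for roughly a third of the baseline execution time, which is consistent with describing the workflow as data-intensive. The actual breakdown may differ if other phases of the workflow were also affected.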
