Evaluating data caching techniques in DMCF workflows using Hercules

The Data Mining Cloud Framework (DMCF) is an environment for designing and executing data analysis workflows on cloud platforms. Currently, DMCF relies on the default storage of the public cloud provider for every I/O-related operation, so its I/O performance is bounded by the performance of that storage. In this work we propose using the Hercules system within DMCF as an ad-hoc storage system for temporary data produced inside workflow-based applications. Hercules is a distributed in-memory storage system that is highly scalable and easy to deploy. The proposed solution takes advantage of the scalability of Hercules to avoid the bandwidth limits of the default storage. Early experimental results presented in this paper show promising performance, particularly for write operations, compared with the performance obtained using the default storage services.
