Flexible Data-Aware Scheduling for Workflows over an In-memory Object Store

This paper explores techniques for improving the performance of many-task workflows based on the Swift scripting language. We propose new programmer options for automated distributed data placement and task scheduling. These options trigger a data placement mechanism that distributes intermediate workflow data over the servers of Hercules, a distributed key-value store that can be used to cache file system data. We demonstrate that these mechanisms can improve the aggregated throughput of many-task workflows by up to 86x, reduce contention on the shared file system, exploit data locality, and trade off locality against load balance.
