Paper Information - Flexible Data-Aware Scheduling for Workflows over an In-memory Object Store

Flexible Data-Aware Scheduling for Workflows over an In-memory Object Store

This paper explores techniques for improving the performance of many-task workflows based on the Swift scripting language. We propose new programmer options for automated distributed data placement and task scheduling. These options trigger a data placement mechanism that distributes intermediate workflow data over the servers of Hercules, a distributed key-value store that can be used to cache file system data. We demonstrate that these mechanisms can significantly improve the aggregated throughput of many-task workflows, by up to 86x, while reducing contention on the shared file system, exploiting data locality, and trading off locality against load balance.
