Optimizing the execution of many-task computing applications using in-memory distributed file systems

In many-task computing (MTC), applications such as scientific workflows or parameter sweeps communicate via intermediate files, so application performance depends strongly on the file system in use. The state of the art uses runtime systems that provide in-memory file storage designed for data locality: files are placed on the nodes that write or read them. With data locality, however, task distribution conflicts with data distribution, leading to application slowdown and, worse, to prohibitive storage imbalance. To overcome these limitations, we present MemFS, a fully symmetrical, in-memory runtime file system that stripes files across all compute nodes, based on a distributed hash function. Our cluster experiments with Montage and BLAST workflows, using up to 512 cores, show that MemFS has both better performance and better scalability than the state-of-the-art, locality-based file system, AMFS. Furthermore, our evaluation on a public commercial cloud validates our cluster results. On this platform, MemFS shows excellent scalability up to 1024 cores and is able to saturate the 10 Gigabit Ethernet bandwidth when running BLAST and Montage. The contents of this chapter were originally published in the journal Future Generation Computer Systems, volume 54, 2016, and have been slightly modified to improve readability.
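The idea of striping a file across all nodes with a hash function can be illustrated with a minimal Python sketch. This is not the MemFS implementation: the stripe size, the use of SHA-1, the simple modulo placement, and the names `STRIPE_SIZE`, `node_for`, `stripe_file`, and `read_file` are all assumptions chosen for illustration only.

```python
import hashlib

# Hypothetical stripe size in bytes; kept tiny so the example is easy to follow.
STRIPE_SIZE = 4


def node_for(path: str, stripe_index: int, nodes: list[str]) -> str:
    """Map one stripe of a file to a node by hashing (path, stripe index).

    Assumption: a plain hash-modulo placement stands in for whatever
    distributed hash function the real system uses.
    """
    key = f"{path}:{stripe_index}".encode()
    digest = hashlib.sha1(key).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


def stripe_file(path: str, data: bytes, nodes: list[str]) -> list[tuple[int, str, bytes]]:
    """Split data into fixed-size stripes and assign each stripe to a node."""
    placement = []
    for offset in range(0, len(data), STRIPE_SIZE):
        index = offset // STRIPE_SIZE
        chunk = data[offset:offset + STRIPE_SIZE]
        placement.append((index, node_for(path, index, nodes), chunk))
    return placement


def read_file(placement: list[tuple[int, str, bytes]]) -> bytes:
    """Reassemble the file by concatenating stripes in index order."""
    ordered = sorted(placement, key=lambda entry: entry[0])
    return b"".join(chunk for _, _, chunk in ordered)


if __name__ == "__main__":
    nodes = ["node0", "node1", "node2", "node3"]  # hypothetical node names
    data = b"abcdefghijklmnopqrstuvwxyz"
    for index, node, chunk in stripe_file("/work/tile_001.fits", data, nodes):
        print(index, node, chunk)
```

Because placement depends only on the file path and the stripe index, any node can compute where a stripe lives without consulting the node that wrote it, which is what makes the layout independent of where tasks are scheduled.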
