A Case Study in Using Discrete-Event Simulation to Improve the Scalability of MG-RAST

As the cost of DNA sequencing has decreased, computational biology data processing platforms face a growing volume of data analysis requests. MG-RAST, the metagenomics analysis server at Argonne National Laboratory, is one such platform and receives several terabytes of data submissions per month. MG-RAST currently relies on Shock, a central object-based data store, for data access and storage. Under high data transfer loads Shock can become a bottleneck, increasing the job response time for end users. In this work, we use discrete-event simulation to explore the use of data proxies together with an enhanced, proxy-aware scheduling methodology designed to reduce the movement of intermediate data generated during workflow processing. In this approach, Shock is supplemented with proxy storage servers backed by solid-state drives, which decentralize the management of intermediate workflow results and hence reduce their movement. Discrete-event simulation lets us evaluate the performance of MG-RAST under increased workloads without disrupting the production system. For our case study, we extrapolate scientific workflows obtained from MG-RAST to represent future usage trends. We demonstrate that adding proxies and proxy-aware scheduling significantly reduces data movement overhead by distributing the data plane, leading to a substantial improvement in end-user job response time.
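The idea can be illustrated with a toy discrete-event simulation. The sketch below is not the simulator used in this work; it is a minimal illustration under several assumptions: each storage server handles one transfer at a time in FIFO order, compute time is ignored, transfer durations are made-up numbers, and "proxy-aware scheduling" is approximated by pinning each job's intermediate transfers to one proxy. The function name `mean_response_time` and the server names are hypothetical.

```python
import heapq
from itertools import count


def mean_response_time(jobs, n_proxies):
    """Toy discrete-event simulation of workflow data transfers.

    jobs: list of (arrival_time, [transfer_seconds, ...]) tuples.
    n_proxies: 0 sends every transfer through the central store;
               a positive value sends intermediate transfers to a proxy.
    """
    free_at = {}   # server name -> time at which it next becomes idle
    tie = count()  # tie-breaker so the heap never compares beyond the counter
    events = [(arrival, next(tie), job_id, 0)
              for job_id, (arrival, _) in enumerate(jobs)]
    heapq.heapify(events)

    total = 0.0
    while events:
        now, _, job_id, step = heapq.heappop(events)
        arrival, transfers = jobs[job_id]
        if step == len(transfers):
            # All transfers are done: record this job's response time.
            total += now - arrival
            continue

        # The first and last transfers (raw input, final result) always use
        # the central store; intermediate results may stay on a proxy.
        intermediate = 0 < step < len(transfers) - 1
        if n_proxies > 0 and intermediate:
            server = f"proxy-{job_id % n_proxies}"  # assumed pinning policy
        else:
            server = "shock"

        # Events are popped in time order, so this models FIFO service.
        start = max(now, free_at.get(server, 0.0))
        finish = start + transfers[step]
        free_at[server] = finish
        heapq.heappush(events, (finish, next(tie), job_id, step + 1))

    return total / len(jobs)


# Four identical jobs arriving together; durations are illustrative only.
jobs = [(0.0, [1.0, 4.0, 4.0, 1.0]) for _ in range(4)]
central = mean_response_time(jobs, n_proxies=0)
proxied = mean_response_time(jobs, n_proxies=4)
print(f"central store only: {central:.1f} s, with proxies: {proxied:.1f} s")
```

In this toy setting the central store serializes every transfer, whereas the proxies absorb the intermediate transfers in parallel, so the mean response time drops. The size of the effect here depends entirely on the assumed durations and says nothing about the magnitude measured for MG-RAST.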
