Client-side Straggler-Aware I/O Scheduler for Object-based Parallel File Systems

Object-based parallel file systems have emerged as promising storage solutions for high-performance computing (HPC) systems. Although object storage provides a flexible interface, scheduling highly concurrent I/O requests that access a large number of objects remains challenging, especially when stragglers (storage servers that are significantly slower than the others) exist in the system. An efficient I/O scheduler needs to avoid possible stragglers to achieve low latency and high throughput. In this paper, we introduce a log-assisted, straggler-aware I/O scheduling technique to mitigate the impact of storage server stragglers. The contribution of this study is threefold. First, we introduce a client-side, log-assisted, straggler-aware I/O scheduler architecture to tackle the storage straggler issue in HPC systems. Second, we present three scheduling algorithms that, based on this architecture, make efficient scheduling decisions for I/Os while avoiding stragglers. Third, we evaluate the proposed I/O scheduler using simulations, and the simulation results confirm the promise of the newly introduced straggler-aware I/O scheduler.
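The core idea of a client-side, log-assisted scheduler can be illustrated with a minimal sketch. This is not the paper's actual algorithm; the class, the sliding-window latency log, and the tie-breaking policy are all illustrative assumptions. The sketch keeps a per-server log of recently observed I/O latencies on the client and routes each request to the candidate server with the lowest estimated latency, thereby steering traffic away from stragglers:

```python
import random
from collections import defaultdict, deque


class StragglerAwareScheduler:
    """Hypothetical sketch of a client-side, log-assisted scheduler:
    route each object I/O to the candidate server with the lowest
    recent average latency, as recorded in a client-side log."""

    def __init__(self, window=32):
        # Per-server sliding window of recently observed latencies.
        self.log = defaultdict(lambda: deque(maxlen=window))

    def record(self, server, latency):
        # Client-side log update after each completed I/O.
        self.log[server].append(latency)

    def estimate(self, server):
        # Recent average latency; unseen servers are treated
        # optimistically so they still receive probe traffic.
        hist = self.log[server]
        return sum(hist) / len(hist) if hist else 0.0

    def pick(self, candidates):
        # Choose the candidate with the lowest estimated latency;
        # break ties randomly to spread load.
        best = min(self.estimate(s) for s in candidates)
        return random.choice(
            [s for s in candidates if self.estimate(s) == best]
        )


if __name__ == "__main__":
    sched = StragglerAwareScheduler()
    for lat in (1.0, 1.1):
        sched.record("s1", lat)
    for lat in (9.0, 8.5):
        sched.record("s2", lat)  # s2 behaves like a straggler
    print(sched.pick(["s1", "s2"]))  # prefers the faster server s1
```

A production scheduler would also need to age out stale log entries and handle replica placement constraints; the sketch only shows the straggler-avoidance decision itself.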
