Lessons Learned from Optimizing the Sunway Storage System for Higher Application I/O Performance

It is hard for applications to fully utilize the peak bandwidth of the storage system in high-performance computers because of I/O interference, storage resource misallocation, and long, complex I/O paths. We performed several studies to bridge this gap in the Sunway storage system, which serves the supercomputer Sunway TaihuLight. To locate these issues and the connections between them, we developed an end-to-end performance monitoring and diagnosis tool to understand the I/O behaviors of applications and the system. With the help of this tool, we were able to find the root causes of such performance barriers at the I/O forwarding layer and the parallel file system (PFS) layer. An application-aware I/O forwarding allocation framework was used to address I/O interference and resource misallocation at the I/O forwarding layer. A performance-aware data placement mechanism was proposed to mitigate the impact of I/O interference and the performance variation of storage devices in the PFS. Together, these optimizations allowed applications to achieve much better I/O performance. During this process, we also proposed a lightweight storage stack to shorten the I/O path of applications with the N-N I/O pattern. This paper summarizes these studies and presents the lessons learned from the process.
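
To make the performance-aware data placement idea above concrete, here is a minimal sketch, not the paper's implementation: all names (StorageTarget, recent_bandwidth_mbps, choose_targets) and the weighting scheme are assumptions made for illustration. The sketch steers new file stripes away from slow or congested storage targets by weighting target selection with recently observed per-target bandwidth, which a monitoring tool could supply.

```python
# Hypothetical sketch of performance-aware data placement; names and the
# bandwidth-weighted selection policy are illustrative assumptions, not the
# Sunway storage system's actual mechanism.
import random
from dataclasses import dataclass


@dataclass
class StorageTarget:
    name: str
    recent_bandwidth_mbps: float  # e.g., a sliding-window average from monitoring


def choose_targets(targets, stripe_count, rng=random.Random(0)):
    """Pick stripe_count distinct targets, favoring higher observed bandwidth.

    A real placement policy would also consider capacity, load, and failure
    domains; weighting by recent bandwidth already avoids obvious stragglers.
    """
    pool = list(targets)
    weights = [t.recent_bandwidth_mbps for t in pool]
    chosen = []
    for _ in range(min(stripe_count, len(pool))):
        idx = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        chosen.append(pool.pop(idx))
        weights.pop(idx)
    return chosen


if __name__ == "__main__":
    # One target is much slower (e.g., contended or fail-slow) and is rarely picked.
    osts = [StorageTarget(f"OST{i:03d}", bw)
            for i, bw in enumerate([900.0, 850.0, 120.0, 880.0])]
    print([t.name for t in choose_targets(osts, stripe_count=2)])
```

Compared with round-robin striping, weighting placement by measured bandwidth is what makes the policy "performance-aware": targets whose observed performance degrades automatically receive fewer new stripes until they recover.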
