Challenges and Solutions for Tracing Storage Systems

IBM Spectrum Scale, the parallel file system formerly known as the General Parallel File System (GPFS), has a development history spanning two decades and more than 100 contributing developers. Supporting strict POSIX semantics across more than 10,000 clients leads to a complex design with intricate interactions among cluster nodes. Tracing has proven to be a vital tool for understanding the behavior and anomalies of such a complex software product. However, the relevant trace information is often buried in hundreds of gigabytes of by-product trace records. Further, the overhead of tracing can significantly impact running applications and file system performance, limiting its use in production systems. In this research article, we discuss the evolution of the mature, highly scalable GPFS tracing tool and present an exploratory study of GPFS's new tracing interface, FlexTrace, which lets developers and users specify precisely what to trace for the problem they are trying to solve. We evaluate our methodology and prototype, demonstrating that the proposed approach has negligible overhead, even under intensive I/O workloads and with low-latency storage devices.
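To make the idea of a selective tracing interface concrete, the sketch below shows how a predicate-filtered tracepoint can keep overhead low: a cheap in-process check decides whether an event matches the user's trace specification before any record is formatted or written, so filtered-out events cost only a comparison. This is a minimal illustration of the general technique, not the actual FlexTrace interface; all names here (trace_spec, should_trace, TRACE_IF) are hypothetical.

    /* Sketch of a predicate-filtered tracepoint in C. Names are
     * illustrative only and do not reproduce the FlexTrace syntax. */
    #include <stdio.h>
    #include <string.h>

    /* A user-supplied specification of what to trace. */
    struct trace_spec {
        const char *subsystem;   /* e.g. "io" or "token" */
        int         min_level;   /* drop records below this verbosity */
    };

    static struct trace_spec active_spec = { "io", 2 };

    /* Cheap predicate evaluated before a record is formatted, so that
     * suppressed events cost only a comparison, not an I/O operation. */
    static int should_trace(const char *subsystem, int level)
    {
        return level >= active_spec.min_level &&
               strcmp(subsystem, active_spec.subsystem) == 0;
    }

    #define TRACE_IF(subsys, level, fmt, ...)                     \
        do {                                                      \
            if (should_trace(subsys, level))                      \
                fprintf(stderr, "[%s:%d] " fmt "\n",              \
                        subsys, level, __VA_ARGS__);              \
        } while (0)

    int main(void)
    {
        /* Emitted: matches the active specification. */
        TRACE_IF("io", 2, "read of %u bytes completed", 4096u);
        /* Suppressed: wrong subsystem, so only the predicate runs. */
        TRACE_IF("token", 3, "token revoked for node %d", 17);
        return 0;
    }

The design point this sketch captures is that filtering at the emission site, rather than post-processing hundreds of gigabytes of records, is what makes tracing viable on production systems with fast storage.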
