Making Problem Diagnosis Work for Large-Scale, Production Storage Systems

Intrepid has a very large production GPFS storage system consisting of 128 file servers, 32 storage controllers, 1,152 disk arrays, and 11,520 total disks. In such a large system, performance problems are both inevitable and difficult to troubleshoot. We present our experiences of taking an automated problem-diagnosis approach from proof-of-concept on a 12-server test-bench parallel-file-system cluster and making it work on Intrepid's storage system. We also present a 15-month case study of problems observed from the analysis of 624 GB of Intrepid's instrumentation data, in which we diagnose a variety of performance-related storage-system problems in a matter of hours, compared to the days or longer required by manual approaches.
