Paper Information - Making Problem Diagnosis Work for Large-Scale, Production Storage Systems

Intrepid has a very large production GPFS storage system consisting of 128 file servers, 32 storage controllers, 1,152 disk arrays, and 11,520 total disks. In such a large system, performance problems are both inevitable and difficult to troubleshoot. We present our experiences of taking an automated problem diagnosis approach from proof-of-concept on a 12-server test-bench parallel-file-system cluster and making it work on Intrepid's storage system. We also present a 15-month case study of problems observed from the analysis of 624 GB of Intrepid's instrumentation data, in which we diagnose a variety of performance-related storage-system problems in a matter of hours, as compared to the days or longer required with manual approaches.

Michael P. Kasick | Kevin Harms | P. Narasimhan
