Diagnosing NFS Errors: Preliminary Findings from a Syslog Analysis of Bridges

Bridges is the current main system at the Pittsburgh Supercomputing Center. Given the system's complexity and heavy use, it is a good environment for exploring how machine learning techniques can help in studying sub-optimal performance. This short report describes the preliminary, ongoing work of a new graduate student in this area. Our initial focus has been on learning to predict the occurrence of NFS timeout errors from the syslog messages that precede them.
