Exploring void search for fault detection on extreme scale systems

Mean Time Between Failures (MTBF), now calculated in days or hours, is expected to drop to minutes on exascale machines. The advancement of resilience technologies greatly depends on a deeper understanding of faults arising from hardware and software components. This understanding has the potential to help us build better fault tolerance technologies. For instance, it has been proved that combining checkpointing and failure prediction leads to longer checkpoint intervals, which in turn leads to fewer total checkpoints. In this paper we present a new approach for fault detection based on the Void Search (VS) algorithm. VS is used primarily in astrophysics for finding areas of space that have a very low density of galaxies. We evaluate our algorithm using real environmental logs from Mira Blue Gene/Q supercomputer at Argonne National Laboratory. Our experiments show that our approach can detect almost all faults (i.e., sensitivity close to 1) with a low false positive rate (i.e., specificity values above 0.7). We also compare our algorithm with a number of existing detection algorithms, and find that ours outperforms all of them.

[1]  Message Passing Interface Forum MPI: A message - passing interface standard , 1994 .

[2]  Rajeev Thakur,et al.  A study of dynamic meta-learning for failure prediction in large-scale systems , 2010, J. Parallel Distributed Comput..

[3]  R. Weygaert,et al.  A cosmic watershed: the WVF void detection technique , 2007, 0706.2788.

[4]  Franck Cappello,et al.  Improving the Computing Efficiency of HPC Systems Using a Combination of Proactive and Preventive Checkpointing , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.

[5]  Eric R. Ziegel,et al.  The Elements of Statistical Learning , 2003, Technometrics.

[6]  Franck Cappello,et al.  FTI: High performance Fault Tolerance Interface for hybrid systems , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[7]  Bianca Schroeder,et al.  Understanding failures in petascale computers , 2007 .

[8]  J. A. Hartigan,et al.  A k-means clustering algorithm , 1979 .

[9]  Salvatore J. Stolfo,et al.  A Geometric Framework for Unsupervised Anomaly Detection , 2002, Applications of Data Mining in Computer Security.

[10]  Bronis R. de Supinski,et al.  MCREngine: A scalable checkpointing system using data-aware aggregation and compression , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.

[11]  Franck Cappello,et al.  Fault prediction under the microscope: A closer look into HPC systems , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.

[12]  Yuan Xie,et al.  Leveraging 3D PCRAM technologies to reduce checkpoint overhead for future exascale systems , 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.

[13]  Michael Gschwind,et al.  The IBM Blue Gene/Q Compute Chip , 2012, IEEE Micro.

[14]  Anand Sivasubramaniam,et al.  Failure Prediction in IBM BlueGene/L Event Logs , 2007, Seventh IEEE International Conference on Data Mining (ICDM 2007).

[15]  V. Icke,et al.  Fragmenting the universe , 1987 .

[16]  Simon J. Godsill,et al.  Detection of abrupt spectral changes using support vector machines an application to audio signal segmentation , 2002, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[17]  Bronis R. de Supinski,et al.  Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.

[18]  Ibrahim M. El-Henawy,et al.  VISUALIZE NETWORK ANOMALY DETECTION BY USING K-MEANS CLUSTERING ALGORITHM , 2013 .

[19]  Zhiling Lan,et al.  Toward Automated Anomaly Identification in Large-Scale Systems , 2010, IEEE Transactions on Parallel and Distributed Systems.

[20]  Basil S. Maglaris,et al.  Towards multisensor data fusion for DoS detection , 2004, SAC '04.

[21]  Andrew W. Moore,et al.  Bayesian Network Anomaly Pattern Detection for Disease Outbreaks , 2003, ICML.

[22]  Zhiling Lan,et al.  Filtering log data: Finding the needles in the Haystack , 2012, IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012).

[23]  Yves Robert,et al.  Checkpointing algorithms and fault prediction , 2014, J. Parallel Distributed Comput..

[24]  Ian H. Witten,et al.  The WEKA data mining software: an update , 2009, SKDD.

[25]  K. Thangavel,et al.  Semi-supervised k-means clustering for outlier detection in mammogram classification , 2010, Trendz in Information Sciences & Computing(TISC2010).

[26]  Anand Sivasubramaniam,et al.  Critical event prediction for proactive management in large-scale computer clusters , 2003, KDD '03.

[27]  Ibm Redbooks IBM System Blue Gene Solution: Blue Gene/Q System Administration , 2012 .

[28]  Ron Brightwell,et al.  On the Viability of Compression for Reducing the Overheads of Checkpoint/Restart-Based Fault Tolerance , 2012, 2012 41st International Conference on Parallel Processing.

[29]  Christian Engelmann,et al.  Proactive Fault Tolerance Using Preemptive Migration , 2009, 2009 17th Euromicro International Conference on Parallel, Distributed and Network-based Processing.

[30]  Carl E. Landwehr,et al.  Basic concepts and taxonomy of dependable and secure computing , 2004, IEEE Transactions on Dependable and Secure Computing.

[31]  Wenjie Hu,et al.  Robust Anomaly Detection Using Support Vector Machines , 2003 .

[32]  Lionel Sacks,et al.  Active Platform Security through Intrusion Detection Using Naïve Bayesian Network for Anomaly Detection , 2002 .

[33]  M. Neyrinck zobov: a parameter-free void-finding algorithm , 2007, 0712.3049.

[34]  H. Hotelling Analysis of a complex of statistical variables into principal components. , 1933 .

[35]  Franck Cappello,et al.  Addressing failures in exascale computing , 2014, Int. J. High Perform. Comput. Appl..

[36]  P. Mähönen,et al.  A Simple Void-searching Algorithm , 1998 .

[37]  Zhiling Lan,et al.  A practical failure prediction with location and lead time for Blue Gene/P , 2010, 2010 International Conference on Dependable Systems and Networks Workshops (DSN-W).

[38]  Klaus Nordhausen,et al.  The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition by Trevor Hastie, Robert Tibshirani, Jerome Friedman , 2009 .

[39]  F. Hoyle,et al.  Voids in the Two-Degree Field Galaxy Redshift Survey , 2003, astro-ph/0312533.

[40]  Sushil Jajodia,et al.  Detecting Novel Network Intrusions Using Bayes Estimators , 2001, SDM.

[41]  V. Rao Vemuri,et al.  Robust Support Vector Machines for Anomaly Detection in Computer Security , 2003, ICMLA.

[42]  Yves Robert,et al.  Impact of fault prediction on checkpointing strategies , 2012, ArXiv.

[43]  VARUN CHANDOLA,et al.  Anomaly detection: A survey , 2009, CSUR.

[44]  Michael Schatz,et al.  Learning Program Behavior Profiles for Intrusion Detection , 1999, Workshop on Intrusion Detection and Network Monitoring.