Exploring the capabilities of support vector machines in detecting silent data corruptions

This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research under Award Number 66905, program manager Lucy Nowell. Pacific Northwest National Laboratory is operated by Battelle for DOE under Contract DE-AC05-76RL01830. This material is also based upon work supported by the National Science Foundation under Grant No. 1619253; by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, program manager Lucy Nowell, under contract number DE-AC02-06CH11357 (DOE Catalog project); and in part by European Union FEDER funds under contract TIN2015-65316-P.
