Soft error resilience in Big Data kernels through modular analysis

The shrinking processor feature and operating voltages of processor circuits are making them increasingly vulnerable to soft faults, which calls for fault resilience techniques at both the software and hardware levels under the big data context. To assist software developers in writing fault-resilient big data applications, we propose the tool ErrorSight, which helps them to focus their efforts on code regions and data structures that are most vulnerable to soft errors, understand how numerical errors propagate through the program, and apply fault resilience techniques effectively. ErrorSight achieves this through efficient generation of error profiles leveraging the predictive power of the Boosted Regression Tree model. We use four big data kernels to illustrate the modular analysis mechanism of ErrorSight and show its usefulness in the development of numerical fault-resilience in Big Data.

[1]  Eduardo Pinheiro,et al.  DRAM errors in the wild: a large-scale field study , 2009, SIGMETRICS '09.

[2]  Ravishankar K. Iyer,et al.  NFTAPE: a framework for assessing dependability in distributed systems with lightweight fault injectors , 2000, Proceedings IEEE International Computer Performance and Dependability Symposium. IPDS 2000.

[3]  Leo Breiman,et al.  Classification and Regression Trees , 1984 .

[4]  Brian W. Barrett,et al.  Introducing the Graph 500 , 2010 .

[5]  Jack J. Dongarra,et al.  High Performance Dense Linear System Solver with Resilience to Multiple Soft Errors , 2012, ICCS.

[6]  Jinsuk Chung,et al.  Containment domains: a scalable, efficient, and flexible resilience scheme for exascale systems , 2012, HiPC 2012.

[7]  Vikram S. Adve,et al.  LLVM: a compilation framework for lifelong program analysis & transformation , 2004, International Symposium on Code Generation and Optimization, 2004. CGO 2004..

[8]  Franck Cappello,et al.  Toward Exascale Resilience , 2009, Int. J. High Perform. Comput. Appl..

[9]  Yuqing Zhu,et al.  BigDataBench: A big data benchmark suite from internet services , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).

[10]  Todd M. Austin,et al.  DIVA: a reliable substrate for deep submicron microarchitecture design , 1999, MICRO-32. Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture.

[11]  Gonçalo N. P. Amador,et al.  Linear Solvers for Stable Fluids: GPU vs CPU , 2012 .

[12]  Jacob A. Abraham,et al.  Algorithm-Based Fault Tolerance for Matrix Operations , 1984, IEEE Transactions on Computers.

[13]  Leonidas J. Guibas,et al.  A metric for distributions with applications to image databases , 1998, Sixth International Conference on Computer Vision (IEEE Cat. No.98CH36271).

[14]  R.C. Baumann,et al.  Radiation-induced soft errors in advanced semiconductor technologies , 2005, IEEE Transactions on Device and Materials Reliability.

[15]  Takashi Nanya,et al.  The Byzantine hardware fault model , 1989, IEEE Trans. Comput. Aided Des. Integr. Circuits Syst..

[16]  Sarita V. Adve,et al.  Relyzer: exploiting application-level fault equivalence to analyze application resiliency to transient faults , 2012, ASPLOS XVII.

[17]  Frank Mueller,et al.  Evaluating the Impact of SDC on the GMRES Iterative Solver , 2013, 2014 IEEE 28th International Parallel and Distributed Processing Symposium.