CommGuard: Mitigating Communication Errors in Error-Prone Parallel Execution

As semiconductor technology scales toward ever-smaller transistor sizes, hardware fault rates are increasing. Since important application classes (e.g., multimedia and streaming workloads) are data-error-tolerant, recent research has proposed techniques that seek to save energy or improve yield by exploiting error tolerance at the architecture/microarchitecture level. Even seemingly error-tolerant applications, however, will crash or hang due to control-flow or memory-addressing errors. In parallel computation, errors involving inter-thread communication can have equally catastrophic effects. Our work explores techniques that mitigate the impact of potentially catastrophic errors in parallel computation, while still garnering power, cost, or yield benefits from data error tolerance. Our proposed CommGuard solution uses FSM-based checkers to pad and discard data in order to maintain semantic alignment between program control flow and the data communicated between processors. CommGuard techniques are low-overhead and exploit application information already provided by some parallel programming languages (e.g., StreamIt). By converting potentially catastrophic communication errors into potentially tolerable data errors, CommGuard allows important streaming applications like JPEG and MP3 decoding to execute without crashing and to sustain good output quality, even for errors as frequent as every 500 μs.
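
The pad-and-discard idea can be illustrated with a small software sketch. This is not the CommGuard hardware design: the token format (`"hdr"` and `"data"` tuples), the function name `realign`, and the parameters `items_per_frame` and `pad_value` are all assumptions made for illustration. The sketch assumes the producer marks the start of each frame with a header token and that the consumer expects a fixed number of items per frame; a missing item is then padded and a surplus item is discarded, so a communication error becomes a data error within one frame rather than a lasting misalignment.

```python
from typing import Iterable, List, Optional, Tuple

# Hypothetical token format: ("hdr", frame_id) starts a frame,
# ("data", value) carries one item of that frame.
Token = Tuple[str, int]


def realign(tokens: Iterable[Token], items_per_frame: int,
            pad_value: int = 0) -> List[List[int]]:
    """Return frames of exactly items_per_frame items each.

    Short frames are padded with pad_value; surplus items and items
    that arrive before any header are discarded.
    """
    frames: List[List[int]] = []
    current: Optional[List[int]] = None  # None: waiting for a header

    def close(frame: List[int]) -> None:
        # Pad a short frame so the consumer always sees a full frame.
        frame.extend([pad_value] * (items_per_frame - len(frame)))
        frames.append(frame)

    for kind, value in tokens:
        if kind == "hdr":
            if current is not None:
                close(current)
            current = []
        elif current is None:
            continue  # discard data seen before the first header
        elif len(current) < items_per_frame:
            current.append(value)
        # else: frame already full, discard the surplus item

    if current is not None:
        close(current)
    return frames


stream = [("hdr", 0), ("data", 1), ("data", 2), ("data", 3),
          ("hdr", 1), ("data", 4),                  # two items lost
          ("hdr", 2), ("data", 5), ("data", 6), ("data", 7),
          ("data", 8)]                              # one item too many
result = realign(stream, items_per_frame=3)
print(result)
```

In this example the second frame is padded and the extra item in the third frame is dropped, so every frame handed to the consumer has the expected length and later frames stay aligned with the consumer's control flow.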
