Paper information - Highly-reliable integer matrix multiplication via numerical packing

Highly-reliable integer matrix multiplication via numerical packing

The generic matrix multiply (GEMM) routine comprises the compute and memory-intensive part of many information retrieval, relevance ranking and object recognition systems. Because of the prevalence of GEMM in these applications, ensuring its robustness to transient hardware faults is of paramount importance for highly-efficient/highly-reliable systems. This is currently accomplished via error control coding (ECC) or via dual modular redundancy (DMR) approaches that produce a separate set of “parity” results to allow for fault detection in GEMM. We introduce a third family of methods for fault detection in integer matrix products based on the concept of numerical packing. The key difference of the new approach against ECC and DMR approaches is the production of redundant results within the numerical representation of the inputs rather than as a separate set of parity results. In this way, high reliability is ensured within integer matrix products while allowing for: (i) in-place storage; (ii) usage of any off-the-shelf 64-bit floating-point GEMM routine; (iii) computational overhead that is independent of the GEMM inner dimension. The only detriment against a conventional (i.e. fault-intolerant) integer matrix multiplication based on 32-bit floating-point GEMM is the sacrifice of approximately 30.6% of the bitwidth of the numerical representation. However, unlike ECC methods that can reliably detect only up to a few faults per GEMM computation (typically two), the proposed method attains more than “12 nines” reliability, i.e. it will only fail to detect 1 fault out of more than 1 trillion arbitrary faults in the GEMM operations. As such, it achieves reliability that approaches that of DMR, at a very small fraction of its cost.
Specifically, a single-threaded software realization of our proposal on an Intel i7-3632QM 2.2 GHz processor (Ivy Bridge architecture with AVX support) incurs, on average, only a 19% increase in execution time against an optimized, fault-intolerant, 32-bit GEMM routine over a range of matrix sizes, and it remains more than 80% more efficient than a DMR-based GEMM.
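
The abstract does not spell out the packing scheme itself, so the following is only a minimal sketch of the general idea of carrying a redundant result inside the numerical representation, under stated assumptions. It uses a simplified duplicate-packing rule chosen for illustration, not the authors' actual method: each integer entry `a` of one input is stored as `a + SHIFT * a` in a 64-bit float, so every output of an ordinary double-precision product holds two copies of the integer result that can be compared. The names `SHIFT`, `gemm`, `pack` and `unpack` are hypothetical, and the plain-Python `gemm` merely stands in for an off-the-shelf 64-bit floating-point GEMM routine.

```python
SHIFT = 2 ** 26  # hypothetical packing factor; every result r must satisfy |r| < SHIFT / 2


def gemm(A, B):
    """Stand-in for an off-the-shelf double-precision GEMM (Python floats are 64-bit)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def pack(A):
    """Store each integer a as a + SHIFT * a, i.e. two copies inside one double."""
    return [[float(a) * (SHIFT + 1) for a in row] for row in A]


def unpack(C_packed):
    """Split each packed output into its low and high copies and compare them."""
    C, fault_free = [], True
    for row in C_packed:
        out = []
        for c in row:
            c = int(c)
            low = (c + SHIFT // 2) % SHIFT - SHIFT // 2  # signed low copy
            high = (c - low) // SHIFT                    # high copy
            fault_free = fault_free and (low == high)
            out.append(low)
        C.append(out)
    return C, fault_free


A = [[1, 2], [3, 4]]
B = [[5.0, 6.0], [7.0, 8.0]]

C_packed = gemm(pack(A), B)       # one ordinary GEMM call on packed inputs
C, ok = unpack(C_packed)          # integer product and fault flag

C_packed[0][0] += 1.0             # emulate a transient fault in one output
_, ok_after_fault = unpack(C_packed)
```

In this sketch the redundant copy lives in the same array as the result, and the packing and checking steps touch each input and output element once, so their cost does not grow with the inner dimension. The price is dynamic range: the results must fit in roughly half of the available significand bits, which is in the same spirit as, but not numerically equal to, the bitwidth sacrifice quoted in the abstract.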

Fabio Verdicchio | Yiannis Andreopoulos | Davide Anastasia | Ijeoma Anarado | Mohammad Ashraful Anam
