Generalized Numerical Entanglement for Reliable Linear, Sesquilinear and Bijective Operations on Integer Data Streams

We propose a new technique for the mitigation of fail-stop failures and/or silent data corruptions (SDCs) within linear, sesquilinear or bijective (LSB) operations on $M$ integer data streams ($M\geq 3$). In the proposed approach, the $M$ input streams are linearly superimposed to form $M$ numerically entangled integer data streams that are stored in place of the original inputs, i.e., no additional (a.k.a. "checksum") streams are used. An arbitrary number of LSB operations can then be performed in $M$ processing cores using these entangled data streams. The output results can be extracted from any $M-K$ entangled output streams by additions and arithmetic shifts, thereby mitigating $K$ fail-stop failures ($K\leq \left\lfloor \frac{M-1}{2}\right\rfloor$), or detecting up to $K$ SDCs per $M$-tuple of outputs at corresponding in-stream locations. Therefore, unlike other methods, the number of operations required for the entanglement, extraction and recovery of the results is linearly related to the number of inputs and does not depend on the complexity of the performed LSB operations. Our proposal is validated within an Amazon EC2 instance (Haswell architecture with AVX2 support) via integer matrix product operations. Our analysis and experiments for fail-stop failure mitigation and SDC detection reveal that the proposed approach incurs a 0.75 to 37.23 percent reduction in processing throughput in comparison to the equivalent error-intolerant processing. This overhead is found to be up to two orders of magnitude smaller than that of the equivalent checksum-based method, with the gains increasing as the complexity of the performed LSB operations increases. Therefore, our proposal can be used in distributed systems, unreliable multicore clusters and safety-critical applications, where robustness against failures and SDCs is a necessity.
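
The entangle, process and extract cycle described above can be illustrated with a minimal Python sketch for $M=3$ and $K=1$. It is a simplified construction written for illustration and is not claimed to be the exact entanglement used in the paper: each entangled stream is assumed to hold one input plus the cyclically previous input shifted left by `L` bits, and every output is assumed to satisfy $|d| < 2^{L-1}$. The names `L`, `entangle`, `dot`, `split` and `recover` are hypothetical, and a dot product stands in for a generic linear operation.

```python
L = 20  # assumed shift; every output d must satisfy |d| < 2**(L - 1)


def entangle(c):
    """Superimpose M integer streams; stream i becomes c[i] + (c[i-1] << L)."""
    m = len(c)
    return [[x + (y << L) for x, y in zip(c[i], c[(i - 1) % m])]
            for i in range(m)]


def dot(stream, g):
    """A linear operation (inner product with an integer kernel g)."""
    return sum(a * b for a, b in zip(stream, g))


def split(delta):
    """Split an entangled output into its signed low part and its high part."""
    half = 1 << (L - 1)
    low = ((delta + half) & ((1 << L) - 1)) - half  # signed low L bits
    high = (delta - low) >> L                       # arithmetic shift
    return low, high


def recover(surviving, m=3):
    """Rebuild all m outputs from a dict {core index: entangled output}."""
    d = [None] * m
    for i, delta in surviving.items():
        d[i], d[(i - 1) % m] = split(delta)
    return d


c = [[1, -2, 3], [4, 5, -6], [7, 8, 9]]  # three input streams
g = [2, -1, 3]                           # integer kernel
expected = [dot(s, g) for s in c]        # error-intolerant reference result
delta = [dot(e, g) for e in entangle(c)] # one entangled output per core

for failed in range(3):                  # simulate one fail-stop failure
    alive = {i: delta[i] for i in range(3) if i != failed}
    assert recover(alive) == expected
```

Because the operation is linear, each entangled output equals one true output plus a shifted copy of another, so any two of the three cores carry all three results. When all three cores respond, every output is available twice, and a mismatch between the two copies would flag a silent data corruption.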
