Towards high performance stochastic arithmetic

Because floating-point numbers have a finite representation in computers, the results of arithmetic operations must be rounded. The CADNA library [1], based on discrete stochastic arithmetic [2], can be used to estimate the propagation of rounding errors in scientific codes. By synchronously computing each operation three times with a randomly chosen rounding mode, CADNA estimates the number of exact significant digits of the result with 95% confidence. To ensure the validity of the method and allow a better analysis of the program, several types of anomalies are checked at execution time. However, the computation time can increase by a factor of up to 80, depending on the program and on the level of anomaly detection [3]. Two main factors explain this overhead: the cost of anomaly detection and the cost of stochastic operations. Firstly, the detection of cancellations (sudden losses of accuracy in a single operation) is based on computing the number of exact significant digits, which relies on a logarithm evaluation. This mathematical function is much more costly than floating-point arithmetic operations. Secondly, the stochastic operators are currently implemented by overloading the arithmetic operators and changing the rounding mode of the FPU (Floating-Point Unit). This method makes vectorization impossible, as each vector lane would need a different rounding mode. Moreover, operator overloading incurs function-call overhead, and each change of rounding mode flushes the FPU pipelines. The performance drop is therefore even greater for HPC applications, which rely on SIMD (Single Instruction, Multiple Data) processing and on pipeline filling for efficiency. To bypass these overheads and allow the use of vector instructions for SIMD parallelism, we propose several improvements to the CADNA library.
Since only the integer part of the number of exact significant digits is required, we can use the exponent of a floating-point value to approximate the logarithm, which removes the logarithm function call. To avoid the cost of function calls, we propose to inline the stochastic operators. Finally, rather than depending on the rounding modes of the FPU, we compute the randomly rounded arithmetic operations by manipulating the sign bits of the operands with masks. These contributions provide a speedup factor of up to 2.5 on scalar code. They also enable the use of CADNA with vectorized code: SIMD performance results on high-end CPUs and on an Intel Xeon Phi are presented.
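The two low-level ideas above can be illustrated with a minimal C sketch. It is not CADNA's actual code: all function names are hypothetical, and the sketch assumes IEEE 754 binary64 doubles restricted to finite, normal, nonzero values. The first part estimates the integer part of a base-10 logarithm from the binary exponent alone. The second part relies on the identity RD(a + b) = -RU((-a) + (-b)), where RD and RU denote rounding toward minus and plus infinity. It further assumes that the caller has switched the FPU once to upward rounding, for instance with `fesetround(FE_UPWARD)` from `<fenv.h>`; the abstract does not spell out this detail.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: unbiased binary exponent of a finite, normal,
   nonzero double, read from bits 52..62 of its IEEE 754 encoding. */
static int binary_exponent(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    return (int)((bits >> 52) & 0x7FF) - 1023;
}

/* Hypothetical helper: estimate floor(log10(|x|)) from the exponent e
   alone, using log10(|x|) ~ e * log10(2). The estimate never exceeds
   the exact value and is at most one below it. */
static int approx_floor_log10(double x)
{
    double t = binary_exponent(x) * 0.30102999566398120; /* log10(2) */
    int n = (int)t;   /* conversion truncates toward zero */
    if (t < n)
        n--;          /* turn truncation into floor for negative t */
    return n;
}

/* XOR a bit mask into the encoding of x; with the sign-bit mask this
   negates x, with a zero mask it leaves x unchanged. */
static double flip_sign(double x, uint64_t mask)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits ^= mask;
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* Addition rounded up (round_down == 0) or down (round_down != 0).
   ASSUMPTION: the FPU is already in round-toward-plus-infinity mode.
   Rounding down is obtained as -((-a) + (-b)) rounded up, so the
   rounding mode is never changed here. The volatile qualifiers keep
   the compiler from folding the addition at compile time. */
static double add_directed(double a, double b, int round_down)
{
    uint64_t mask = round_down ? UINT64_C(0x8000000000000000) : 0;
    volatile double na = flip_sign(a, mask);
    volatile double nb = flip_sign(b, mask);
    volatile double s = na + nb;
    return flip_sign(s, mask);
}

/* Randomly rounded addition: the direction is drawn from rand(). */
static double add_random(double a, double b)
{
    return add_directed(a, b, rand() & 1);
}
```

Because the rounding direction is carried by a data mask rather than by FPU state, the same pattern could in principle be applied lane by lane with vector XOR instructions, each lane receiving its own mask. When a sum is exactly representable, both directions return the same value.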

[1] Jean Vignes, et al. Discrete Stochastic Arithmetic for Validating Results of Numerical Software, 2004, Numerical Algorithms.

[2] Olena Chubach, et al. Parallelization of discrete stochastic arithmetic on multicore architectures, 2013, 2013 10th International Conference on Information Technology: New Generations.