SLEEF: A Portable Vectorized Library of C Standard Mathematical Functions

In this article, we present the techniques used to implement our portable vectorized library of C standard mathematical functions, which is written entirely in C. To make the library portable while maintaining good performance, the intrinsic functions of each vector extension are abstracted by inline functions or preprocessor macros. We implemented the functions so that they can exploit sub-features of vector extensions such as fused multiply-add, mask registers, and mantissa extraction. To make computation with SIMD instructions efficient, the library uses only a small number of conditional branches, and all computation paths are vectorized. We devised a variation of the Payne-Hanek argument reduction for trigonometric functions, as well as a floating-point remainder, both of which are suitable for vector computation. We compare the performance of our library with that of Intel SVML.
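The abstraction and branch-free style described above can be illustrated with a small C sketch. Everything in it is hypothetical and is not SLEEF's actual API: the `vdouble`/`vmask` types, the helper names (`vload`, `vmla`, `vlt`, `vsel`, `vgetmant`, `vgetexp`), and the toy kernel `vpiecewise` are invented for illustration. Each helper is a plain per-lane loop standing in for what would be a single intrinsic (wrapped by an inline function or macro) on a real vector extension. The kernel evaluates both sides of a piecewise function and selects per lane with a mask instead of branching.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical abstraction layer (not SLEEF's real API). A "vector" is
 * VLEN doubles in a struct, and every helper is a per-lane loop. In a real
 * port, each helper body would be one intrinsic of the target extension,
 * selected by inline functions or preprocessor macros. */
#define VLEN 4

typedef struct { double v[VLEN]; } vdouble;
typedef struct { int64_t m[VLEN]; } vmask; /* per lane: all ones or zero */

static inline vdouble vload(const double *p) {
  vdouble r;
  memcpy(r.v, p, sizeof r.v);
  return r;
}

static inline void vstore(double *p, vdouble x) { memcpy(p, x.v, sizeof x.v); }

static inline vdouble vbroadcast(double d) {
  vdouble r;
  for (int i = 0; i < VLEN; i++) r.v[i] = d;
  return r;
}

static inline vdouble vneg(vdouble x) {
  for (int i = 0; i < VLEN; i++) x.v[i] = -x.v[i];
  return x;
}

/* x * y + z. This portable fallback may round twice; on a target with FMA
 * the helper would wrap a fused multiply-add intrinsic that rounds once. */
static inline vdouble vmla(vdouble x, vdouble y, vdouble z) {
  vdouble r;
  for (int i = 0; i < VLEN; i++) r.v[i] = x.v[i] * y.v[i] + z.v[i];
  return r;
}

/* Lane-wise x < y as a mask (all ones where true) rather than a branch. */
static inline vmask vlt(vdouble x, vdouble y) {
  vmask r;
  for (int i = 0; i < VLEN; i++) r.m[i] = -(int64_t)(x.v[i] < y.v[i]);
  return r;
}

/* Bitwise blend: take x where the mask is set, y elsewhere. */
static inline vdouble vsel(vmask m, vdouble x, vdouble y) {
  vdouble r;
  for (int i = 0; i < VLEN; i++) {
    uint64_t bx, by, bm = (uint64_t)m.m[i];
    memcpy(&bx, &x.v[i], sizeof bx);
    memcpy(&by, &y.v[i], sizeof by);
    uint64_t br = (bx & bm) | (by & ~bm);
    memcpy(&r.v[i], &br, sizeof br);
  }
  return r;
}

/* Mantissa extraction: x = m * 2^e with 1 <= |m| < 2, obtained by
 * rewriting the exponent field. Valid only for normal, finite, nonzero x. */
static inline vdouble vgetmant(vdouble x) {
  for (int i = 0; i < VLEN; i++) {
    uint64_t b;
    memcpy(&b, &x.v[i], sizeof b);
    b = (b & UINT64_C(0x800FFFFFFFFFFFFF)) | UINT64_C(0x3FF0000000000000);
    memcpy(&x.v[i], &b, sizeof b);
  }
  return x;
}

/* Unbiased exponent e of x, returned as a double (same validity range). */
static inline vdouble vgetexp(vdouble x) {
  vdouble r;
  for (int i = 0; i < VLEN; i++) {
    uint64_t b;
    memcpy(&b, &x.v[i], sizeof b);
    r.v[i] = (double)((int)((b >> 52) & 0x7FF) - 1023);
  }
  return r;
}

/* Hypothetical kernel: f(x) = -x for x < 0, x*x + 1 otherwise. Both paths
 * are computed for every lane, and the mask picks the result per lane. */
static inline vdouble vpiecewise(vdouble x) {
  vdouble zero = vbroadcast(0.0), one = vbroadcast(1.0);
  return vsel(vlt(x, zero), vneg(x), vmla(x, x, one));
}
```

For example, loading {-2.0, 0.0, 3.0, -0.5} and passing it through `vpiecewise` yields {2.0, 1.0, 10.0, 0.5}, and `vgetmant`/`vgetexp` split 6.0 into 1.5 and 2. In this design only the helper bodies touch the hardware, so moving a kernel like `vpiecewise` to another vector extension would mean rewriting the helpers, not the kernel.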
