Parallel Fully Vectorized Marsa-LFIB4: Algorithmic and Language-Based Optimization of Recursive Computations

This paper presents a new high-performance implementation of Marsa-LFIB4, a high-quality multiple recursive pseudorandom number generator. We propose an algorithmic approach that combines language-based vectorization techniques with a new divide-and-conquer method, which exploits the special sparse structure of the matrix obtained from the recursive formula that defines the generator. We also show how intrinsics for the Intel AVX2 and AVX512 vector extensions can improve performance. The new implementation achieves good performance on several multicore architectures and is much more energy-efficient than simple SIMD-optimized implementations.
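As a rough illustration of the recursive formula mentioned above, the following Python sketch implements the LFIB4 recurrence as it is commonly stated from Marsaglia's description, x_n = (x_{n-55} + x_{n-119} + x_{n-179} + x_{n-256}) mod 2^32. The lags are taken from that description and not from this abstract. The helper `demo_seed`, the function names, and the `width` parameter are hypothetical, introduced only for this example. The blocked variant shows just the basic observation that up to 55 consecutive values depend only on earlier ones and can therefore be computed independently. It is not the divide-and-conquer method of the paper, nor its AVX2/AVX512 code.

```python
MASK = 0xFFFFFFFF  # keeps results modulo 2**32


def demo_seed(start=12345):
    """Fill a 256-word table with a simple LCG (assumed here for illustration only)."""
    table, value = [], start
    for _ in range(256):
        value = (69069 * value + 1234567) & MASK
        table.append(value)
    return table


def lfib4_scalar(seed, count):
    """Generate `count` values one at a time from a 256-word seed table."""
    if len(seed) != 256:
        raise ValueError("seed table must hold exactly 256 words")
    x = list(seed)
    for n in range(256, 256 + count):
        x.append((x[n - 55] + x[n - 119] + x[n - 179] + x[n - 256]) & MASK)
    return x[256:]


def lfib4_blocked(seed, count, width=8):
    """Generate the same sequence in blocks of `width` independent lanes."""
    if len(seed) != 256:
        raise ValueError("seed table must hold exactly 256 words")
    if not 1 <= width <= 55:
        raise ValueError("width must not exceed the smallest lag (55)")
    x = list(seed)
    n, end = 256, 256 + count
    while n < end:
        w = min(width, end - n)
        # Each lane j reads only indices below n, so the lanes do not depend
        # on each other and could map onto the elements of one SIMD register.
        block = [
            (x[n + j - 55] + x[n + j - 119] + x[n + j - 179] + x[n + j - 256]) & MASK
            for j in range(w)
        ]
        x.extend(block)
        n += w
    return x[256:]


if __name__ == "__main__":
    seed = demo_seed()
    assert lfib4_scalar(seed, 1000) == lfib4_blocked(seed, 1000, width=8)
```

A `width` of 8 or 16 corresponds to the number of 32-bit lanes in a 256-bit or 512-bit register, which is why the smallest lag of 55 leaves room for straightforward SIMD processing of this recurrence.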
