On the Efficiency of Supernodal Factorization in Interior-Point Method Using CPU-GPU Collaboration

The primal-dual interior-point method (PDIPM) is the most efficient technique for solving sparse linear programming (LP) problems. Despite its efficiency, PDIPM remains a compute-intensive algorithm. Graphics processing units (GPUs) have the potential to meet this computational demand; however, their architecture entails a positive relationship between problem density and speedup, which in turn implies a limited affinity of GPUs for problem sparsity. To overcome this difficulty, the state-of-the-art hybrid (CPU-GPU) implementation of PDIPM exploits the presence of supernodes, groups of columns with similar nonzero structure that can be treated as dense submatrices, during factorization. The factorization method used in the state-of-the-art solver offloads only selected operations related to large supernodes to the GPU, an approach known to underutilize the GPU's computational power while increasing CPU-GPU communication overhead. These shortcomings encouraged us to adapt an alternative factorization method, which processes sets of related supernodes on the GPU, and to introduce it into the PDIPM implementation of a popular open-source solver. Our adaptation also enables the factorization method to better mitigate the round-off errors that accumulate over successive PDIPM iterations. To augment the performance gains, we further employed an efficient CPU-based matrix multiplication method. When tested on a set of well-known sparse problems, the adapted solver showed average speedups of approximately 55X, 1.14X, and 1.05X over the open-source solver's original version, the state-of-the-art solver, and the highly optimized proprietary solver CPLEX, respectively. These results strongly indicate that the proposed hybrid approach can yield significant performance gains when solving large sparse problems.
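
To make the supernodal idea concrete, the sketch below processes a single supernodal panel: a dense Cholesky factorization of the diagonal block, a triangular solve for the off-diagonal block, and the rank-k Schur update that dominates runtime and is the natural candidate for GPU offload. This is a minimal Python/NumPy illustration under stated assumptions, not the solvers' actual code; the GPU_OFFLOAD_THRESHOLD heuristic and the function names are assumptions introduced here for exposition.

```python
import numpy as np

# Hypothetical column-count threshold above which a supernode's dense
# panel would be offloaded to the GPU; real hybrid solvers use tuned,
# hardware-dependent heuristics.
GPU_OFFLOAD_THRESHOLD = 64

def factorize_supernode(panel):
    """Dense factorization of one supernodal panel.

    `panel` holds a supernode's columns restricted to their nonzero
    row structure: an n-by-n diagonal block stacked on the rows below it.
    Returns the diagonal-block factor L11 and off-diagonal block L21.
    """
    n = panel.shape[1]
    if n >= GPU_OFFLOAD_THRESHOLD:
        # A hybrid solver would route this panel (and the Schur update
        # below) to GPU BLAS kernels; this sketch keeps everything on CPU.
        pass
    L11 = np.linalg.cholesky(panel[:n, :n])        # POTRF on the diagonal block
    L21 = np.linalg.solve(L11, panel[n:, :n].T).T  # TRSM: L21 = A21 * L11^{-T}
    return L11, L21

def schur_update(L21):
    """Rank-k update (SYRK/GEMM) pushed to ancestor supernodes; this is
    the operation that benefits most from dense GPU kernels."""
    return L21 @ L21.T

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    k, m = 4, 10                     # a 4-column supernode in a 10x10 matrix
    A = rng.standard_normal((m, m))
    A = A @ A.T + m * np.eye(m)      # symmetric positive definite
    L11, L21 = factorize_supernode(A[:, :k])
    print("Schur update block shape:", schur_update(L21).shape)  # (6, 6)
```

Grouping several related supernodes and batching their panels before offload, as the adapted method does, amortizes the CPU-GPU transfer cost that a per-operation offload scheme pays repeatedly.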
