Scaling performance of interior-point method on large-scale chip multiprocessor system

In this paper we describe a parallelization of the interior-point method (IPM) aimed at achieving high scalability on large-scale chip multiprocessors (CMPs). IPM is an important computational technique used to solve optimization problems in many areas of science, engineering, and finance. IPM spends most of its computation time in a few sparse linear algebra kernels. While each of these kernels contains a large amount of parallelism, the sparse, irregular datasets seen in many optimization problems make that parallelism difficult to exploit. As a result, prior work has reported relatively low speedups of only 4X-12X on medium- to large-scale parallel machines. This paper proposes and evaluates several algorithmic and hardware features to improve IPM parallel performance on large-scale CMPs. Through detailed simulations, we demonstrate how exploiting multiple levels of parallelism, combined with hardware support for low-overhead task queues and parallel reduction, enables IPM to achieve up to a 48X parallel speedup on a 64-core CMP.
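To make the two mechanisms named in the abstract concrete, the sketch below shows a software analogue of a low-overhead task queue (an atomic counter handing out fixed-size chunks of rows) and a parallel reduction (per-thread partial accumulators merged at the end), applied to a sparse matrix-transpose-vector product, one of the kinds of sparse kernels that dominate IPM time. This is a minimal illustration, not the paper's implementation: all identifiers (CsrMatrix, spmv_transpose_parallel, the chunk size) are hypothetical, and the hardware features the paper evaluates are emulated here with ordinary C++ threads and atomics.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

struct CsrMatrix {                     // compressed sparse row storage
    std::size_t rows = 0, cols = 0;
    std::vector<std::size_t> row_ptr;  // size rows + 1
    std::vector<std::size_t> col_idx;  // size nnz
    std::vector<double> val;           // size nnz
};

// Computes y = A^T * x. Rows of A are handed out in fixed-size chunks
// through an atomic counter (a software stand-in for a hardware task
// queue); each thread accumulates into a private copy of y, and the
// partial vectors are summed at the end (a software stand-in for a
// hardware-supported parallel reduction).
std::vector<double> spmv_transpose_parallel(const CsrMatrix& A,
                                            const std::vector<double>& x,
                                            unsigned num_threads,
                                            std::size_t chunk = 64) {
    std::atomic<std::size_t> next_row{0};
    std::vector<std::vector<double>> partial(
        num_threads, std::vector<double>(A.cols, 0.0));

    auto worker = [&](unsigned tid) {
        for (;;) {
            std::size_t begin = next_row.fetch_add(chunk);  // grab a task
            if (begin >= A.rows) break;
            std::size_t end = std::min(begin + chunk, A.rows);
            for (std::size_t r = begin; r < end; ++r)
                for (std::size_t k = A.row_ptr[r]; k < A.row_ptr[r + 1]; ++k)
                    partial[tid][A.col_idx[k]] += A.val[k] * x[r];
        }
    };

    std::vector<std::thread> pool;
    for (unsigned t = 0; t < num_threads; ++t) pool.emplace_back(worker, t);
    for (auto& th : pool) th.join();

    // Sequential combine for brevity; a hardware reduction (or a software
    // tree reduction) would merge the partial vectors in parallel.
    std::vector<double> y(A.cols, 0.0);
    for (const auto& p : partial)
        for (std::size_t j = 0; j < A.cols; ++j) y[j] += p[j];
    return y;
}
```

The chunked atomic counter keeps per-task dispatch cost to a single fetch_add, which is why hardware task-queue support matters most when tasks are fine-grained; the per-thread partial vectors avoid contended atomic updates to y at the price of extra memory, which is the trade-off a hardware parallel-reduction mechanism is meant to remove.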
