Memory aware load balance strategy on a parallel branch‐and‐bound application

Recent trends in high performance computing show a growing demand for using large‐scale multicore systems efficiently, so that highly compute‐intensive applications run well. However, poor utilization of the memory hierarchy can limit how much of the parallelism available in each multicore component is actually exploited. Multicore architectures combine features already observed in shared memory and distributed environments; for example, different subsets of cores can share different subsets of memory. To achieve high performance, an application must be carefully allocated to the available cores, guided by a scheduling specification that considers not only processor characteristics but also memory contention. This paper proposes a multicore cluster representation that captures performance characteristics relevant to multicore systems, such as the influence of the memory hierarchy and of contention on application performance. A branch‐and‐bound application for the set partitioning problem achieved improved performance by incorporating a memory aware load balancing strategy based on the proposed representation. An in‐depth analysis of the application's execution showed that the approach is applicable to modern systems. Copyright © 2014 John Wiley & Sons, Ltd.
