Reducing thread divergence in a GPU‐accelerated branch‐and‐bound algorithm

In this paper, we address the design and implementation of graphics processing unit (GPU)‐accelerated branch‐and‐bound (B&B) algorithms for solving flow‐shop scheduling problems (FSP). Such applications are CPU‐time consuming and highly irregular. GPUs, on the other hand, are massively multithreaded accelerators that execute according to the single instruction, multiple data (SIMD) model. A major issue that arises when executing a B&B applied to FSP on a GPU is thread, or branch, divergence. This divergence is caused by the lower bound function of FSP, which contains many irregular loops and conditional instructions. Our challenge is therefore to revisit the design and implementation of B&B applied to FSP so as to deal with thread divergence. Extensive experiments with the proposed approach have been carried out on well‐known FSP benchmarks using an Nvidia Tesla C2050 GPU card (http://www.nvidia.com/docs/IO/43395/NV_DS_Tesla_C2050_C2070_jul10_lores.pdf). Compared with a CPU‐based execution, accelerations of up to ×77.46 are achieved for large problem instances. Copyright © 2012 John Wiley & Sons, Ltd.
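To make the cost of divergence concrete, the following is a minimal toy model, not the implementation evaluated in this paper. It assumes a warp of 32 threads running in lockstep, which serializes both sides of a conditional whenever its threads disagree and which stays in a loop until its slowest thread finishes. The function names and the instruction costs are hypothetical and chosen only for illustration.

```python
WARP_SIZE = 32  # threads per warp on Nvidia GPUs


def warp_branch_cost(conditions, cost_true, cost_false):
    """Instruction slots one lockstep warp spends on a single if/else.

    Each path is paid for once if at least one thread takes it;
    threads on the other path are masked off and simply wait.
    """
    cost = 0
    if any(conditions):        # some thread takes the 'if' side
        cost += cost_true
    if not all(conditions):    # some thread takes the 'else' side
        cost += cost_false
    return cost


def warp_loop_cost(trip_counts, body_cost):
    """Instruction slots for a loop whose trip count differs per thread.

    The warp keeps iterating until its slowest thread is done.
    """
    return max(trip_counts) * body_cost


# All threads agree: only one path is executed.
uniform = [True] * WARP_SIZE
# Even and odd threads disagree: both paths are executed in turn.
mixed = [i % 2 == 0 for i in range(WARP_SIZE)]

print(warp_branch_cost(uniform, 10, 14))        # 10
print(warp_branch_cost(mixed, 10, 14))          # 24
print(warp_loop_cost([3] * WARP_SIZE, 5))       # 15
print(warp_loop_cost(list(range(1, 33)), 5))    # 160
```

Under this model a single disagreeing thread is enough to make the whole warp pay for both paths, which is why a lower bound function dominated by data-dependent conditionals and irregular loops maps poorly onto SIMD execution.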
