The Pheet Task-Scheduling Framework on the Intel® Xeon Phi Coprocessor and other Multicore Architectures

Pheet is a task-scheduling framework that allows easy customization of its internal data structures and serves as a research vehicle for experimenting with high-level application support and low-level architectural support for task-parallel programming models. Pheet is highly configurable: it allows comparison between different implementations of the data structures used in a scheduler, as well as between entirely different schedulers (typically based on work-stealing). Pheet is also being used to investigate high-level task-parallel support mechanisms that allow applications to influence scheduling decisions and behavior. One such mechanism, which we use in this work, is scheduling strategies. Previous Pheet benchmarking was done on conventional multicore architectures from AMD and Intel. In this paper we discuss the performance of Pheet on a prototype Intel Xeon Phi coprocessor with 61 cores and compare the results to Pheet on three conventional multicore architectures. Using two benchmarks from the mostly non-numerical/combinatorial Pheet suite, we find that the Xeon Phi coprocessor provides considerably better scalability than the other architectures, achieving more than a 70x speedup on the 61-core Knights Corner prototype system when using 4-way SMT, although it does not reach the same absolute performance. For our research, the Xeon Phi coprocessor is thus an interesting architecture for implementing and evaluating fine-grained task-parallel algorithm implementations.
