Experimental analysis of space-bounded schedulers

The running time of nested parallel programs on shared memory machines depends in significant part on how well the scheduler that maps the program onto the machine is tuned to the organization of its caches and processors. Recent work proposed ``space-bounded schedulers'' for scheduling such programs on the multi-level cache hierarchies of current machines. The main benefit of this class of schedulers is that they provably preserve the locality of the program at every level in the hierarchy, resulting (in theory) in fewer cache misses and better use of bandwidth than the popular work-stealing scheduler. On the other hand, compared to work-stealing, space-bounded schedulers are inferior at load balancing and may have greater scheduling overheads, raising the question of the relative effectiveness of the two classes of schedulers in practice. In this paper, we provide the first experimental study aimed at addressing this question. To facilitate the study, we built a flexible experimental framework with separate interfaces for programs and schedulers, enabling a head-to-head comparison of schedulers in terms of running times and cache miss counts across a range of benchmarks. (The framework is validated by comparison with the Intel\textregistered{} Cilk\texttrademark{} Plus work-stealing scheduler.) We present experimental results on a 32-core Xeon\textregistered{} 7560 comparing work-stealing, hierarchy-minded work-stealing, and two variants of space-bounded schedulers on both divide-and-conquer micro-benchmarks and popular algorithmic kernels. Our results indicate that space-bounded schedulers reduce the number of L3 cache misses relative to work-stealing schedulers by 25--65\% for most of the benchmarks, but incur up to 7\% additional scheduler and load-imbalance overhead. Only for memory-intensive benchmarks does the reduction in cache misses outweigh the added overhead, yielding up to a 25\% improvement in running time for synthetic benchmarks and about 20\% for algorithmic kernels. We also quantify the runtime improvement as the available bandwidth per core (the ``bandwidth gap'') varies, and show up to 50\% improvements in the running times of kernels as this gap increases 4-fold. As part of our study, we generalize prior definitions of space-bounded schedulers to allow for more practical variants (while still preserving their guarantees), and explore implementation tradeoffs.
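To make the separation of program and scheduler interfaces concrete, below is a minimal C++ sketch of what such a framework might look like. It is an illustrative assumption, not the actual API of the framework described above or of Cilk Plus: the names Task, Scheduler, SerialScheduler, spawn, run, and space_hint are hypothetical. The intent is only to show how a program written against an abstract scheduler interface can be run unchanged under different scheduling policies, with a per-task space annotation available to space-bounded policies.

\begin{verbatim}
// Illustrative sketch only; names and interfaces are assumptions.
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <deque>
#include <functional>
#include <vector>

// A task is a callable plus a space hint; a space-bounded policy would use
// the hint to anchor the task at the smallest cache level that fits it.
struct Task {
  std::function<void()> body;
  std::size_t space_hint;  // estimated working-set size in bytes
};

// The scheduler interface the program codes against. Work-stealing,
// hierarchy-aware work-stealing, and space-bounded policies would all
// implement this same interface, enabling head-to-head comparisons.
struct Scheduler {
  virtual void spawn(Task t) = 0;
  virtual void run() = 0;  // execute all outstanding tasks
  virtual ~Scheduler() = default;
};

// A trivial single-threaded stand-in for a real scheduling policy.
struct SerialScheduler : Scheduler {
  std::deque<Task> queue;
  void spawn(Task t) override { queue.push_back(std::move(t)); }
  void run() override {
    while (!queue.empty()) {
      Task t = std::move(queue.front());
      queue.pop_front();
      t.body();
    }
  }
};

// A "program": block-wise partial sums over an array, written only against
// the Scheduler interface so any policy can be plugged in.
long parallel_sum(Scheduler& sched, const std::vector<int>& a, std::size_t block) {
  std::vector<long> partial((a.size() + block - 1) / block, 0);
  for (std::size_t b = 0; b * block < a.size(); ++b) {
    sched.spawn({[&, b] {
                   std::size_t lo = b * block;
                   std::size_t hi = std::min(a.size(), lo + block);
                   long acc = 0;
                   for (std::size_t i = lo; i < hi; ++i) acc += a[i];
                   partial[b] = acc;
                 },
                 block * sizeof(int)});
  }
  sched.run();
  long total = 0;
  for (long p : partial) total += p;
  return total;
}

int main() {
  std::vector<int> a(1000, 1);
  SerialScheduler sched;
  std::printf("sum = %ld\n", parallel_sum(sched, a, 64));  // prints sum = 1000
  return 0;
}
\end{verbatim}

Under this kind of interface, a work-stealing implementation would maintain per-core deques and steal from random victims, while a space-bounded implementation would use space_hint to pin each task (and its descendants) to the cluster of cores sharing the smallest cache whose capacity it fits.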
