Experimental Study of Thread Scheduling Libraries on Degraded CPU

In this paper, we compare four libraries for running threads efficiently when the performance of CPU cores is degraded. First, we are interested in the raw performance of the libraries when all CPU resources are available. Second, we measure how the scheduling strategy also affects memory management, so that scheduling strategies can be revisited in the future for the case where performance is artificially degraded in advance. It is well known that work stealing, when done in an uncontrolled way, may lead to poor cache performance. It is also known that thread migrations may induce penalties if they are too frequent. We study memory management at the processor level in order to find trade-offs between the number of active threads that an application should start and the memory hierarchy. Our implementations, coded with the different libraries, are compared against a Pthread implementation in which the threads are scheduled by the Linux kernel rather than by a dedicated tool. Our experimental results indicate that a scheduler may balance the load perfectly across cores and yet have a negative impact on execution time. We also put forward a relation between L1 cache misses, the number of steals, and the execution time, which will make it possible to focus on specific points for improving work-stealing schedulers in the future.
