Analysis and approximation of optimal co-scheduling on cmp

In recent years, the increasing design complexity and the problems of power and heat dissipation have caused a shift in processor technology to favor Chip Multiprocessors. In Chip Multiprocessors (CMP) architecture, it is common that multiple cores share some on-chip cache. The sharing may cause cache thrashing and contention among co-running jobs. Job co-scheduling is an approach to tackling the problem by assigning jobs to cores appropriately so that the contention and consequent performance degradations are minimized. This dissertation aims to tackle two of the most prominent challenges in job co-scheduling. The first challenge is in the computational complexity for determining optimal job co-schedules. This dissertation presents one of the first systematic analyses on the complexity of job co-scheduling. Besides proving the NP completeness of job co-scheduling, it introduces a set of algorithms, based on graph theory and Integer/Linear Programming, for computing optimal co-schedules or their lower bounds in scenarios with or without job migrations. For complex cases, it empirically demonstrates the feasibility for approximating the optimal schedules effectively by proposing several heuristics-based algorithms. These discoveries facilitate the assessment of job co-schedulers by providing necessary baselines, and shed insights to the development of practical co-scheduling systems. The second challenge resides in the prediction of the performance of processes co-running on a shared cache. This dissertation explores the influence on co-run performance prediction imposed by co-runners, program inputs, and cache configurations. Through a sequence of formal analysis, we derive an analytical co-run locality model, uncovering the inherent statistical connections between the data references of programs single-runs and their co-run locality. The model offers theoretical insights on co-run locality analysis and leads to a lightweight approach for fast prediction of shared cache performance. We demonstrate the effectiveness of the model in enabling proactive job co-scheduling. Together, the two-dimensional findings open up many new opportunities for cache management on modern CMP by laying the foundation for job co-scheduling, and enhancing the understanding to data locality and cache sharing significantly.

[1]  Peter J. Denning,et al.  Thrashing: its causes and prevention , 1968, AFIPS Fall Joint Computing Conference.

[2]  Xipeng Shen,et al.  Combining Locality Analysis with Online Proactive Job Co-scheduling in Chip Multiprocessors , 2010, HiPEAC.

[3]  Margo I. Seltzer,et al.  Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design , 2005, USENIX Annual Technical Conference, General Track.

[4]  Jack Edmonds,et al.  Maximum matching and a polyhedron with 0,1-vertices , 1965 .

[5]  Milo M. K. Martin,et al.  Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset , 2005, CARN.

[6]  Chen Yang,et al.  A cost-driven compilation framework for speculative parallelization of sequential programs , 2004, PLDI '04.

[7]  Rajiv Gupta,et al.  Copy or Discard execution model for speculative parallelization on multicores , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[8]  G. Edward Suh,et al.  A new memory monitoring scheme for memory-aware scheduling and partitioning , 2002, Proceedings Eighth International Symposium on High Performance Computer Architecture.

[9]  M TullsenDean,et al.  Symbiotic jobscheduling for a simultaneous mutlithreading processor , 2000 .

[10]  Xipeng Shen,et al.  A study on optimally co-scheduling jobs of different lengths on chip multiprocessors , 2009, CF '09.

[11]  Harish Patil,et al.  Pin: building customized program analysis tools with dynamic instrumentation , 2005, PLDI '05.

[12]  Chen Ding,et al.  Miss Rate Prediction Across Program Inputs and Cache Configurations , 2007, IEEE Transactions on Computers.

[13]  Tong Li,et al.  Efficient and scalable multiprocessor fair scheduling using distributed weighted round-robin , 2009, PPoPP '09.

[14]  J. N. Amaral,et al.  Benchmark Design for Robust Profile-Directed Optimization , 2007 .

[15]  Alexandra Fedorova,et al.  Addressing shared resource contention in multicore processors via scheduling , 2010, ASPLOS XV.

[16]  Xipeng Shen,et al.  Exploration of the Influence of Program Inputs on CMP Co-scheduling , 2008, Euro-Par.

[17]  Chen Ding,et al.  Locality phase prediction , 2004, ASPLOS XI.

[18]  Irving L. Traiger,et al.  Evaluation Techniques for Storage Hierarchies , 1970, IBM Syst. J..

[19]  Chen Ding,et al.  Array regrouping and structure splitting using whole-program reference affinity , 2004, PLDI '04.

[20]  Feng Mao,et al.  Cross-Input Learning and Discriminative Prediction in Evolvable Virtual Machines , 2009, 2009 International Symposium on Code Generation and Optimization.

[21]  G. Edward Suh,et al.  Dynamic Partitioning of Shared Cache Memory , 2004, The Journal of Supercomputing.

[22]  Kristof Beyls,et al.  Reuse Distance as a Metric for Cache Behavior. , 2001 .

[23]  Yan Solihin,et al.  Predicting inter-thread cache contention on a chip multi-processor architecture , 2005, 11th International Symposium on High-Performance Computer Architecture.

[24]  Srihari Makineni,et al.  Communist, Utilitarian, and Capitalist cache policies on CMPs: Caches as a shared resource , 2006, 2006 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[25]  Chen Ding,et al.  Locality approximation using time , 2007, POPL '07.

[26]  John M. Mellor-Crummey,et al.  Cross-architecture performance predictions for scientific applications using parameterized models , 2004, SIGMETRICS '04/Performance '04.

[27]  Harold S. Stone,et al.  Footprints in the cache , 1986, SIGMETRICS '86/PERFORMANCE '86.

[28]  Dean M. Tullsen,et al.  Symbiotic jobscheduling with priorities for a simultaneous multithreading processor , 2002, SIGMETRICS '02.

[29]  Dean M. Tullsen,et al.  Initial observations of the simultaneous multithreading Pentium 4 processor , 2003, 2003 12th International Conference on Parallel Architectures and Compilation Techniques.

[30]  Xipeng Shen,et al.  Is Reuse Distance Applicable to Data Locality Analysis on Chip Multiprocessors? , 2010, CC.

[31]  Dean M. Tullsen,et al.  Compiler Techniques for Reducing Data Cache Miss Rate on a Multithreaded Architecture , 2008, HiPEAC.

[32]  Brad Calder,et al.  Automatically characterizing large scale program behavior , 2002, ASPLOS X.

[33]  Dean M. Tullsen,et al.  Exploiting unbalanced thread scheduling for energy and performance on a CMP of SMT processors , 2006, Proceedings 20th IEEE International Parallel & Distributed Processing Symposium.

[34]  Kai Li,et al.  The PARSEC benchmark suite: Characterization and architectural implications , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[35]  G. Edward Suh,et al.  Analytical cache models with applications to cache partitioning , 2001, ICS '01.

[36]  Xipeng Shen,et al.  Scalable Implementation of Efficient Locality Approximation , 2008, LCPC.

[37]  William J. Cook,et al.  Computing Minimum-Weight Perfect Matchings , 1999, INFORMS J. Comput..

[38]  Sandhya Dwarkadas,et al.  Compatible phase co-scheduling on a CMP of multi-threaded processors , 2006, Proceedings 20th IEEE International Parallel & Distributed Processing Symposium.

[39]  Won-Taek Lim,et al.  Architectural support for operating system-driven CMP cache management , 2006, 2006 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[40]  Chen Ding,et al.  Software behavior oriented parallelization , 2007, PLDI '07.

[41]  Sanjay Mehrotra,et al.  On the Implementation of a Primal-Dual Interior Point Method , 1992, SIAM J. Optim..

[42]  Michael D. Smith,et al.  Improving Performance Isolation on Chip Multiprocessors via an Operating System Scheduler , 2007, 16th International Conference on Parallel Architecture and Compilation Techniques (PACT 2007).

[43]  David A. Padua,et al.  Compile-Time Based Performance Prediction , 1999, LCPC.

[44]  Yan Solihin,et al.  Scal-Tool: Pinpointing and Quantifying Scalability Bottlenecks in DSM Multiprocessors , 1999, ACM/IEEE SC 1999 Conference (SC'99).

[45]  Jie Chen,et al.  Analysis and approximation of optimal co-scheduling on Chip Multiprocessors , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[46]  S. Kim,et al.  Fair cache sharing and partitioning in a chip multiprocessor architecture , 2004, Proceedings. 13th International Conference on Parallel Architecture and Compilation Techniques, 2004. PACT 2004..

[47]  Erik Hagersten,et al.  A statistical multiprocessor cache model , 2006, 2006 IEEE International Symposium on Performance Analysis of Systems and Software.

[48]  Alex Settle,et al.  Architectural Support for Enhanced SMT Job Scheduling , 2004, IEEE PACT.

[49]  Yin Zhang,et al.  Solving large-scale linear programs by interior-point methods under the Matlab ∗ Environment † , 1998 .

[50]  Rudolf Eigenmann,et al.  Speculative thread decomposition through empirical optimization , 2007, PPoPP.

[51]  Sivarama P. Dandamudi Parallel and Cluster Systems , 2003 .

[52]  Susan J. Eggers,et al.  Thread-Sensitive Scheduling for SMT Processors , 2000 .

[53]  Gurindar S. Sohi,et al.  Task selection for a multiscalar processor , 1998, Proceedings. 31st Annual ACM/IEEE International Symposium on Microarchitecture.

[54]  Zhao Zhang,et al.  Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems , 2008, 2008 IEEE 14th International Symposium on High Performance Computer Architecture.

[55]  Ian Pratt,et al.  Hyper-Threading Aware Process Scheduling Heuristics , 2005, USENIX Annual Technical Conference, General Track.

[56]  Steve Carr,et al.  Feedback-directed memory disambiguation through store distance analysis , 2006, ICS '06.

[57]  Yutao Zhong,et al.  Predicting whole-program locality through reuse distance analysis , 2003, PLDI '03.

[58]  Alan Jay Smith,et al.  On the effectiveness of set associative page mapping and its application to main memory management , 1976, ICSE '76.

[59]  Michael Stumm,et al.  Thread clustering: sharing-aware scheduling on SMP-CMP-SMT multiprocessors , 2007, EuroSys '07.

[60]  George Ho,et al.  PAPI: A Portable Interface to Hardware Performance Counters , 1999 .

[61]  Xipeng Shen,et al.  Does cache sharing on modern CMP matter to the performance of contemporary multithreaded programs? , 2010, PPoPP '10.