Understanding Co-run Degradations on Integrated Heterogeneous Processors

Co-runs of independent applications on systems with heterogeneous processors are common (data centers, mobile devices, etc.). There has been limited understanding of the influence of co-runners on such systems. Previous studies on this topic have been conducted on simulators with limited settings.
