Shared resource management for efficient heterogeneous computing

[1]  Nam Sung Kim,et al.  Improving Throughput of Power-Constrained GPUs Using Dynamic Voltage/Frequency and Core Scaling , 2011, 2011 International Conference on Parallel Architectures and Compilation Techniques.

[2]  Gary S. Tyson,et al.  A modified approach to data cache management , 1995, MICRO 1995.

[3]  Jens Sparsø,et al.  A router architecture for connection-oriented service guarantees in the MANGO clockless network-on-chip , 2005, Design, Automation and Test in Europe.

[5]  Hyesoon Kim,et al.  TAP: A TLP-aware cache management policy for a CPU-GPU heterogeneous architecture , 2012, IEEE International Symposium on High-Performance Computer Architecture.

[6]  Daniel A. Jiménez,et al.  Dynamic branch prediction with perceptrons , 2001, Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture.

[7]  A. Snavely,et al.  Symbiotic jobscheduling for a simultaneous multithreading processor , 2000, SIGP.

[8]  Kees G. W. Goossens,et al.  Trade Offs in the Design of a Router with Both Guaranteed and Best-Effort Services for Networks on Chip , 2003, DATE.

[9]  Yan Solihin,et al.  Counter-Based Cache Replacement and Bypassing Algorithms , 2008, IEEE Transactions on Computers.

[10]  John Paul Shen,et al.  Mitigating Amdahl's law through EPI throttling , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[11]  Aamer Jaleel,et al.  High performance cache replacement using re-reference interval prediction (RRIP) , 2010, ISCA.

[12]  Chi-Ying Tsui,et al.  Optimal link scheduling on improving best-effort and guaranteed services performance in network-on-chip systems , 2006, 2006 43rd ACM/IEEE Design Automation Conference.

[13]  Chita R. Das,et al.  Aérgia: exploiting packet latency slack in on-chip networks , 2010, ISCA.

[14]  Yale N. Patt,et al.  Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[15]  Tobias Bjerregaard,et al.  A survey of research and practices of Network-on-chip , 2006, CSUR.

[16]  Sudhakar Yalamanchili,et al.  Accelerating simulation of agent-based models on heterogeneous architectures , 2013, GPGPU@ASPLOS.

[17]  Jian Li,et al.  Dynamic power-performance adaptation of parallel computation on chip multiprocessors , 2006, The Twelfth International Symposium on High-Performance Computer Architecture, 2006.

[18]  Grigori Fursin,et al.  Predictive Runtime Code Scheduling for Heterogeneous Architectures , 2008, HiPEAC.

[19]  Harish Patil,et al.  Pin: building customized program analysis tools with dynamic instrumentation , 2005, PLDI '05.

[20]  Yi Yang,et al.  CPU-assisted GPGPU on fused CPU-GPU architectures , 2012, IEEE International Symposium on High-Performance Computer Architecture.

[21]  Krste Asanovic,et al.  Globally-Synchronized Frames for Guaranteed Quality-of-Service in On-Chip Networks , 2008, 2008 International Symposium on Computer Architecture.

[22]  Wei Huang,et al.  Processor-Memory Power Shifting for Multi-Core Systems , 2012 .

[23]  Chita R. Das,et al.  ViChaR: A Dynamic Virtual Channel Regulator for Network-on-Chip Routers , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[24]  Aamer Jaleel,et al.  Adaptive insertion policies for managing shared caches , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[25]  Jongman Kim,et al.  Virtualizing Virtual Channels for Increased Network-on-Chip Robustness and Upgradeability , 2012, 2012 IEEE Computer Society Annual Symposium on VLSI.

[26]  Saman P. Amarasinghe,et al.  Portable performance on heterogeneous architectures , 2013, ASPLOS '13.

[27]  Sherief Reda,et al.  Pack & Cap: Adaptive DVFS and thread packing under power caps , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[28]  Sudhakar Yalamanchili,et al.  Interconnection Networks: An Engineering Approach , 2002 .

[29]  Jian Li,et al.  Power-efficient time-sensitive mapping in heterogeneous systems , 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).

[30]  Kees Goossens,et al.  AEthereal network on chip: concepts, architectures, and implementations , 2005, IEEE Design & Test of Computers.

[31]  Lei Gao,et al.  A dynamically-allocated virtual channel architecture with congestion awareness for on-chip routers , 2008, 2008 45th ACM/IEEE Design Automation Conference.

[32]  Kevin Kai-Wei Chang,et al.  Staged memory scheduling: Achieving high performance and scalability in heterogeneous systems , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[33]  Hyesoon Kim,et al.  Qilin: Exploiting parallelism on heterogeneous multiprocessors with adaptive mapping , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[34]  Francisco J. Cazorla,et al.  MLP-Aware Dynamic Cache Partitioning , 2008, HiPEAC.

[35]  Onur Mutlu,et al.  Preemptive Virtual Clock: A flexible, efficient, and cost-effective QOS scheme for networks-on-chip , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[36]  Onur Mutlu,et al.  Kilo-NOC: A heterogeneous network-on-chip architecture for scalability and service guarantees , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).

[37]  Bishop Brock,et al.  Architecting for power management: The IBM® POWER7™ approach , 2010, HPCA-16 2010: The Sixteenth International Symposium on High-Performance Computer Architecture.

[38]  Kees G. W. Goossens,et al.  dAElite: A TDM NoC Supporting QoS, Multicast, and Fast Connection Set-Up , 2014, IEEE Transactions on Computers.

[39]  G. Edward Suh,et al.  A new memory monitoring scheme for memory-aware scheduling and partitioning , 2002, Proceedings Eighth International Symposium on High Performance Computer Architecture.

[40]  Yvon Jégou,et al.  Using virtual lines to enhance locality exploitation , 1994, ICS '94.

[41]  Mahmut T. Kandemir,et al.  PEPON: Performance-aware hierarchical power budgeting for NoC based multicores , 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).

[42]  Onur Mutlu,et al.  Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[43]  Li Shang,et al.  Multi-Optimization power management for chip multiprocessors , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[44]  Karthick Rajamani,et al.  Power-performance management on an IBM POWER7 server , 2010, 2010 ACM/IEEE International Symposium on Low-Power Electronics and Design (ISLPED).

[45]  Kees G. W. Goossens,et al.  Aelite: A flit-synchronous Network on Chip with composable and predictable services , 2009, 2009 Design, Automation & Test in Europe Conference & Exhibition.

[46]  Naehyuck Chang,et al.  Accurate modeling and calculation of delay and energy overheads of dynamic voltage scaling in modern high-performance microprocessors , 2010, 2010 ACM/IEEE International Symposium on Low-Power Electronics and Design (ISLPED).

[47]  John Kim,et al.  Throughput-Effective On-Chip Networks for Manycore Accelerators , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[48]  Natalie D. Enright Jerger,et al.  Outstanding Research Problems in NoC Design: System, Microarchitecture, and Circuit Perspectives , 2009, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.

[49]  Sudhakar Yalamanchili,et al.  Cooperative boosting: needy versus greedy power management , 2013, ISCA.

[50]  Michael F. P. O'Boyle,et al.  Portable mapping of data parallel programs to OpenCL for heterogeneous systems , 2013, Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).

[51]  Gary S. Tyson,et al.  Region-based caching: an energy-delay efficient memory architecture for embedded processors , 2000, CASES '00.

[52]  Xuejun Yang,et al.  IPC-Based Cache Partitioning: An IPC-Oriented Dynamic Shared Cache Partitioning Mechanism , 2008, 2008 International Conference on Convergence and Hybrid Information Technology.

[53]  Kai Ma,et al.  DPPC: Dynamic power partitioning and capping in chip multiprocessors , 2011, 2011 IEEE 29th International Conference on Computer Design (ICCD).

[54]  Kai Ma,et al.  PGCapping: Exploiting power gating for power capping and core lifetime balancing in CMPs , 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).

[55]  Russell Tessier,et al.  ASOC: a scalable, single-chip communications architecture , 2000, Proceedings 2000 International Conference on Parallel Architectures and Compilation Techniques.

[56]  Rajiv Kapoor,et al.  Pinpointing Representative Portions of Large Intel® Itanium® Programs with Dynamic Instrumentation , 2004, 37th International Symposium on Microarchitecture (MICRO-37'04).

[57]  Sudhakar Yalamanchili,et al.  Adaptive virtual channel partitioning for network-on-chip in heterogeneous architectures , 2013, ACM Trans. Design Autom. Electr. Syst.

[58]  Hiroshi Sasaki,et al.  Coordinated power-performance optimization in manycores , 2013, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques.

[59]  Yusuf Leblebici,et al.  Quantitative modelling and comparison of communication schemes to guarantee quality-of-service in networks-on-chip , 2005, 2005 IEEE International Symposium on Circuits and Systems.

[60]  Chita R. Das,et al.  Application-aware prioritization mechanisms for on-chip networks , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[61]  Tor M. Aamodt,et al.  Complexity effective memory access scheduling for many-core accelerator architectures , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[62]  Cédric Augonnet,et al.  Data-Aware Task Scheduling on Multi-accelerator Based Platforms , 2010, 2010 IEEE 16th International Conference on Parallel and Distributed Systems.

[63]  Onur Mutlu,et al.  Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems , 2008, 2008 International Symposium on Computer Architecture.

[64]  Kai Ma,et al.  Scalable power control for many-core architectures running multi-threaded applications , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).

[65]  Karthick Rajamani,et al.  A performance-conserving approach for reducing peak power consumption in server systems , 2005, ICS '05.

[66]  David I. August,et al.  Automatic CPU-GPU communication management and optimization , 2011, PLDI '11.

[67]  Mattan Erez,et al.  A QoS-aware memory controller for dynamically balancing GPU and CPU bandwidth use in an MPSoC , 2012, DAC Design Automation Conference 2012.

[68]  David H. Albonesi,et al.  Selective cache ways: on-demand cache resource allocation , 1999, MICRO-32. Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture.

[69]  Efraim Rotem,et al.  Power-Management Architecture of the Intel Microarchitecture Code-Named Sandy Bridge , 2012, IEEE Micro.

[70]  José L. Sánchez,et al.  Exploring NoC Virtualization Alternatives in CMPs , 2012, 2012 20th Euromicro International Conference on Parallel, Distributed and Network-based Processing.

[71]  Margaret Martonosi,et al.  An Analysis of Efficient Multi-Core Global Power Management Policies: Maximizing Performance for a Given Power Budget , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[72]  Radu Marculescu,et al.  Analysis and optimization of prediction-based flow control in networks-on-chip , 2008, TODE.

[73]  Gabriel H. Loh,et al.  Scalable Shared-Cache Management by Containing Thrashing Workloads , 2010, HiPEAC.

[74]  Théodore Marescaux,et al.  Introducing the SuperGT Network-on-Chip; SuperGT QoS: more than just GT , 2007, 2007 44th ACM/IEEE Design Automation Conference.

[75]  Henry Hoffmann,et al.  The Raw Microprocessor: A Computational Fabric for Software Circuits and General-Purpose Programs , 2002, IEEE Micro.

[76]  Norman P. Jouppi,et al.  Reconfigurable caches and their application to media processing , 2000, Proceedings of 27th International Symposium on Computer Architecture.

[77]  Fang Liu,et al.  Understanding how off-chip memory bandwidth partitioning in Chip Multiprocessors affects system performance , 2010, HPCA-16 2010: The Sixteenth International Symposium on High-Performance Computer Architecture.

[78]  Jung Ho Ahn,et al.  McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[79]  Hsien-Hsin S. Lee,et al.  Energy efficient D-TLB and data cache using semantic-aware multilateral partitioning , 2003, ISLPED '03.

[80]  Onur Mutlu,et al.  A Case for MLP-Aware Cache Replacement , 2006, 33rd International Symposium on Computer Architecture (ISCA'06).

[81]  Yuval Tamir,et al.  Dynamically-Allocated Multi-Queue Buffers for VLSI Communication Switches , 1992, IEEE Trans. Computers.

[82]  Sudhakar Yalamanchili,et al.  Ocelot: A dynamic optimization framework for bulk-synchronous applications in heterogeneous systems , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[83]  Kees G. W. Goossens,et al.  Networks on silicon: combining best-effort and guaranteed services , 2002, Proceedings 2002 Design, Automation and Test in Europe Conference and Exhibition.

[84]  Kees G. W. Goossens,et al.  A Design Flow for Application-Specific Networks on Chip with Guaranteed Performance to Accelerate SOC Design and Verification , 2005, Design, Automation and Test in Europe.

[85]  Wolf-Dietrich Weber,et al.  A quality-of-service mechanism for interconnection networks in system-on-chips , 2005, Design, Automation and Test in Europe.

[86]  William J. Dally,et al.  Principles and Practices of Interconnection Networks , 2004 .

[87]  T. Sakurai,et al.  Run-time voltage hopping for low-power real-time systems , 2000, Proceedings 37th Design Automation Conference.

[88]  Kevin Skadron,et al.  Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[89]  Fabien Clermidy,et al.  An asynchronous NOC architecture providing low latency service and its multi-level design framework , 2005, 11th IEEE International Symposium on Asynchronous Circuits and Systems.

[90]  K.J. Nesbit,et al.  AC/DC: an adaptive data cache prefetcher , 2004, Proceedings. 13th International Conference on Parallel Architecture and Compilation Techniques, 2004. PACT 2004.

[91]  Natalie D. Enright Jerger,et al.  Achieving predictable performance through better memory controller placement in many-core CMPs , 2009, ISCA '09.

[92]  Carole-Jean Wu,et al.  SHiP: Signature-based Hit Predictor for high performance caching , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[93]  Mahmut T. Kandemir,et al.  SHARP control: Controlled shared cache management in chip multiprocessors , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[94]  R. Marculescu,et al.  Exploiting the routing flexibility for energy/performance aware mapping of regular NoC architectures , 2003, 2003 Design, Automation and Test in Europe Conference and Exhibition.

[95]  José Duato,et al.  A new scalable and cost-effective congestion management strategy for lossless multistage interconnection networks , 2005, 11th International Symposium on High-Performance Computer Architecture.

[96]  Natalie D. Enright Jerger,et al.  DBAR: An efficient routing algorithm to support multiple concurrent applications in networks-on-chip , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).

[97]  Hao Wang,et al.  Workload and power budget partitioning for single-chip heterogeneous processors , 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).

[98]  Christine A. Shoemaker,et al.  Scalable thread scheduling and global power management for heterogeneous many-core architectures , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[99]  James E. Smith,et al.  Data Cache Prefetching Using a Global History Buffer , 2005, IEEE Micro.

[100]  Michael J. Schulte,et al.  ERCBench: An Open-Source Benchmark Suite for Embedded and Reconfigurable Computing , 2010, 2010 International Conference on Field Programmable Logic and Applications.

[101]  Wen-mei W. Hwu,et al.  Run-time Adaptive Cache Hierarchy Via Reference Analysis , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[102]  Yan Solihin,et al.  Fair cache sharing and partitioning in a chip multiprocessor architecture , 2004, Proceedings. 13th International Conference on Parallel Architecture and Compilation Techniques, 2004. PACT 2004.

[103]  Mor Harchol-Balter,et al.  Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[104]  Ran Ginosar,et al.  An asynchronous router for multiple service levels networks on chip , 2005, 11th IEEE International Symposium on Asynchronous Circuits and Systems.

[105]  Mainak Chaudhuri,et al.  Pseudo-LIFO: The foundation of a new family of replacement policies for last-level caches , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[106]  Aamer Jaleel,et al.  Adaptive insertion policies for high performance caching , 2007, ISCA '07.

[107]  Mainak Chaudhuri,et al.  Bypass and insertion algorithms for exclusive last-level caches , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).

[108]  Gabriel H. Loh,et al.  PIPP: promotion/insertion pseudo-partitioning of multi-core shared caches , 2009, ISCA '09.

[109]  Soraya Ghiasi,et al.  Scheduling for heterogeneous processors in server systems , 2005, CF '05.

[110]  Ran Ginosar,et al.  QNoC: QoS architecture and design process for network on chip , 2004, J. Syst. Archit.

[111]  Teresa H. Y. Meng,et al.  Merge: a programming model for heterogeneous multi-core systems , 2008, ASPLOS.

[112]  Axel Jantsch,et al.  Guaranteed bandwidth using looped containers in temporally disjoint networks within the nostrum network on chip , 2004, Proceedings Design, Automation and Test in Europe Conference and Exhibition.

[113]  Jing Li,et al.  Fast lock scheme for phase-locked loops , 2009, 2009 IEEE Custom Integrated Circuits Conference.

[114]  Russell Tessier,et al.  An architecture and compiler for scalable on-chip communication , 2004, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.

[115]  Kai Ma,et al.  Temperature-constrained power control for chip multiprocessors with online model estimation , 2009, ISCA '09.

[116]  Onur Mutlu,et al.  Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.

[117]  Axel Jantsch,et al.  Load distribution with the proximity congestion awareness in a network on chip , 2003, 2003 Design, Automation and Test in Europe Conference and Exhibition.

[118]  R. Marculescu,et al.  Traffic analysis for on-chip networks design of multimedia applications , 2002, Proceedings 2002 Design Automation Conference.

[119]  Kevin Kai-Wei Chang,et al.  HAT: Heterogeneous Adaptive Throttling for On-Chip Networks , 2012, 2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing.

[120]  Kaushik Roy,et al.  Gated-Vdd: a circuit technique to reduce leakage in deep-submicron cache memories , 2000, ISLPED '00.

[121]  Timothy Mark Pinkston,et al.  Evaluation of queue designs for true fully adaptive routers , 2004, J. Parallel Distributed Comput.

[122]  Radu Marculescu,et al.  DyAD - smart routing for networks-on-chip , 2004, Proceedings. 41st Design Automation Conference, 2004.

[123]  Hsien-Hsin S. Lee,et al.  Exploiting access semantics and program behavior to reduce snoop power in chip multiprocessors , 2008, ASPLOS.

[124]  G. Edward Suh,et al.  Dynamic Partitioning of Shared Cache Memory , 2004, The Journal of Supercomputing.

[125]  Carole-Jean Wu,et al.  PACMan: Prefetch-Aware Cache Management for high performance caching , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[126]  Chita R. Das,et al.  A case for heterogeneous on-chip interconnects for CMPs , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).