Thermal-Aware Design and Runtime Management of 3D Stacked Multiprocessors

The sustained increase in computational performance demanded by next-generation applications drives the increasing core counts of modern multiprocessor systems. However, in the dark silicon era, the performance levels and integration density of such systems are limited by the thermal constraints of their physical package. These constraints are more severe in three-dimensional (3D) integrated systems, as a consequence of the complex thermal characteristics exhibited by stacked silicon dies. This dissertation investigates the development of efficient, thermal-aware multiprocessor architectures, and presents methodologies to enable the simultaneous exploration of their thermal and functional behaviour.
Chapter 2 examines the efficiency of multiprocessor architectures from the perspective of the memory hierarchy, and presents techniques for the effective management and transfer of on-chip data that minimize the time spent waiting on memory accesses. In shared-memory multiprocessors, this is achieved through the proposed Persistence Selective Caching (PSC) and CacheBalancer schemes, which influence what data is stored in on-chip caches, where it is stored, and for how long. This enables the memory hierarchy to adapt to changing execution behaviour, balance resource utilization, and, most importantly, reduce the average latency and energy per memory access. Chapter 2 also presents the Pronto system, which enables efficient data transfers in message-passing multiprocessors by minimizing the role of the processing element in the management of transfers. Pronto decreases the overheads incurred in setting up and managing data transfers, thereby yielding shorter communication latencies. It also simplifies the semantics of data movement by abstracting the implementation details of communication from the programmer, enabling transfers to be specified entirely at the task level.
Chapter 3 investigates thermal-aware design for 3D Integrated Circuits (ICs) using Nagata’s equation, a mathematical representation of the dark silicon problem. The chapter explores the thermal design space of 3D ICs in terms of this equation, and proposes a high-level flow to characterize the specific influence of individual design parameters on the thermal behaviour of die stacks. The results of this exploration advance the state of the art by providing new insights into the critical role of power density, thermal conductivity and stack construction in the formation of hotspots in 3D ICs. Building on these insights, the Ctherm framework is proposed for the thermal-aware design of multiprocessor systems-on-chip (MPSoCs). Ctherm enables the concurrent evaluation of the thermal and functional performance of MPSoCs using automatically generated fine-grained area, latency and energy models for system components, and facilitates the exploration of thermal behaviour early in the system design flow. The efficacy of the framework is demonstrated using a number of practical design cases ranging from floorplanning and temperature sensor placement to application tuning. Together, the characterization and the Ctherm framework further our understanding of the thermal behaviour of die stacks, and provide a practical template for the realization of thermal-aware electronic design automation tooling for 3D ICs.
Chapter 4 examines the management of thermal issues that arise in 3D MPSoCs at runtime. Temperature control is typically exercised by means of Dynamic Thermal Management (DTM) schemes, which continuously adapt the activity and power dissipation of system components. A significant disadvantage of state-of-the-art DTM schemes lies in their inability to account for the non-uniform thermal behaviour of die stacks, which leads to ineffective temperature management and degraded system performance. To address this, a novel 3D Dynamic Voltage Frequency Scaling (DVFS) scheme is proposed that takes these non-uniformities into account within its power management algorithm, maintains operating temperatures within a safe range, and maximizes system performance within the thermal margins available at individual processing elements. The chapter also presents an adaptive routing strategy that decreases the magnitude of thermal gradients in network-on-chip based 3D architectures by directing traffic along paths of low temperature. The proposed Immediate Neighbourhood Temperature (INT) adaptive routing scheme steers interconnect traffic away from regions with thermal hotspots based only on temperature information available in the immediate neighbourhood, relying on the heat transfer characteristics of 3D ICs to avoid the need for a global temperature monitoring network. The consequent spreading of interconnect activity over multiple paths results in balanced thermal profiles and decreased operating temperatures across the system.
Over the course of these chapters, this dissertation explores the critical issues impeding the realization of thermal-aware 3D stacked multiprocessors, and details a multifaceted approach towards addressing the challenges of dark silicon.
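As an illustration of the neighbourhood-based routing idea described for Chapter 4, the following Python sketch selects the next hop in a 3D mesh using only the temperatures reported by adjacent routers. It is a minimal sketch under stated assumptions, not the dissertation's INT algorithm: the mesh coordinates, the restriction to minimal (productive) hops, the function names `productive_neighbours` and `next_hop`, and the treatment of missing readings are all hypothetical, and deadlock avoidance is ignored.

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int, int]  # (x, y, z) position of a router in a 3D mesh (assumed topology)


def productive_neighbours(current: Coord, dest: Coord) -> List[Coord]:
    """Return the neighbours of `current` that are one hop closer to `dest`."""
    hops: List[Coord] = []
    for axis in range(3):
        delta = dest[axis] - current[axis]
        if delta != 0:
            step = list(current)
            step[axis] += 1 if delta > 0 else -1
            hops.append((step[0], step[1], step[2]))
    return hops


def next_hop(current: Coord, dest: Coord, neighbour_temp: Dict[Coord, float]) -> Coord:
    """Pick the coolest productive neighbour using only locally visible temperatures."""
    if current == dest:
        return current
    candidates = productive_neighbours(current, dest)
    # Assumption of this sketch: a neighbour with no reading is treated as hottest.
    return min(candidates, key=lambda n: neighbour_temp.get(n, float("inf")))


if __name__ == "__main__":
    # Hypothetical temperatures (degrees Celsius) of the routers adjacent to (0, 0, 0).
    temps = {(1, 0, 0): 78.5, (0, 1, 0): 71.0, (0, 0, 1): 83.2}
    print(next_hop((0, 0, 0), (2, 2, 1), temps))  # prints (0, 1, 0)
```

Because each decision depends only on adjacent readings, no global temperature map is needed, which mirrors the motivation given above; a practical scheme would additionally need a deadlock-free turn restriction and a policy for balancing temperature against congestion.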
