Execution Drafting: Energy Efficiency through Computation Deduplication

Computation is increasingly moving to the data center. Thus, the energy used by CPUs in the data center is gaining importance. The centralization of computation in the data center has also led to much commonality between the applications running there. For example, there are many instances of similar or identical versions of the Apache web server running in a large data center. Many of these applications, such as bulk image resizing or video transcoding, favor increasing throughput over single-stream performance. In this work, we propose Execution Drafting, an architectural technique for executing identical instructions from different programs or threads on the same multithreaded core, such that they flow down the pipe consecutively, or draft. Drafting reduces switching and removes the need to fetch and decode drafted instructions, thereby saving energy. Drafting can also reduce the energy of the execution and commit stages of a pipeline when drafted instructions have similar operands, such as when loading constants. We demonstrate Execution Drafting saving energy when executing the same application with different data, as well as different programs operating on different data, as is the case for different versions of the same program. We evaluate hardware techniques to identify when to draft and analyze the hardware overheads of Execution Drafting implemented in an OpenSPARC T1 core. We show that Execution Drafting can result in substantial performance-per-energy gains (up to 20%) in a data center without decreasing throughput or dramatically increasing latency.
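
As a rough illustration of the fetch and decode savings described above, the following toy model counts front-end events for two instruction streams with and without drafting. It is a minimal sketch and not the hardware mechanism evaluated in the paper: the function name `count_front_end_events`, the instruction strings, and the assumption that the two streams are already synchronized and compared slot by slot are all hypothetical.

```python
from itertools import zip_longest


def count_front_end_events(stream_a, stream_b, draft=True):
    """Count fetch/decode events for two instruction streams.

    Toy assumption: the streams are compared position by position, as if
    the threads were already synchronized. When both slots hold the same
    instruction, the second one drafts behind the first and reuses its
    fetch and decode.
    """
    events = 0
    drafted = 0
    for a, b in zip_longest(stream_a, stream_b):
        if draft and a is not None and a == b:
            events += 1   # one fetch/decode serves both threads
            drafted += 1
        else:
            # Each instruction that is present is fetched and decoded alone.
            events += (a is not None) + (b is not None)
    return events, drafted


# Hypothetical streams: two versions of a program that differ in one slot.
thread_a = ["ld r1,[r2]", "add r3,r1,4", "st r3,[r2]", "ba loop"]
thread_b = ["ld r1,[r2]", "add r3,r1,4", "sub r3,r3,1", "ba loop"]

baseline, _ = count_front_end_events(thread_a, thread_b, draft=False)
with_draft, drafted = count_front_end_events(thread_a, thread_b, draft=True)
print(baseline, with_draft, drafted)  # prints: 8 5 3
```

In this example, three of the four slots match, so the modeled front end performs 5 fetch/decode events instead of 8. The model ignores the synchronization hardware, operand similarity, and energy costs that the paper analyzes.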
