Improving Performance and Energy Consumption in Embedded Systems via Binary Acceleration: A Survey

The breakdown of Dennard scaling has resulted in a decade-long stall in the maximum operating clock frequencies of processors. To mitigate this issue, computing shifted to multi-core devices, which introduced the need for programming flows and tools that facilitate the expression of workload parallelism at high abstraction levels. However, not all workloads are easily parallelizable, and the minor improvements to processor cores have not significantly increased single-threaded performance. At the same time, instruction-level parallelism (ILP) in applications remains considerably underexplored. This article reviews notable approaches that exploit this potential parallelism through the automatic generation of specialized hardware from binary code. Although research on this topic spans more than 20 years, automatic acceleration of software via translation to hardware has gained new importance with the recent trend toward reconfigurable heterogeneous platforms. We characterize this kind of binary acceleration approach and the accelerator architectures on which it relies. We summarize notable state-of-the-art approaches individually and present a taxonomy and comparison. The surveyed approaches report performance gains from 2.6× to 5.6×, mostly for bare-metal embedded applications, along with power consumption reductions between 1.3× and 3.9×. We believe the methodologies and results achievable by automatic hardware generation approaches are promising in the context of emergent reconfigurable devices.
