Improving Performance and Energy Consumption in Embedded Systems via Binary Acceleration: A Survey
João M. P. Cardoso | João Canas Ferreira | Nuno Paulino
[1] Bill Moyer, et al. A low power unified cache architecture providing power and performance flexibility, 2000, ISLPED'00: Proceedings of the 2000 International Symposium on Low Power Electronics and Design (Cat. No.00TH8514).
[2] Yun Liang, et al. High-Level Synthesis: Productivity, Performance, and Software Constraints, 2012, J. Electr. Comput. Eng.
[3] Stamatis Vassiliadis, et al. The MOLEN rho-mu-Coded Processor, 2001, FPL.
[4] Scott A. Mahlke, et al. Bridging the computation gap between programmable processors and hardwired accelerators, 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.
[5] Karthikeyan Sankaralingam, et al. Power Limitations and Dark Silicon Challenge the Future of Multicore, 2012, TOCS.
[6] Wayne H. Wolf. A Decade of Hardware/Software Codesign, 2003, Computer.
[7] Scott A. Mahlke, et al. VEAL: Virtualized Execution Accelerator for Loops, 2008, 2008 International Symposium on Computer Architecture.
[8] Mark Horowitz, et al. CPU DB: Recording Microprocessor History, 2012, ACM Queue.
[9] Paul Gratz, et al. ILP and TLP in shared memory applications: A limit study, 2014, 2014 23rd International Conference on Parallel Architecture and Compilation (PACT).
[10] Karthikeyan Sankaralingam, et al. DySER: Unifying Functionality and Parallelism Specialization for Energy-Efficient Computing, 2012, IEEE Micro.
[11] John Wawrzynek, et al. ASTRO: Synthesizing application-specific reconfigurable hardware traces to exploit memory-level parallelism, 2015, Microprocess. Microsystems.
[12] Pedro C. Diniz, et al. Compiling for reconfigurable computing: A survey, 2010, CSUR.
[13] Stamatis Vassiliadis, et al. The MOLEN ρμ-coded processor, 2001.
[14] Kiyoung Choi. Coarse-Grained Reconfigurable Array: Architecture and Application Mapping, 2011, IPSJ Trans. Syst. LSI Des. Methodol.
[15] David Novo, et al. From low-architectural expertise up to high-throughput non-binary LDPC decoders: Optimization guidelines using high-level synthesis, 2015, 2015 25th International Conference on Field Programmable Logic and Applications (FPL).
[16] Alex K. Jones, et al. Interconnect Customization for a Coarse-grained Reconfigurable Fabric, 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.
[17] Lilian Bossuet, et al. Architectures of flexible symmetric key crypto engines—a survey: From hardware coprocessor to multi-crypto-processor system on chip, 2013, CSUR.
[18] Scott A. Mahlke, et al. An architecture framework for transparent instruction set customization in embedded processors, 2005, 32nd International Symposium on Computer Architecture (ISCA'05).
[19] Frank Vahid, et al. Design and implementation of a MicroBlaze-based warp processor, 2009, TECS.
[20] Albert Y. Zomaya, et al. A Survey of Mobile Device Virtualization, 2016, ACM Comput. Surv.
[21] Morteza Saheb Zamani, et al. An architecture framework for an adaptive extensible processor, 2008, The Journal of Supercomputing.
[22] Erik R. Altman, et al. Welcome to the Opportunities of Binary Translation, 2000, Computer.
[23] Luigi Carro, et al. Towards a multiple-ISA embedded system, 2013, J. Syst. Archit.
[24] John Wawrzynek, et al. Exploiting Memory-Level Parallelism in Reconfigurable Accelerators, 2012, 2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines.
[25] Amin Ansari, et al. Bundled execution of recurring traces for energy-efficient general purpose processing, 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[26] Michael Bedford Taylor, et al. A Landscape of the New Dark Silicon Design Regime, 2013, IEEE Micro.
[27] João M. P. Cardoso, et al. On identifying and optimizing instruction sequences for dynamic compilation, 2010, 2010 International Conference on Field-Programmable Technology.
[28] Tulika Mitra, et al. Characterizing embedded applications for instruction-set extensible processors, 2004, Proceedings. 41st Design Automation Conference, 2004.
[29] Anoop Gupta, et al. The SPLASH-2 programs: characterization and methodological considerations, 1995, ISCA.
[30] Steven Derrien, et al. Hybrid-DBT: Hardware/Software Dynamic Binary Translation Targeting VLIW, 2019, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.
[31] John E. Stone, et al. OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems, 2010, Computing in Science & Engineering.
[32] Luigi Carro, et al. Transparent Reconfigurable Acceleration for Heterogeneous Embedded Applications, 2008, 2008 Design, Automation and Test in Europe.
[33] David W. Wall, et al. Limits of instruction-level parallelism, 1991, ASPLOS IV.
[34] Reiner W. Hartenstein. Coarse grain reconfigurable architecture (embedded tutorial), 2001, ASP-DAC '01.
[35] Jie Tan, et al. Dynamic Translation Optimization Method Based on Static Pre-Translation, 2019, IEEE Access.
[36] Yun Wang, et al. IA-32 execution layer: a two-phase dynamic translator designed to support IA-32 applications on Itanium®-based systems, 2003, Proceedings. 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003. MICRO-36.
[37] João M. P. Cardoso, et al. A Reconfigurable Architecture for Binary Acceleration of Loops with Memory Accesses, 2014, TRETS.
[38] Scott Hauck, et al. Reconfigurable computing: a survey of systems and software, 2002, CSUR.
[39] Jim D. Garside, et al. Optimizing Indirect Branches in Dynamic Binary Translators, 2016, ACM Trans. Archit. Code Optim.
[40] Karthikeyan Sankaralingam, et al. Dark Silicon and the End of Multicore Scaling, 2012, IEEE Micro.
[41] João M. P. Cardoso, et al. Dynamic Partial Reconfiguration of Customized Single-Row Accelerators, 2019, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.
[42] Daniel D. Gajski, et al. High-Level Synthesis: Introduction to Chip and System Design, 1992.
[43] Steven Derrien, et al. Hardware-accelerated dynamic binary translation, 2017, Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017.
[44] Johanna Ullrich, et al. From Hack to Elaborate Technique—A Survey on Binary Rewriting, 2019, ACM Comput. Surv.
[45] Fadi J. Kurdahi, et al. MorphoSys: An Integrated Reconfigurable System for Data-Parallel and Computation-Intensive Applications, 2000, IEEE Trans. Computers.
[46] Hamid Noori, et al. Improving performance and energy efficiency of embedded processors via post-fabrication instruction set customization, 2010, The Journal of Supercomputing.
[47] Kiyoung Choi, et al. A host-accelerator communication architecture design for efficient binary acceleration, 2011, 2011 International SoC Design Conference.
[48] Michael Laurenzano, et al. PEBIL: Efficient static binary instrumentation for Linux, 2010, 2010 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS).
[49] Scott A. Mahlke, et al. Polymorphic Pipeline Array: A flexible multicore accelerator with virtualized execution for mobile multimedia applications, 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[50] Steven J. E. Wilton, et al. Interconnect architectures for modulo-scheduled coarse-grained reconfigurable arrays, 2004, Proceedings. 2004 IEEE International Conference on Field-Programmable Technology (IEEE Cat. No.04EX921).
[51] Luigi Carro, et al. A transparent and energy aware reconfigurable multiprocessor platform for simultaneous ILP and TLP exploitation, 2013, 2013 Design, Automation & Test in Europe Conference & Exhibition (DATE).
[52] Scott A. Mahlke, et al. Application-Specific Processing on a General-Purpose Core via Transparent Instruction Set Customization, 2004, 37th International Symposium on Microarchitecture (MICRO-37'04).
[53] Vivienne Sze, et al. Efficient Processing of Deep Neural Networks: A Tutorial and Survey, 2017, Proceedings of the IEEE.
[54] Luigi Carro, et al. Boosting Parallel Applications Performance on Applying DIM Technique in a Multiprocessing Environment, 2011, Int. J. Reconfigurable Comput.
[55] Karthikeyan Sankaralingam, et al. Dynamically Specialized Datapaths for energy efficient computing, 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.
[56] R. H. Dennard, et al. Design of Ion-implanted MOSFET's with Very Small Physical Dimensions, 1974, Proceedings of the IEEE.
[57] B. Ramakrishna Rau, et al. Iterative modulo scheduling: an algorithm for software pipelining loops, 1994, MICRO 27.
[58] David Novo, et al. Selective Flexibility: Creating Domain-Specific Reconfigurable Arrays, 2013, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.
[59] Luigi Carro, et al. A run-time modulo scheduling by using a binary translation mechanism, 2014, 2014 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS XIV).
[60] Mihai Sima, et al. Coarse-grain reconfigurable architectures - taxonomy -, 2009, 2009 IEEE Pacific Rim Conference on Communications, Computers and Signal Processing.
[61] Michael Gschwind, et al. Dynamic and Transparent Binary Translation, 2000, Computer.
[62] Pedro C. Diniz, et al. Custom FPGA-based micro-architecture for streaming computing, 2011, 2011 VII Southern Conference on Programmable Logic (SPL).
[63] Koen Bertels, et al. The Instruction-Set Extension Problem: A Survey, 2008, TRETS.
[64] Michael Gschwind, et al. Dynamic Binary Translation and Optimization, 2001, IEEE Trans. Computers.
[65] Aviral Shrivastava, et al. Memory access optimization in compilation for coarse-grained reconfigurable architectures, 2011, TODE.
[66] Jürgen Teich, et al. Hardware/Software Codesign: The Past, the Present, and Predicting the Future, 2012, Proceedings of the IEEE.
[67] Frank Vahid, et al. A configurable logic architecture for dynamic hardware/software partitioning, 2004, Proceedings Design, Automation and Test in Europe Conference and Exhibition.
[68] Kai Li, et al. The PARSEC benchmark suite: Characterization and architectural implications, 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).
[69] N. Bansal, et al. Analysis of the Performance of Coarse-Grain Reconfigurable Architectures with Different Processing Element Configurations, 2003.
[70] Hossein Pedram, et al. An efficient heterogeneous reconfigurable functional unit for an adaptive dynamic extensible processor, 2007, VLSI-SoC.
[71] Carl Ebeling, et al. Architecture design of reconfigurable pipelined datapaths, 1999, Proceedings 20th Anniversary Conference on Advanced Research in VLSI.
[72] Nicholas Nethercote, et al. Valgrind: a framework for heavyweight dynamic binary instrumentation, 2007, PLDI '07.
[73] Scott A. Mahlke, et al. The superblock: An effective technique for VLIW and superscalar compilation, 1993, The Journal of Supercomputing.
[74] Mingwei Zhang, et al. A platform for secure static binary instrumentation, 2014, VEE '14.
[75] Kiyoung Choi, et al. Binary acceleration using coarse-grained reconfigurable architecture, 2010, CARN.
[76] Mike Van, et al. UQBT: Adaptable Binary Translation at Low Cost, 2000.
[77] João M. P. Cardoso, et al. Generation of Customized Accelerators for Loop Pipelining of Binary Instruction Traces, 2017, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.
[78] Taewhan Kim, et al. Clock Tree synthesis for TSV-based 3D IC designs, 2011, TODE.
[79] Miodrag Potkonjak, et al. MediaBench: a tool for evaluating and synthesizing multimedia and communications systems, 1997, Proceedings of 30th Annual International Symposium on Microarchitecture.
[80] Gorker Alp Malazgirt, et al. Customizing VLIW processors from dynamically profiled execution traces, 2015, Microprocess. Microsystems.
[81] Frank Vahid, et al. Frequent loop detection using efficient nonintrusive on-chip hardware, 2005, IEEE Transactions on Computers.
[82] Frank Vahid, et al. Warp Processing: Dynamic Translation of Binaries to FPGA Circuits, 2008, Computer.
[83] Kevin Skadron, et al. Accelerating SQL database operations on a GPU with CUDA, 2010, GPGPU-3.
[84] Liang Chen, et al. A Just-in-Time Customizable processor, 2013, 2013 IEEE/ACM International Conference on Computer-Aided Design (ICCAD).
[85] Torsten Hoefler, et al. Transformations of High-Level Synthesis Codes for High-Performance Computing, 2018, IEEE Transactions on Parallel and Distributed Systems.
[86] Nuno Roma, et al. Efficient data-stream management for shared-memory many-core systems, 2015, 2015 25th International Conference on Field Programmable Logic and Applications (FPL).
[87] Kevin Skadron, et al. Scalable parallel programming, 2008, 2008 IEEE Hot Chips 20 Symposium (HCS).
[88] S. A. Manavski, et al. CUDA Compatible GPU as an Efficient Hardware Accelerator for AES Cryptography, 2007, 2007 IEEE International Conference on Signal Processing and Communications.
[89] Muhammad Shafique, et al. Concepts, architectures, and run-time systems for efficient and adaptive reconfigurable processors, 2011, 2011 NASA/ESA Conference on Adaptive Hardware and Systems (AHS).
[90] Carl Ebeling, et al. Static versus scheduled interconnect in Coarse-Grained Reconfigurable Arrays, 2009, 2009 International Conference on Field Programmable Logic and Applications.
[91] Scott A. Mahlke, et al. Modulo scheduling for highly customized datapaths to increase hardware reusability, 2008, CGO '08.
[92] Frank Vahid, et al. Thread Warping: Dynamic and Transparent Synthesis of Thread Accelerators, 2011, TODE.