Customized pipeline and instruction set architecture for embedded processing engines

Custom instructions can improve the execution speed and code compression of embedded applications. However, more efficient custom instructions require a higher number of simultaneous register-file accesses. Larger register files are more power hungry and require complex forwarding interconnects. The limited number of ports on the base processor's register file therefore generally limits the size and efficiency of custom instructions. Recent research has focused on overcoming this limitation through innovative architectural techniques supplemented by customized compilation. However, to the best of our knowledge, few studies take into account the complete pipeline design and implementation considerations. This paper proposes a customized instruction set and pipeline architecture for an optimized embedded engine. The proposed architecture increases performance by enhancing the available register-file data bandwidth through register-access pipelining. The improvements are achieved by introducing double-word custom instructions whose register-file accesses are overlapped in the pipeline. Potential hazards in such instructions are resolved by the introduced pipeline backwarding concept, yielding higher performance and code compression. Although we study the effectiveness of the proposed architecture on domain-specific workloads from packet-processing benchmarks, the developed framework and architecture are applicable to other embedded application domains.
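
The bandwidth argument can be illustrated with a small back-of-the-envelope sketch. It is not the paper's model: it assumes a hypothetical base processor with a fixed number of register-file read ports, and it assumes that each word of a multi-word custom instruction gets its own register-read cycle in the pipeline. The function names and the two-port figure in the usage note are illustrative only.

```python
from math import ceil


def operand_read_cycles(num_operands: int, read_ports: int) -> int:
    """Cycles needed to fetch all source operands when at most
    `read_ports` registers can be read per cycle (assumed model)."""
    if read_ports <= 0:
        raise ValueError("read_ports must be positive")
    if num_operands < 0:
        raise ValueError("num_operands must be non-negative")
    return ceil(num_operands / read_ports)


def max_source_operands(read_ports: int, instruction_words: int) -> int:
    """Upper bound on source operands if every instruction word is
    granted one register-read cycle (assumption, not from the paper)."""
    if read_ports <= 0 or instruction_words <= 0:
        raise ValueError("arguments must be positive")
    return read_ports * instruction_words
```

Under these assumptions, a single-word instruction on a two-port register file can read at most `max_source_operands(2, 1)` operands, while a double-word instruction whose reads are spread over two consecutive cycles can read up to `max_source_operands(2, 2)` without adding ports.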
