Efficient design-space exploration of custom instruction-set extensions

Customization of processors with instruction set extensions (ISEs) is a technique that improves performance through parallelization with a reasonable area overhead, in exchange for additional design effort. This thesis presents a collection of novel techniques that reduce the design effort and cost of generating ISEs by advancing automation and reconfigurability. In addition, these techniques maximize the performance gained as a function of the additional committed resources. Incorporating ISEs into a processor design implies development at many levels. Most prior work on ISEs addresses the stages of the design separately: identification, selection, and implementation. However, the interactions between these stages also hold important design trade-offs. In particular, this thesis addresses the lack of interaction between the hardware implementation stage and the two previous stages. Interaction with the implementation stage has been mostly limited to accurately measuring the area and timing requirements of the implementation of each ISE candidate as a separate hardware module. However, the need to independently generate a hardware datapath for each ISE limits the flexibility of the design and the performance gains. Hence, resource sharing is essential in order to create a customized unit with multi-function capabilities. Previously proposed resource-sharing techniques aggressively share resources amongst the ISEs, thus minimizing the area of the solution at any cost. However, it is shown that aggressively sharing resources leads to large ISE datapath latency. Thus, this thesis presents an original heuristic that can be parameterized in order to control the degree of resource sharing amongst a given set of ISEs, thereby permitting the exploration of the existing implementation trade-offs between instruction latency and area savings. In addition, this thesis introduces an innovative predictive model that is able to quickly expose the optimal trade-offs
