Constraint-aware configurable system-on-chip design for embedded computing

Field Programmable Gate Arrays (FPGAs) are rapidly becoming a popular alternative to ASICs as they continue to increase in capacity, functionality and performance. At the same time, FPGA developers face increasingly aggressive design constraints on power, delay and area, while also contending with shorter Time-to-Market (TTM) windows and pressure to lower Non-Recurring Engineering (NRE) costs in embedded systems development. This research proposes efficient techniques for processor subsetting and customization, as well as for the rapid generation of application-specific hardware accelerators, in order to meet the design constraints of configurable System-on-Chip (SoC) platforms. A processor-agnostic technique has been devised for subsetting soft-core processors that relies on the front-end output generated by the LLVM compiler for an application. The approach provides a systematic method for the application-aware subsetting of micro-architecture subsystems of a soft-core processor, such as hardware multipliers and floating-point units. Evaluations based on widely used benchmarks show that the proposed method can subset soft-core processors reliably and rapidly without compromising compute performance. Next, a technique for the architecture-aware enumeration of custom instructions has been proposed, which identifies area-efficient custom instructions by employing FPGA resource-aware pruning of the search space. Experimental results based on applications from widely used benchmark suites confirm that deploying custom instructions identified in this way can improve compute performance by up to 65%. Instruction-level parallelism (ILP) has also been exploited to further improve compute performance by identifying profitable coarse-grained custom instructions.
It has been demonstrated that custom instructions identified using the proposed method can accelerate computations by up to 39% compared to a base-processor-only implementation. Traditional custom instruction generation methods cannot incorporate memory-dependent basic blocks; to address this, a novel technique for accelerating such blocks has been proposed. A detailed data dependency analysis based on pre-defined memory allocation in an application has been developed to guarantee the identification of

[117]  Donald E. Thomas,et al.  Instruction subsetting: Trading power for programmability , 1998, Proceedings IEEE Computer Society Workshop on VLSI'98 System Level Design (Cat. No.98EX158).