Optically-Clocked Instruction Set Extensions for High Efficiency Embedded Processors

We propose a technique to localize computation in Instruction Set Extensions (ISEs) that are clocked at very high speed relative to the processor. To save power, data transfers to and from the Custom Instruction Units (CIUs) are synchronized via an optical signal detected by a Single-Photon Avalanche Diode (SPAD) with timing uncertainty as low as 50 ps. The CIUs comprise a free-standing local oscillator serving a computing area of a few tens of square micrometers; because no high-frequency clock has to be distributed over long distances, power dissipation is greatly reduced. The approach builds on the globally asynchronous locally synchronous (GALS) concept, with the granularity of the local domains reduced to a minimum, which enables extremely high local clock frequencies at low power while minimizing substrate noise injection and intra-chip interference. Because data is exchanged only at fixed synchronization points in time, the approach dispenses with expensive synchronization techniques such as FIFOs, delays, or flip-flop-based synchronizers. The paradigm is demonstrated on a chip designed and fabricated in a standard 90 nm CMOS technology, and a full characterization demonstrates the suitability of the approach.
