Operation chaining asynchronous pipelined circuits

We define operation chaining (op-chaining) as an optimization problem: determining the pipeline depth that best balances performance against energy in pipelined asynchronous designs. Because asynchronous circuits have no clock-period requirement, pipeline stages can have non-uniform latencies. We exploit this to coalesce several stages into one, saving power and area by eliminating control-path resources from the pipeline. The trade-off is potentially reduced pipeline parallelism. In this paper, we formally define this optimization as a graph covering problem that finds sub-graphs, each of which will be synthesized as a single op-chained pipeline stage. We then define the solution space of provably correct solutions and present an algorithm that searches this space efficiently. The search technique partitions the graph based on post-dominator relationships to find sub-graphs that are potential op-chain candidates. We use knowledge of the Global Critical Path (GCP) [13] to evaluate the performance impact of accepting a candidate sub-graph, and we formulate a heuristic cost function to model this trade-off. The algorithm has quadratic time complexity in the size of the dataflow graph. We have implemented it within an automated asynchronous synthesis toolchain [12]. Experimental results from applying the algorithm to several media processing kernels show that the energy-delay and energy-delay-area products improve on average by about 1.4x and 1.8x, respectively, with maximum improvements of 5x and 18x.
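
To make the post-dominator partitioning step concrete, the following is a minimal sketch and not the paper's algorithm. It assumes the dataflow graph is a DAG with a single exit node reachable from every node, and it treats the nodes lying between a node and its immediate post-dominator as one possible op-chain candidate. The function names, the example graph, and this definition of a candidate region are all hypothetical. The GCP-based evaluation and the heuristic cost function are not modeled.

```python
from typing import Dict, List, Optional, Set

Graph = Dict[str, List[str]]  # node -> list of successor nodes


def post_dominators(succ: Graph, exit_node: str) -> Dict[str, Set[str]]:
    """Iterate pdom(n) = {n} | intersection of pdom(s) over successors s."""
    nodes = set(succ)
    pdom = {n: set(nodes) for n in nodes}
    pdom[exit_node] = {exit_node}
    changed = True
    while changed:
        changed = False
        for n in nodes - {exit_node}:
            new = set(nodes)
            for s in succ[n]:
                new &= pdom[s]
            new |= {n}
            if new != pdom[n]:
                pdom[n] = new
                changed = True
    return pdom


def immediate_post_dominator(n: str, pdom: Dict[str, Set[str]]) -> Optional[str]:
    """Closest strict post-dominator of n, or None for the exit node."""
    strict = pdom[n] - {n}
    for c in strict:
        # The closest one is post-dominated by every other strict post-dominator.
        if strict <= pdom[c]:
            return c
    return None


def candidate_region(n: str, succ: Graph, pdom: Dict[str, Set[str]]) -> Set[str]:
    """Nodes reachable from n before reaching its immediate post-dominator."""
    stop = immediate_post_dominator(n, pdom)
    seen: Set[str] = set()
    stack = [n]
    while stack:
        u = stack.pop()
        if u == stop or u in seen:
            continue
        seen.add(u)
        stack.extend(succ[u])
    return seen


# Hypothetical dataflow graph: a load fans out to two operations that rejoin.
example_succ: Graph = {
    "ld": ["add", "mul"],
    "add": ["sub"],
    "mul": ["sub"],
    "sub": ["st"],
    "st": [],
}
example_pdom = post_dominators(example_succ, "st")
```

In this example, `candidate_region("ld", example_succ, example_pdom)` collects the load and both of its consumers, since every path from `ld` rejoins at `sub`. Whether such a region would be accepted as an op-chained stage depends on the performance and energy trade-off described above.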

[1] F. Somenzi, et al. On the optimization power of retiming and resynthesis transformations, 1998, 1998 IEEE/ACM International Conference on Computer-Aided Design. Digest of Technical Papers (IEEE Cat. No.98CB36287).

[2] Miodrag Potkonjak, et al. MediaBench: a tool for evaluating and synthesizing multimedia and communications systems, 1997, Proceedings of 30th Annual International Symposium on Microarchitecture.

[3] Paul Day, et al. Four-phase micropipeline latch control circuits, 1996, IEEE Trans. Very Large Scale Integr. Syst.

[4] Seth Copen Goldstein, et al. Global Critical Path: A Tool for System-Level Timing Analysis, 2007, 2007 44th ACM/IEEE Design Automation Conference.

[5] Seth Copen Goldstein, et al. Spatial computation, 2004, ASPLOS XI.

[6] Seth Copen Goldstein, et al. C to Asynchronous Dataflow Circuits: An End-to-End Toolflow, 2004.

[7] S. Sapatnekar, et al. Minimum area retiming with equivalent initial states, 1997, ICCAD 1997.

[8] Peter A. Beerel, et al. Pipeline optimization for asynchronous circuits: complexity analysis and an efficient optimal algorithm, 2000, IEEE/ACM International Conference on Computer Aided Design. ICCAD - 2000. IEEE/ACM Digest of Technical Papers (Cat. No.00CH37140).

[9] Michael Kishinevsky, et al. Performance Analysis Based on Timing Simulation, 1994, 31st Design Automation Conference.

[10] Steven S. Muchnick, et al. Advanced Compiler Design and Implementation, 1997.

[11] Peter A. Beerel, et al. Bounding average time separations of events in stochastic timed Petri nets with choice, 1999, Proceedings. Fifth International Symposium on Advanced Research in Asynchronous Circuits and Systems.

[12] Ted Eugene Williams, et al. Self-timed rings and their application to division, 1992.

[13] S.C. Goldstein, et al. Leveraging Protocol Knowledge in Slack Matching, 2006, 2006 IEEE/ACM International Conference on Computer Aided Design.

[14] Tughrul Arslan, et al. System-level Scheduling on Instruction Cell Based Reconfigurable Systems, 2006, Proceedings of the Design Automation & Test in Europe Conference.

[15] Steven M. Nowick, et al. Resynthesis and peephole transformations for the optimization of large-scale asynchronous systems, 2002, DAC '02.

[16] Daniel Gajski, et al. An optimal clock period selection method based on slack minimization criteria, 1996, TODE.