Scalable and deterministic timing-driven parallel placement for FPGAs

This paper describes a parallel implementation of the timing-driven simulated annealing placement engine in VPR 5.0. Restricting each move to a confined neighborhood makes it possible to evaluate a large number of non-conflicting moves in parallel while still producing a deterministic result. The full timing-driven algorithm is parallelized, including the detailed timing analysis updates performed periodically as placement progresses. Limiting the move distance slightly degrades placement quality, but it is necessary to expose more parallelism. The overall bounding box metric degrades by about 11% and the critical path delay metric by about 8% compared to VPR's original algorithm, but we show that the amount of degradation is independent of the number of threads. Overall, the parallel implementation scales to a speedup of 123x using 25 threads compared to VPR. With additional tuning effort, we believe the algorithm can be scaled to a larger number of threads, and perhaps even run on a GPU, with little additional quality degradation.
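
The idea of confining moves so that they cannot conflict can be illustrated with a minimal sketch. This is not the paper's implementation: the function names (`make_windows`, `anneal_window`, `parallel_pass`), the frozen-snapshot scheme, the per-window seeding, and the wirelength-only cost are all assumptions chosen for brevity, and the timing-driven cost terms are omitted. Each window swaps only its own blocks and reads everything else from a snapshot, so the windows could be handed to separate threads and the result would not depend on scheduling.

```python
import math
import random

def hpwl(nets, pos):
    """Sum of half-perimeter bounding boxes over all nets."""
    total = 0
    for net in nets:
        xs = [pos[b][0] for b in net]
        ys = [pos[b][1] for b in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total

def make_windows(width, height, size):
    """Split the grid into disjoint size-by-size windows (sets of locations)."""
    windows = []
    for y0 in range(0, height, size):
        for x0 in range(0, width, size):
            windows.append({(x, y)
                            for x in range(x0, min(x0 + size, width))
                            for y in range(y0, min(y0 + size, height))})
    return windows

def anneal_window(window_id, cells, snapshot, nets, temperature, seed, moves=20):
    """Propose swaps between blocks that lie inside one window only.

    Positions outside the window are read from the frozen snapshot, so the
    outcome does not depend on what other windows are doing.
    """
    rng = random.Random(seed * 1_000_003 + window_id)  # private random stream
    local = dict(snapshot)                              # private working copy
    blocks = sorted(b for b, xy in snapshot.items() if xy in cells)
    if len(blocks) < 2:
        return {}
    cost = hpwl(nets, local)
    for _ in range(moves):
        a, b = rng.sample(blocks, 2)
        local[a], local[b] = local[b], local[a]
        new_cost = hpwl(nets, local)
        delta = new_cost - cost
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            cost = new_cost                              # accept the swap
        else:
            local[a], local[b] = local[b], local[a]      # reject: undo
    return {b: local[b] for b in blocks}

def parallel_pass(pos, nets, windows, temperature, seed, order=None):
    """Run one pass over all windows; each iteration could be its own thread."""
    snapshot = dict(pos)
    result = dict(pos)
    ids = list(range(len(windows))) if order is None else order
    for w in ids:
        # Windows own disjoint sets of blocks, so these updates never collide.
        result.update(anneal_window(w, windows[w], snapshot, nets,
                                    temperature, seed))
    return result
```

Calling `parallel_pass` twice with the same seed but a different `order` yields the same placement, which is the sense in which this sketch is deterministic. A real annealer would also shift the window boundaries between passes so that blocks can migrate across the chip; that step is left out here.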
