Deflection-routed butterfly fat trees on FPGAs

Bufferless, deflection-routed Butterfly Fat Trees (BFTs) can outperform state-of-the-art FPGA overlay NoCs such as Hoplite by 2–5× on throughput and ≈5× on worst-case latency at identical PE counts, and by ≈1.5× on throughput at identical resource costs above 16K LUTs, for statistical traffic patterns. In this paper, we show how to modify the tree connectivity and routing function to support deflection routing on the BFT topology. We introduce the idea of localized deflections, which trap deflected packets within a single level of a multi-level BFT to avoid the long round-trip penalty traditionally associated with deflection routing. Across a range of statistical traffic patterns, we show a sustained throughput improvement of 2–5× over Hoplite for system sizes as large as 512 PEs at injection rates above 20% when using localized deflections. We also show how the configurable bisection bandwidth of the BFT, modeled with the Rent parameter 0≤p≤1, allows us to choose the best-performing NoC at a desired cost. For instance, our NoC generator can produce simple trees (p=0) for low-cost applications below 2K LUTs, mesh-equivalent BFTs (p=0.5) for real-world applications with locality below 10K LUTs, and crossbars (p=1) when cost is not a constraint, above 64K LUTs. For workloads with locality, we therefore recommend the mesh-equivalent BFT topology with p=0.5.
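As a rough illustration of how the Rent parameter sets bisection bandwidth, the sketch below applies Rent's rule (upward links ≈ c·N^p for a subtree of N PEs) to each level of a binary tree. The function name `rent_uplinks`, the constant `c`, and the rounding are assumptions made for this example; it is not the paper's NoC generator and ignores how the actual BFT switches are wired.

```python
def rent_uplinks(num_pes: int, p: float, c: float = 1.0) -> list[tuple[int, int]]:
    """Return (subtree_size, uplinks) per level of a binary tree over num_pes leaves.

    Hypothetical helper: uplinks follow Rent's rule, c * subtree_size ** p,
    rounded to the nearest integer and clamped to at least one link.
    """
    if num_pes < 2 or num_pes & (num_pes - 1):
        raise ValueError("num_pes must be a power of two, at least 2")
    if not 0.0 <= p <= 1.0:
        raise ValueError("Rent parameter p must lie in [0, 1]")

    levels = []
    size = 1  # leaves first: each PE is a subtree of size 1
    while size < num_pes:
        uplinks = max(1, round(c * size ** p))
        levels.append((size, uplinks))
        size *= 2  # move one level up: subtrees double in size
    return levels


if __name__ == "__main__":
    for p in (0.0, 0.5, 1.0):
        print(p, [u for _, u in rent_uplinks(16, p)])
```

With p=0 every subtree keeps a single upward link (a simple tree), with p=1 the upward link count grows linearly with subtree size (crossbar-like bandwidth), and p=0.5 gives the square-root growth that matches the bisection scaling of a 2D mesh.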
