Generating Configurable Hardware from Parallel Patterns

In recent years the computing landscape has shifted increasingly towards specialized accelerators. Field-programmable gate arrays (FPGAs) are particularly promising for implementing these accelerators: they offer significant performance and energy improvements over CPUs for a wide class of applications, and they are far more flexible than fixed-function ASICs. However, FPGAs are difficult to program. Traditional programming models for reconfigurable logic fall into two camps. Low-level hardware description languages like Verilog and VHDL produce very efficient designs but have none of the productivity features of modern software languages. Low-level software languages like C and OpenCL, coupled with high-level synthesis (HLS) tools, are easier to use but typically produce designs that are far less efficient. Functional languages with parallel patterns are a better fit for hardware generation because they provide high-level abstractions to programmers with little experience in hardware design and avoid many of the problems faced when generating hardware from imperative languages. In this paper, we identify two important optimizations for using parallel patterns to generate efficient hardware: tiling and metapipelining. We present a general representation of tiled parallel patterns, and provide rules for automatically tiling patterns and generating metapipelines. We demonstrate experimentally that these optimizations result in speedups of up to 39.4× on a set of benchmarks from the data analytics domain.
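
As a rough illustration of the two optimizations named above, the following Python sketch strip-mines a map-reduce pattern into an outer pattern over tiles and an inner pattern over the elements of one tile, then estimates what overlapping the per-tile stages could save. It is a software analogy under stated assumptions, not the paper's representation or tiling rules: the names `fold_untiled`, `fold_tiled`, and `pipeline_steps` are hypothetical, `combine` is assumed to be associative with `zero` as its identity, and every pipeline stage is assumed to take one time step.

```python
from functools import reduce


def fold_untiled(xs, f, combine, zero):
    """Map f over xs and reduce with combine, as one flat pattern."""
    return reduce(combine, (f(x) for x in xs), zero)


def fold_tiled(xs, f, combine, zero, tile_size):
    """Same computation, split into an outer loop over tiles and an
    inner map-reduce over the elements of each tile.

    Assumes combine is associative and zero is its identity, so that
    regrouping the reduction by tile does not change the result.
    """
    acc = zero
    for start in range(0, len(xs), tile_size):
        # Stands in for loading one tile into a small on-chip buffer.
        tile = xs[start:start + tile_size]
        # Inner pattern: works only on data held in the tile.
        partial = reduce(combine, (f(x) for x in tile), zero)
        # Outer pattern: merges the per-tile partial results.
        acc = combine(acc, partial)
    return acc


def pipeline_steps(num_tiles, num_stages, overlapped):
    """Time steps needed to push num_tiles through num_stages stages.

    Hypothetical cost model: every stage takes exactly one step.
    Without overlap each tile runs all stages before the next starts;
    with overlap a new tile enters the first stage on every step.
    """
    if num_tiles == 0:
        return 0
    if overlapped:
        return num_tiles + num_stages - 1
    return num_tiles * num_stages


if __name__ == "__main__":
    data = list(range(10))
    square = lambda x: x * x
    add = lambda a, b: a + b
    print(fold_untiled(data, square, add, 0))
    print(fold_tiled(data, square, add, 0, tile_size=4))
    print(pipeline_steps(8, 3, overlapped=False))
    print(pipeline_steps(8, 3, overlapped=True))
```

Under this simplified model, the tiled and untiled folds agree whenever `combine` is associative, and overlapping a load, compute, and store stage across tiles reduces the step count from the product of tiles and stages to roughly their sum.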
