Learning Optimizations for Hardware Accelerated Designs

Many emerging applications require hardware acceleration due to their growing computational intensities. These accelerated designs use heterogeneous hardware, such as GPUs, FPGAs and multi-core CPUs to process the intensive computations at a higher rate. The first part of this work provides two paradigms of hardware accelerated biomedical applications. These paradigms achieved 115X and 273X speedups respectively.Developing these paradigms taught us that, in order to efficiently utilize the heterogeneous accelerators, the designer needs to carefully investigate which device is the most suitable accelerator for a particular computing task. In addition, the designer needs to effectively optimize the computations to fully exploit the computing power of the selected accelerator. This process is called design space exploration (DSE). Heterogeneous DSE requires multiple programming skills for these different types of devices.In recent years, there is a trend to use one unified programming language for multiple heterogeneous devices. The SDKs and hardware synthesis tools have enabled OpenCL as one unified language to program heterogeneous devices including GPUs, FPGAs, and multi-core CPUs. However, one major bottleneck for DSE still exists. In contrast to GPU and CPU OpenCL code compilation, which only consumes several milliseconds, implementing OpenCL designs on a FPGA requires hours of compilation time. Moreover, merely tuning a few programming parameters in the OpenCL code will result in an abundance of possible designs. Implementing all these designs requires months of compilation time. Exploring the FPGA design space with brute force is therefore impractical.The second part of this work addresses this issue by providing a machine learning approach for automatic DSE. This machine learning approach automatically identifies the optimal designs by learning from a few training samples. In comparison with other state-of-the-art machine learning frameworks, this approach reduces the amount of hardware compilations by 3.28X, which is equivalent to hundreds of compute hours. This work also provides a data mining method that enables the machine to automatically use the estimation data to replace the time consuming end-to-end FPGA training samples for DSE. Mining these estimation data further reduces the amount of hardware compilations by 1.26X.