Optimized Distribution of an Accelerated Convolutional Neural Network across Multiple FPGAs

Convolutional Neural Networks (CNNs) have achieved resounding success, especially in computer vision and collaborative filtering. The general trend in CNN architectures has been to build deeper networks with a substantial number of convolution filters and several large feature maps. As a result, most current CNN inference routines are highly compute-intensive and have significant storage requirements. Field Programmable Gate Arrays (FPGAs) are among the most popular choices for accelerating CNN inference workloads, as they can execute complex, massively parallel computations. Recently, notable efforts have been made to distribute CNN inference workloads across multiple FPGAs [1]. These strategies, however, do not account for variations in computational complexity across the layers of a CNN, resulting in suboptimal performance gains. This work proposes an optimal distribution of CNN layers across FPGA nodes that accounts for each layer's performance to maximize overall throughput.
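
The abstract does not specify the partitioning algorithm, but the underlying optimization can be framed as a min-max pipeline-partitioning problem: split the ordered CNN layers into contiguous stages, one per FPGA, so that the slowest stage (the pipeline bottleneck) is as fast as possible, since steady-state pipeline throughput is the reciprocal of the bottleneck stage latency. The following sketch is a hypothetical illustration of that idea, not the paper's method; the function name `partition_layers` and the per-layer latency numbers are assumptions.

```python
from functools import lru_cache

def partition_layers(layer_latency_ms, num_fpgas):
    """Split an ordered list of per-layer latencies into contiguous
    pipeline stages (one per FPGA), minimizing the slowest stage.
    Throughput is bounded by 1 / max(stage latency), so minimizing
    the bottleneck stage maximizes overall throughput."""
    n = len(layer_latency_ms)
    # prefix[i] = total latency of layers 0..i-1 (for O(1) stage sums)
    prefix = [0.0]
    for t in layer_latency_ms:
        prefix.append(prefix[-1] + t)

    @lru_cache(maxsize=None)
    def best(start, fpgas_left):
        # Minimal achievable bottleneck for layers start..n-1
        # distributed over fpgas_left devices.
        if fpgas_left == 1:
            return prefix[n] - prefix[start]
        result = float("inf")
        # First stage takes layers start..cut-1; leave at least one
        # layer for each remaining FPGA.
        for cut in range(start + 1, n - fpgas_left + 2):
            stage = prefix[cut] - prefix[start]
            result = min(result, max(stage, best(cut, fpgas_left - 1)))
        return result

    return best(0, num_fpgas)

# Example with hypothetical per-layer latencies (ms) for a 5-layer CNN:
# the best 3-way split is {4.2}, {7.9}, {3.1, 3.0, 2.2}, bottleneck 8.3 ms.
print(partition_layers([4.2, 7.9, 3.1, 3.0, 2.2], num_fpgas=3))
```

A dynamic program like this runs in O(n^2 * K) time for n layers and K FPGAs, which is negligible next to inference itself; in practice the per-layer latency estimates would come from profiling each layer on the target FPGA.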