Taming the beast: Programming Peta-FLOP class Deep Learning Systems

1 EXTENDED ABSTRACT

The field of Artificial Intelligence (AI) has witnessed tremendous growth in recent years with the advent of Deep Neural Networks (DNNs), which have achieved state-of-the-art performance on challenging cognitive tasks involving images, videos, text and natural language. They are being increasingly deployed in many real-world services and products, and have pervaded the spectrum of computing devices from mobile/IoT devices to server-class platforms. However, DNNs are highly compute- and data-intensive workloads, far outstripping the capabilities of today's computing platforms. For example, state-of-the-art image recognition DNNs require billions of operations to classify a single image, while training DNN models demands exaflops of compute and uses massive datasets requiring hundreds of gigabytes of memory.

One approach to addressing the computational challenges posed by DNNs is the design of hardware accelerators, whose compute cores, memory hierarchy and interconnect topology are specialized to match the DNN's compute and communication characteristics. Several such designs, ranging from low-power IP cores to large-scale accelerator systems, have been proposed in the literature. Some factors that enable the design of specialized systems for DNNs are: (i) their computations can be expressed as static data-flow graphs, (ii) their computation patterns are regular, with no data-dependent control flow, and offer abundant opportunities for data reuse, and (iii) their functionality can be encapsulated within a small set (tens) of basic functions (e.g., convolution, matrix multiplication).

That said, DNNs also exhibit abundant heterogeneity at various levels. Across layers, the number of input and output channels and the dimensions of each feature map differ substantially. Further, each layer comprises operations whose Bytes/FLOP requirements vary by over two orders of magnitude.
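To make the Bytes/FLOP variation concrete, the following is a minimal sketch (the layer shapes are illustrative assumptions, not taken from any specific network) that estimates the arithmetic intensity of a convolutional layer versus a fully connected layer:

```python
# Hedged sketch: estimate Bytes/FLOP for two common DNN layer types,
# assuming 4-byte operands and that each tensor is moved from memory once.

def conv_bytes_per_flop(h, w, c_in, c_out, k, dtype_bytes=4):
    """Bytes moved per FLOP for a k x k convolution (unit stride, 'same' padding)."""
    flops = 2 * h * w * c_in * c_out * k * k              # multiply-accumulates
    bytes_moved = dtype_bytes * (h * w * c_in             # input feature map
                                 + h * w * c_out          # output feature map
                                 + k * k * c_in * c_out)  # weights
    return bytes_moved / flops

def fc_bytes_per_flop(n_in, n_out, dtype_bytes=4):
    """Bytes moved per FLOP for a fully connected (matrix-vector) layer."""
    flops = 2 * n_in * n_out
    bytes_moved = dtype_bytes * (n_in + n_out + n_in * n_out)
    return bytes_moved / flops

# A convolution reuses each weight across the whole feature map...
conv = conv_bytes_per_flop(h=56, w=56, c_in=64, c_out=64, k=3)
# ...while a fully connected layer touches every weight exactly once.
fc = fc_bytes_per_flop(n_in=4096, n_out=4096)
print(f"conv: {conv:.4f} B/FLOP, fc: {fc:.4f} B/FLOP, ratio: {fc / conv:.0f}x")
```

Under these assumed shapes, the fully connected layer's Bytes/FLOP exceeds the convolution's by a factor of several hundred, consistent with the two-orders-of-magnitude spread noted above.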
This heterogeneity in compute characteristics engenders a wide range of possibilities for spatiotemporally mapping DNNs onto accelerator platforms, defined in terms of how computations are split across the different compute elements in the architecture and how the computations assigned to each compute element are temporally sequenced. We are therefore led to ask: is it possible to systematically explore the design space of mapping configurations to maximize a DNN's performance on a given accelerator architecture under a variety of dataflows? How will the computations be partitioned and sequenced across the processing