Analysis of thread workgroup broadcast for Intel GPUs

As hardware becomes more programmable, software APIs must expose hardware features in a portable way. OpenCL 2.0 exposes thread-to-thread communication through its newly defined work-group functions. In this paper we analyze the work-group broadcast functionality in the OpenCL compiler backend for Intel's GPUs. We first describe the particularities of Intel's GEN GPU architecture and the open-source Beignet OpenCL project. We then describe the work-group broadcast implementation, which uses shared local memory reads and writes for thread-to-thread communication. Finally, we analyze its performance and how the implementation maps to hardware, motivating the design decisions.