Analysis of OpenCL Work-Group Reduce for Intel GPUs

As hardware becomes more flexible to program, software APIs must expose hardware features in a portable way. The OpenCL 2.0 API exposes thread communication through its newly defined work-group functions. In this paper we focus on two implementations of the work-group functions in the OpenCL compiler backend for Intel's GPUs. We first describe the particularities of Intel's GEN GPU architecture and of the Beignet open source OpenCL project. Both work-group implementations are then detailed: one is based on thread-to-thread message passing, the other on thread reads and writes to shared local memory. The focus is on choosing the optimal variant based on how each implementation maps to the hardware and how that mapping affects performance.