Optimizing Data Parallel Operations on Many-Core Platforms

Data parallel operations are widely used in game, multimedia, physics, and data-intensive scientific applications. Unlike control parallelism, data parallelism comes from simultaneous operations across large sets of collection-oriented data such as vectors and matrices. A simple implementation can use OpenMP directives to execute operations on multiple data elements concurrently. However, this implementation introduces many barriers across data parallel operations, and even within a single data parallel operation, to synchronize the concurrent threads. This synchronization cost can outweigh the benefit of data parallelism. Moreover, barriers prohibit many optimization opportunities among parallel regions. In this paper, we describe an approach to optimizing data parallel operations on many-core platforms, called sub-primitive fusion, which reduces expensive barriers by merging code regions of data parallel operations based on data flow information. It also replaces the remaining barriers with light-weight synchronization mechanisms. This approach enables further optimization opportunities such as data reuse across data parallel operations, dynamic partitioning of fused data parallel operations, and semi-asynchronous parallel execution among the threads. We present preliminary experimental results for sparse matrix kernels that demonstrate the benefits of this approach. We observe speedups of up to 5x on an 8-way SMP machine over serial execution.