Sesame: A User-Transparent Optimizing Framework for Many-Core Processors

With the integration of more computational cores and deeper memory hierarchies on modern processors, the performance gap between naively parallelized code and optimized code has become larger than ever before. Bridging this gap very often requires architecture-specific optimizations, which are difficult to implement for application programmers, who typically focus on the basic functionality of their code. Therefore, in this thesis, I focus on answering the following research question: "How can we address architecture-specific optimizations in a programmer-friendly way?" As an answer, I propose Sesame, an optimizing framework for parallel applications running on many-core processors. Taking simple parallelized code provided by the application programmer as input, Sesame chooses and applies the most suitable architecture-specific optimizations, aiming to improve overall application performance in a user-transparent way. In this short paper, I present the motivation for designing and implementing Sesame, its structure, and its modules. Furthermore, I describe the current status of Sesame, discussing our promising results in source-to-source vectorization, automated usage of local memory, and auto-tuning of implementation-specific parameters. Finally, I discuss my work in progress and sketch my ideas for finalizing Sesame's development and testing.
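To make the kind of transformation Sesame automates concrete, the OpenCL sketch below is a minimal illustrative example of my own (the kernel names, the 3-point stencil, and the tiling scheme are assumptions, not code from the paper): a naive kernel that reads its neighbours directly from global memory, and an equivalent version that first stages a tile plus a one-element halo into on-chip local memory, similar in spirit to what an automated local-memory optimization would generate.

    /* Illustrative only: a hand-written example of the kind of
     * architecture-specific optimization Sesame targets. */

    /* Naive 3-point stencil: every work-item reads its neighbours
     * directly from global memory. */
    __kernel void stencil_naive(__global const float *in,
                                __global float *out,
                                const int n)
    {
        int i = get_global_id(0);
        if (i > 0 && i < n - 1)
            out[i] = 0.25f * in[i - 1] + 0.5f * in[i] + 0.25f * in[i + 1];
    }

    /* The same stencil after a local-memory transformation: each
     * work-group stages its tile (plus a one-element halo on each side)
     * into __local memory, synchronizes, and computes from the on-chip
     * copy, so each input element is read from global memory only once
     * per work-group. */
    __kernel void stencil_local(__global const float *in,
                                __global float *out,
                                const int n,
                                __local float *tile)  /* size: local_size + 2 */
    {
        int gid = get_global_id(0);
        int lid = get_local_id(0);
        int lsz = get_local_size(0);

        /* Stage the interior element of this work-item. */
        tile[lid + 1] = (gid < n) ? in[gid] : 0.0f;

        /* First and last work-items also stage the halo elements. */
        if (lid == 0)
            tile[0] = (gid > 0 && gid - 1 < n) ? in[gid - 1] : 0.0f;
        if (lid == lsz - 1)
            tile[lsz + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;

        barrier(CLK_LOCAL_MEM_FENCE);

        if (gid > 0 && gid < n - 1)
            out[gid] = 0.25f * tile[lid] + 0.5f * tile[lid + 1]
                     + 0.25f * tile[lid + 2];
    }

On the host side, the local buffer would be provided in the usual OpenCL way, by calling clSetKernelArg with a size of (local_work_size + 2) * sizeof(float) and a NULL argument pointer. The point of Sesame is that the programmer writes only the naive version; the framework decides whether a transformation of this kind pays off on the target processor and applies it transparently.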
