Automatic Locality Exploitation in the Codelet Model

State-of-the-art codelet scheduling focuses on dynamic workload balancing of codelets (which are similar to tasks). While this approach can achieve reasonable performance because computation resources are fully utilized, it may not attain optimal energy savings. In this paper, targeting the IBM Cyclops-64 manycore system, we propose a novel polynomial-time algorithm that finds the optimal codelet schedule in terms of maximum locality and minimum global memory accesses. Our algorithm leverages static information about locality among codelets to achieve better performance and energy efficiency. By using local buffers to pass data produced by one codelet to another, global memory accesses can be greatly reduced. Experimental results on an IBM Cyclops-64 emulator that we developed show that the schedule produced by our algorithm removes up to 59.7% of global memory accesses, improves performance by up to 68.1%, and reduces energy consumption by up to 40.7% compared to state-of-the-art codelet scheduling.
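To illustrate why passing data through local buffers reduces global memory traffic, the following is a minimal sketch under a toy cost model. It is not the algorithm proposed in the paper. The function `global_accesses`, the edge format, the example graph, and the assumption that a cross-core edge costs one global write plus one global read per word are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical edge format: (producer codelet, consumer codelet, words passed).
Edge = Tuple[str, str, int]


def global_accesses(edges: List[Edge], placement: Dict[str, int]) -> int:
    """Count global-memory word accesses for a codelet placement (toy model)."""
    total = 0
    for producer, consumer, words in edges:
        if placement[producer] != placement[consumer]:
            # Assumed cost: the producer writes the data to global memory
            # and the consumer reads it back, so each word is touched twice.
            total += 2 * words
        # Same core: the data is assumed to stay in a local buffer,
        # so the edge generates no global-memory traffic.
    return total


# A small illustrative codelet graph (not taken from the paper).
edges = [("A", "B", 64), ("B", "C", 64), ("A", "D", 16)]

# A placement that only balances load: two codelets per core.
balanced = {"A": 0, "B": 1, "C": 0, "D": 1}

# A placement that keeps the heavy producer-consumer chain on one core.
locality_aware = {"A": 0, "B": 0, "C": 0, "D": 1}

print(global_accesses(edges, balanced))        # 288
print(global_accesses(edges, locality_aware))  # 32
```

In this toy model the locality-aware placement trades some load balance for far fewer global accesses, which is the kind of trade-off the abstract describes; the actual scheduler and its cost model may differ.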
