D2.2 White-box methodologies, programming abstractions and libraries

This deliverable reports the results on white-box methodologies and the early results of the first prototype of libraries and programming abstractions, as produced by Work Package 2 (WP2) and available at project month 18. It reports i) the latest results of Task 2.2 on white-box methodologies, programming abstractions and libraries for developing energy-efficient data structures and algorithms, and ii) the improved results of Task 2.1 on investigating and modeling the trade-off between energy and performance of concurrent data structures and algorithms. The work has been conducted on the two main EXCESS platforms: Intel platforms with recent Intel multicore CPUs and the Movidius Myriad1 platform. Regarding white-box methodologies, we have devised new relaxed cache-oblivious models and proposed a new power model for the Myriad1 platform and an energy model for lock-free queues on CPU platforms. For the Myriad1 platform, the improved model now considers both computation and data movement costs as well as architecture and application properties. The model has been evaluated with a set of micro-benchmarks and application benchmarks. For Intel platforms, we have generalized the model for concurrent queues on CPU platforms to offer more flexibility with respect to the workers calling the data structure: the parallel section sizes of enqueuers and dequeuers are now decoupled. Regarding programming abstractions and libraries, we have continued investigating the trade-offs between energy consumption and performance of data structures such as concurrent queues and concurrent search trees, building on the early results of Task 2.1. The preliminary results show that our concurrent trees are faster and more energy-efficient than the state-of-the-art on commodity HPC and embedded platforms.
