DF-DTM: Dynamic Task Memoization and Reuse in Dataflow

Instruction reuse is a technique adopted in von Neumann architectures that improves performance by avoiding redundant execution: when an instruction's result can be obtained by looking up its inputs in an input/output memoization table, execution is skipped. Trace reuse applies the same idea to sequences of instructions. These techniques, however, have yet to be studied in the context of the dataflow model, which has been gaining traction in the high-performance computing community due to its inherent parallelism. Dataflow programs are represented by directed graphs in which nodes are instructions or tasks and edges denote data dependencies between tasks. This work presents Dataflow Dynamic Task Memoization (DF-DTM), a technique that enables the reuse of both nodes and subgraphs in dataflow, analogous to instructions and traces, respectively. The potential of DF-DTM is evaluated through a series of experiments that analyze the behavior of redundant tasks in five relevant benchmarks, in which up to 99.70% of the instantiated tasks could be reused. Moreover, this paper evaluates how reuse rates are affected by limiting subgraph size, memoization table size, task granularity, and problem size, showing that DF-DTM can yield good reuse rates in more realistic environments.
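The core mechanism the abstract describes, consulting an input/output memoization table before firing a task and reusing the stored result on a match, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the bounded LRU table, the `fire` helper, and keying tasks by name plus input tuple are all assumptions made for the example.

```python
from collections import OrderedDict


class MemoTable:
    """Bounded input/output memoization table with LRU eviction
    (illustrative; DF-DTM's actual table organization may differ)."""

    def __init__(self, capacity=1024):
        self.capacity = capacity
        self.table = OrderedDict()  # (task, inputs) -> outputs

    def lookup(self, task_name, inputs):
        key = (task_name, inputs)
        if key in self.table:
            self.table.move_to_end(key)  # mark as recently used
            return True, self.table[key]
        return False, None

    def store(self, task_name, inputs, outputs):
        self.table[(task_name, inputs)] = outputs
        if len(self.table) > self.capacity:
            self.table.popitem(last=False)  # evict least recently used


def fire(task, inputs, memo):
    """Fire a dataflow node: reuse a memoized result when the same
    inputs have been seen; otherwise execute and record the result."""
    hit, out = memo.lookup(task.__name__, inputs)
    if hit:
        return out  # redundant task: execution skipped
    out = task(*inputs)
    memo.store(task.__name__, inputs, out)
    return out


def add(a, b):
    return a + b


memo = MemoTable(capacity=64)
fire(add, (2, 3), memo)  # miss: add() executes and the result is stored
fire(add, (2, 3), memo)  # hit: result reused, add() is not re-executed
```

Reuse of whole subgraphs works analogously: once every node of a subgraph has matched, the subgraph's aggregate outputs can be produced from the table in a single lookup instead of firing each node.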
