On Efficient Data Exchange in Multicore Architectures

In contemporary multicore architectures, three trends can be observed: (i) a growing number of cores, (ii) shared memory as the primary means of communication and data exchange, and (iii) high diversity between platform architectures. Still, these platforms are typically programmed manually on a core-by-core basis; the most helpful widely accepted tools are library implementations of frequently used algorithms. This already complicated task of multicore programming will grow further in complexity as the number of cores increases. In addition, the constant change in architecture designs, and thus in platform-specific programming demands, will continue to make it laborious to migrate existing code to new platforms.

State-of-the-art methods of automatic multicore code generation only partially meet the requirements of modern multicore platforms. They typically incur a high overhead per thread, whereas growing numbers of cores, and thus shrinking thread granularities, demand the opposite. They also typically implement data exchange with message passing models, although memory sharing should be the natural mode of data exchange. As a result, they often fail to produce efficient code, especially when large data throughput is required.

This thesis proposes a data-oriented approach to multicore programming. It shows how dividing a program into discrete tasks with clearly specified inputs and outputs helps to formalise the problem of optimising high data throughput applications for a large range of multicore architectures, while at the same time enabling an efficient, low-overhead implementation. In detail, its contributions are as follows.

• Inefficiencies in existing programming models are demonstrated for the cases of the CAL actor language and Kahn process networks. Methods are shown to reduce these inefficiencies.

• Ladybirds, a specification model and language for parallel programs, is presented. A Ladybirds program consists of tasks with clearly defined inputs and outputs and of dependencies between them. It is explained how Ladybirds aims at execution efficiency also in the domains of data placement and transport, and what steps are necessary to get from a Ladybirds specification to executable program code. The examples of comfortable debugging and of minimising state retention overhead for transient systems underline the usability and versatility of Ladybirds.

• An optimisation method for Ladybirds programs on the Kalray MPPA platform is presented. It tries to place data on different memory banks so as to avoid access conflicts. Afterwards, the Ladybirds optimisation problem for
