Design space exploration for efficient data intensive computing on SoCs

Finding efficient implementations of data-intensive applications, such as radar/sonar signal and image processing, on a system-on-chip (SoC) is very challenging because of the growing complexity and performance requirements of such applications. A major issue is optimizing the data transfer and storage micro-architecture. In this chapter, we propose a comprehensive method for exploring the mapping of high-level application representations onto a customizable hardware accelerator. The high-level representation is written in a language named Array-OL. The customizable architecture uses FIFO queues and a double-buffering mechanism to mask the latency of data transfers and external memory accesses. A high-level representation is mapped onto a given architecture by applying loop transformations in Array-OL. A method based on integer partitions reduces the space of explored solutions. Our proposal aims to facilitate the inference of adequate hardware realizations for data-intensive applications. It is illustrated with a case study that implements a hydrophone monitoring application.
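
The abstract does not say how integer partitions are applied, so the following is only a generic sketch of why they can shrink a search space, not the chapter's actual procedure. It assumes a hypothetical setting in which n units of work are split into groups and only the group sizes matter, not their order. The function name `integer_partitions` and the parameter `max_part` are illustrative.

```python
def integer_partitions(n, max_part=None):
    """Yield each partition of n as a non-increasing tuple of positive integers.

    Hypothetical helper for illustration; not taken from the chapter.
    """
    if max_part is None or max_part > n:
        max_part = n
    if n == 0:
        # The empty tuple is the single partition of zero.
        yield ()
        return
    # Choose the largest part first, then partition the remainder
    # using parts no larger than the one just chosen.
    for first in range(max_part, 0, -1):
        for rest in integer_partitions(n - first, first):
            yield (first,) + rest


if __name__ == "__main__":
    for p in integer_partitions(4):
        print(p)
    # Unordered splits of 10 versus ordered splits (compositions) of 10.
    print(len(list(integer_partitions(10))), 2 ** (10 - 1))
```

Under this assumption, 10 units admit 42 unordered splits against 512 ordered ones, which is the kind of reduction that treating order-equivalent candidates as a single point can give.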
