Architecture Exploration for Efficient Data Transfer and Storage in Data-Parallel Applications

Because modern data-parallel applications, such as image processing, are complex, automatic approaches for inferring suitable and efficient hardware realizations are increasingly required. The optimization of the data transfer and storage micro-architecture typically plays a key role in exploiting data parallelism. In this paper, we propose a comprehensive method for exploring the mapping of a high-level representation of an application onto a customizable hardware accelerator. The high-level representation is expressed in a language called Array-OL. The customizable architecture uses FIFO queues and a double-buffering mechanism to mask the latency of data transfers and external memory accesses. The mapping of the high-level representation onto this architecture is performed by applying a set of loop transformations in Array-OL. A method based on integer partitions is used to reduce the space of explored solutions.
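
The abstract does not detail how integer partitions are used in the exploration, so the following is only a minimal sketch of the underlying concept, not the method of the paper. It enumerates the integer partitions of a number n, that is, the ways of writing n as an unordered sum of positive integers. The function name `integer_partitions` and its `max_part` parameter are hypothetical and chosen for illustration.

```python
from typing import Iterator, List, Optional


def integer_partitions(n: int, max_part: Optional[int] = None) -> Iterator[List[int]]:
    """Yield every partition of n as a non-increasing list of positive integers.

    Illustrative helper only (hypothetical name); it is not taken from the paper.
    """
    if max_part is None:
        max_part = n
    if n == 0:
        # The empty sum is the single partition of zero.
        yield []
        return
    # Pick the largest part first, then partition the remainder with
    # parts no larger than it, so each unordered sum appears exactly once.
    for first in range(min(n, max_part), 0, -1):
        for rest in integer_partitions(n - first, first):
            yield [first] + rest


if __name__ == "__main__":
    for p in integer_partitions(4):
        print(p)
```

Because the order of the parts is ignored, there are far fewer partitions than ordered decompositions (compositions) of the same number: 4 has 5 partitions but 8 compositions. This illustrates, in general terms, why reasoning over partitions can shrink a search space.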
