On the exploitation of loop-level parallelism in embedded applications

Advances in silicon technology have enabled increasing support for hardware parallelism in embedded processors. Vector units, multiple processors/cores, multithreading, special-purpose accelerators such as DSPs or cryptographic engines, or a combination of the above now appear in a number of processors, addressing the increasing performance requirements of modern embedded applications. The extent to which the available hardware parallelism can be exploited depends directly on the amount of parallelism inherent in the given application and on the congruence between the granularity of the hardware and application parallelism. This paper discusses how loop-level parallelism in embedded applications can be exploited in hardware and software. Specifically, it evaluates the efficacy of automatic loop parallelization and the performance potential of different types of parallelism, viz., true thread-level parallelism (TLP), speculative thread-level parallelism, and vector parallelism, when executing loops. Additionally, it discusses the interaction between parallelization and vectorization. Applications from the industry-standard EEMBC®1 1.1 and EEMBC 2.0 suites and the academic MiBench embedded benchmark suite are analyzed using the Intel®2 C compiler. The results show the performance that can be achieved today on real hardware with a production compiler, provide upper bounds on the performance potential of the different types of thread-level parallelism, and point out a number of issues that must be addressed to improve performance, including parallelization of libraries such as libc and the design of parallel algorithms that allow maximal exploitation of parallelism. The results also point to the need for new benchmark suites better suited to parallel compilation and execution.

1 Other names and brands may be claimed as the property of others.
2 Intel is a trademark of Intel Corporation or its subsidiaries in the United States and other countries.
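To make the kinds of loop parallelism discussed above concrete, the following C sketch shows the canonical case a parallelizing and vectorizing compiler targets: a loop with no cross-iteration dependences. This is an illustrative example, not code from the paper; the OpenMP pragma is a standard way to express the combination of thread-level and vector parallelism that the abstract contrasts, and a compiler such as the Intel C compiler can often derive the same parallelism automatically for a loop of this shape.

```c
#include <assert.h>
#include <stddef.h>

/* Each iteration writes a distinct y[i] and reads only x[i] and y[i],
 * so there are no cross-iteration dependences. A compiler can therefore
 * split the iterations across threads (thread-level parallelism) and
 * process several elements per instruction with vector units, or both.
 * If OpenMP is disabled, the pragma is ignored and the loop runs serially. */
void saxpy(size_t n, float a, const float *x, float *y)
{
#pragma omp parallel for simd
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

A loop with a cross-iteration dependence (e.g., `y[i] = y[i-1] + x[i]`) would instead require the speculative thread-level parallelism the paper evaluates, since neither static parallelization nor vectorization can prove the iterations independent.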
