Performance Metrics for Embedded Parallel Pipelines

A statistical approach to performance prediction is applied to a system development methodology for pipelines composed of independent parallel stages. The methodology is aimed at distributed-memory machines employing medium-grained parallelization, and the target applications are continuous-flow embedded systems. The use of order statistics on this type of system is compared with previous practical usage, which appears largely confined to loop parallelization on traditional Non-Uniform Memory Access (NUMA) machines. A range of suitable performance metrics, which give upper bounds or estimates for task durations, is discussed. When included in prediction equations, the metrics have a practical role in checking fidelity to an application performance specification. An empirical study applies the mathematical findings to the performance of a multicomputer for a synchronous pipeline stage, and the results of a simulation are given for larger numbers of processors. In a further simulation, the results are extended to take account of waiting-time distributions while data are buffered between stages of an asynchronous pipeline. Order statistics are also employed to estimate the degradation due to an output ordering constraint. Practical illustrations from the image communication and vision application domains are included.
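
As a rough illustration of how an order-statistics metric can bound a task duration, the sketch below assumes a synchronous stage that finishes only when the slowest of n independent workers finishes, so the stage time is the maximum of n task durations. It uses the classical distribution-free bound on the expected maximum of n independent, identically distributed variables, mean + std * (n - 1) / sqrt(2n - 1), and checks it against exponentially distributed durations. The exponential distribution, the function names, and the parameter values are all assumptions made for the example; the abstract does not state which metrics or distributions the paper actually uses.

```python
import math
import random


def max_bound(mean, std, n):
    """Distribution-free upper bound on E[max of n i.i.d. task durations].

    Classical bound: mean + std * (n - 1) / sqrt(2n - 1).
    """
    return mean + std * (n - 1) / math.sqrt(2 * n - 1)


def harmonic(n):
    """Exact E[max] of n i.i.d. exponential(rate=1) durations: H_n."""
    return sum(1.0 / k for k in range(1, n + 1))


def simulated_mean_max(n, trials=20000, rate=1.0, seed=0):
    """Monte Carlo estimate of the mean stage time (slowest of n workers).

    Assumes exponential task durations; this is an illustrative choice.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.expovariate(rate) for _ in range(n))
    return total / trials


if __name__ == "__main__":
    # Exponential with rate 1 has mean 1 and standard deviation 1.
    for n in (2, 4, 8, 16):
        print(n, simulated_mean_max(n), harmonic(n), max_bound(1.0, 1.0, n))
```

For each n, the simulated mean stage time should sit close to the exact value and below the bound. A bound of this kind needs only the mean and standard deviation of a single task's duration, which is why it is convenient for checking a design against a timing specification before the full distribution is known.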
