Pattern-Oriented Application Frameworks for Domain Experts to Effectively Utilize Highly Parallel Manycore Microprocessors

Manycore microprocessors are powerful computing engines architected to exploit parallelism, turning continued improvements in semiconductor manufacturing processes into computational throughput. Yet the performance of software applications running on these microprocessors is highly sensitive to factors such as data layout, data placement, and synchronization. These factors are not usually part of an application domain expert's daily concerns as they look to utilize the compute capabilities of manycore microprocessors for their applications, but failure to address them carefully could mean an order-of-magnitude degradation in application execution latency and/or throughput. With the proliferation of manycore microprocessors from servers to laptops and portable devices, there is increasing demand for the productive development of computationally efficient business and consumer applications in a wide range of usage scenarios. The sensitivity of execution speed to software architecture and programming techniques can impede the adoption of manycore microprocessors and slow the momentum of the semiconductor industry. This thesis discusses how we can empower application domain experts with pattern-oriented application frameworks, which allow them to effectively utilize the capabilities of highly parallel manycore microprocessors and productively develop efficient parallel software applications. Our pattern-oriented application framework includes an application context for outlining application characteristics, a software architecture for describing the application concurrency exploited in the framework, a reference implementation as a sample design, and a set of extension points for flexible customization.
We studied the process of accelerating applications in the fields of machine learning and computational finance, specifically looking at automatic speech recognition (ASR), financial market value-at-risk (VaR) estimation, and financial potential future exposure (PFE). We present a pattern-oriented application framework for ASR, as well as efficient reference implementations of VaR and PFE. For the ASR framework, we demonstrate its construction and two separate deployments, one of which flexibly extends the framework to enable lip-reading in high-noise recognition environments. The framework enabled a Matlab/Java programmer to effectively utilize a manycore microprocessor and achieve a 20x speedup in recognition throughput compared to a sequential CPU implementation. Our pattern-oriented application framework provides an approach for crystallizing and transferring the often-tacit knowledge of effective parallel programming techniques while allowing for flexible adaptation to various application usage scenarios. We believe that the pattern-oriented application framework will be an essential tool for the effective utilization of manycore microprocessors by application domain experts.
