A Pragmatic Approach for Predicting the Scalability of Parallel Applications

Predicting the scalability of parallel applications is becoming crucial now that the number of cores in modern CPUs doubles roughly every two years. Traditional approaches to understanding the scalability of a parallel application rely on either extensive experiments or detailed application models. Both are time-consuming and often hard to apply. This paper presents PreSca, a pragmatic system for predicting the scalability of parallel applications. PreSca uses function approximation techniques to model scalability with an analytical performance function extracted from a set of measurements. Because it treats the application as a black box and requires no knowledge of its internals, PreSca can be applied with little effort to any parallel application. We show how PreSca can be used statically to predict the scalability of a given application and to decide which synchronization primitive scales best for it, and how it can be used on-line to dynamically assist scheduling decisions and adjust core assignment. PreSca shows, arguably for the first time, how function approximation can be used to predict the scalability of parallel applications in a completely general way. We extensively evaluated PreSca on a large number of parallel benchmarks, some based on locks and some on transactional memory, and on two different multicore systems. Our evaluation shows that PreSca produces accurate results. More specifically: (1) PreSca's interpolations based on only 8 measurements have a 90th-percentile error below 15%, (2) PreSca's extrapolations from measurements on up to m cores predict the performance on n <= 2m cores with errors below 20% in most cases, and (3) PreSca's on-line scheduler determines the optimal thread count using fewer than 7 measurements, with errors below 3% on average.
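The general idea of fitting an analytical performance function to a handful of black-box measurements can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not PreSca's actual method: the model form t(n) = a + b/n + c*n (a serial term, a parallel term, and a contention term), the helper names, and the synthetic timings are all hypothetical. The sketch fits the model by ordinary least squares on 8 measurements taken on up to 16 cores, then extrapolates to 32 cores and picks the thread count with the lowest predicted time.

```python
# Illustrative sketch only: the model form, function names, and data are
# assumptions, not the functions or measurements used by PreSca.

def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def basis(n):
    """Basis functions of the assumed model t(n) = a + b/n + c*n."""
    return [1.0, 1.0 / n, float(n)]


def fit_time_model(cores, times):
    """Least-squares fit of (a, b, c) via the normal equations."""
    rows = [basis(n) for n in cores]
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    atb = [sum(r[i] * t for r, t in zip(rows, times)) for i in range(3)]
    return solve_linear(ata, atb)


def predict_time(coeffs, n):
    """Predicted execution time on n cores under the fitted model."""
    return sum(c * phi for c, phi in zip(coeffs, basis(n)))


def best_thread_count(coeffs, max_cores):
    """Thread count in 1..max_cores with the lowest predicted time."""
    return min(range(1, max_cores + 1), key=lambda n: predict_time(coeffs, n))


# Synthetic measurements generated from a = 2, b = 96, c = 0.5 (hypothetical).
cores = [1, 2, 3, 4, 6, 8, 12, 16]
times = [2.0 + 96.0 / n + 0.5 * n for n in cores]

coeffs = fit_time_model(cores, times)
predicted_32 = predict_time(coeffs, 32)   # extrapolation to n = 2m cores
optimal = best_thread_count(coeffs, 32)   # thread count minimizing predicted time
```

Because the synthetic timings are generated from the same model family that is fitted, the fit is exact up to rounding. Real measurements are noisy and may follow a different shape, in which case the choice of candidate functions matters far more than this sketch suggests.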
