ESTIMA: Extrapolating ScalabiliTy of In-Memory Applications

This article presents estima , an easy-to-use tool for extrapolating the scalability of in-memory applications. estima is designed to perform a simple yet important task: Given the performance of an application on a small machine with a handful of cores, estima extrapolates its scalability to a larger machine with more cores, while requiring minimum input from the user. The key idea underlying estima is the use of stalled cycles (e.g., cycles that the processor spends waiting for missed cache line fetches or busy locks). estima measures stalled cycles on a few cores and extrapolates them to more cores, estimating the amount of waiting in the system. estima can be effectively used to predict the scalability of in-memory applications for bigger execution machines. For instance, using measurements of memcached and SQLite on a desktop machine, we obtain accurate predictions of their scalability on a server. Our extensive evaluation shows the effectiveness of estima on a large number of in-memory benchmarks.

[1]  Hyeontaek Lim,et al.  MICA: A Holistic Approach to Fast In-Memory Key-Value Storage , 2014, NSDI.

[2]  Maurice Herlihy,et al.  Transactional Memory: Architectural Support For Lock-free Data Structures , 1993, Proceedings of the 20th Annual International Symposium on Computer Architecture.

[3]  Donald E. Porter,et al.  Understanding transactional memory performance , 2010, 2010 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS).

[4]  Ahmad Yasin,et al.  A Top-Down method for performance analysis and counters architecture , 2014, 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[5]  Martin Schulz,et al.  A regression-based approach to scalability prediction , 2008, ICS '08.

[6]  Adolfy Hoisie,et al.  Performance and Scalability Analysis of Teraflop-Scale Parallel Architectures Using Multidimensional Wavefront Applications , 2000, Int. J. High Perform. Comput. Appl..

[7]  Kai Li,et al.  The PARSEC benchmark suite: Characterization and architectural implications , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[8]  J. Cooper,et al.  Theory of Approximation , 1960, Mathematical Gazette.

[9]  Laura Carrington,et al.  A performance prediction framework for scientific applications , 2003, Future Gener. Comput. Syst..

[10]  Bo Wu,et al.  ScaAnalyzer: a tool to identify memory scalability bottlenecks in parallel programs , 2015, SC15: International Conference for High Performance Computing, Networking, Storage and Analysis.

[11]  Sally A. McKee,et al.  Real time power estimation and thread scheduling via performance counters , 2009, CARN.

[12]  Jim Ruppert,et al.  A Delaunay Refinement Algorithm for Quality 2-Dimensional Mesh Generation , 1995, J. Algorithms.

[13]  Ali-Reza Adl-Tabatabai,et al.  An analytic model of optimistic Software Transactional Memory , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.

[14]  Armin Heindl,et al.  An Analytic Model for Optimistic STM with Lazy Locking , 2009, ASMTA.

[15]  Graham R. Nudd,et al.  Pace—A Toolset for the Performance Prediction of Parallel and Distributed Systems , 2000, Int. J. High Perform. Comput. Appl..

[16]  Fabrizio Petrini,et al.  Predictive Performance and Scalability Modeling of a Large-Scale Application , 2001, ACM/IEEE SC 2001 Conference (SC'01).

[17]  Frank Mueller,et al.  Cross-Platform Performance Prediction of Parallel Applications Using Partial Execution , 2005, ACM/IEEE SC 2005 Conference (SC'05).

[18]  Sally A. McKee,et al.  An Approach to Performance Prediction for Parallel Applications , 2005, Euro-Par.

[19]  Yan Solihin,et al.  Scal-Tool: Pinpointing and Quantifying Scalability Bottlenecks in DSM Multiprocessors , 1999, ACM/IEEE SC 1999 Conference (SC'99).

[20]  Rachid Guerraoui,et al.  Stretching transactional memory , 2009, PLDI '09.

[21]  Wenguang Chen,et al.  PHANTOM: predicting performance of parallel applications on large-scale parallel machines using a single node , 2010, PPoPP '10.

[22]  John M. Mellor-Crummey,et al.  Cross-architecture performance predictions for scientific applications using parameterized models , 2004, SIGMETRICS '04/Performance '04.

[23]  Pradip Bose,et al.  SMT-centric power-aware thread placement in chip multiprocessors , 2013, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques.

[24]  Armin Heindl,et al.  An analytic framework for performance modeling of software transactional memory , 2009, Comput. Networks.

[25]  Nir Shavit,et al.  Software transactional memory , 1995, PODC '95.

[26]  Kunle Olukotun,et al.  STAMP: Stanford Transactional Applications for Multi-Processing , 2008, 2008 IEEE International Symposium on Workload Characterization.

[27]  Tony Tung,et al.  Scaling Memcache at Facebook , 2013, NSDI.

[28]  Craig Freedman,et al.  Hekaton: SQL server's memory-optimized OLTP engine , 2013, SIGMOD '13.

[29]  Marnix Goossens,et al.  Improving the Performance of Signature-Based Network Intrusion Detection Sensors by Multi-threading , 2004, WISA.

[30]  Babak Falsafi,et al.  Clearing the clouds: a study of emerging scale-out workloads on modern hardware , 2012, ASPLOS XVII.

[31]  Thomas J. LeBlanc,et al.  Parallel performance prediction using lost cycles analysis , 1994, Proceedings of Supercomputing '94.

[32]  Sally A. McKee,et al.  Methods of inference and learning for performance modeling of parallel applications , 2007, PPoPP.

[33]  Francisco J. Cazorla,et al.  Power and thermal characterization of POWER6 system , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[34]  Yannis Smaragdakis,et al.  Adaptive Locks: Combining Transactions and Locks for Efficient Concurrency , 2009, 2009 18th International Conference on Parallel Architectures and Compilation Techniques.

[35]  Rachid Guerraoui,et al.  Why STM can be more than a research toy , 2011, Commun. ACM.

[36]  David A. Wood,et al.  IPC Considered Harmful for Multiprocessor Workloads , 2006, IEEE Micro.

[37]  Bin Fan,et al.  MemC3: Compact and Concurrent MemCache with Dumber Caching and Smarter Hashing , 2013, NSDI.

[38]  Sandhya Dwarkadas,et al.  Coherence Stalls or Latency Tolerance: Informed CPU Scheduling for Socket and Core Sharing , 2016, USENIX Annual Technical Conference.

[39]  Nathan Froyd,et al.  Scalability analysis of SPMD codes using expectations , 2007, ICS '07.

[40]  Catherine Mills Olschanowsky,et al.  HPC application address stream compression, replay and scaling , 2011 .

[41]  Raj Jain,et al.  The art of computer systems performance analysis - techniques for experimental design, measurement, simulation, and modeling , 1991, Wiley professional computing.