Selecting linear algebra kernel composition using response time prediction

Numerical linear algebra libraries provide many kernels that can be composed to perform complex computations. For a given computation, there are typically many functionally equivalent kernel compositions, and some achieve better response times than others for particular input data and on particular computer architectures. Previous research provides methods to enumerate (a subset of) these kernel compositions. In this work, we study the problem of determining the composition that yields the lowest response time. Our approach is based on a response time prediction for each candidate composition. While such predictions could in principle be obtained using analytical and/or empirical performance models, developing accurate models of this kind is known to be challenging. Instead, we define a feature space that captures salient properties of kernel compositions and predict response time using supervised machine learning. We experiment with a standard set of machine learning algorithms and identify one that is effective for our kernel composition selection problem. Using this algorithm, our approach substantially outperforms the baseline strategy of always using the simplest kernel composition and often comes close to the fastest composition among those evaluated. We also quantify the potential benefit of implementing our approach as part of an interactive computational tool and find that, although this benefit is substantial, a limiting factor is the overhead of enumerating kernel compositions.
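To make the approach concrete, below is a minimal, self-contained sketch of selection by response time prediction, written in Python with scikit-learn's RandomForestRegressor standing in for the machine learning algorithms the paper evaluates. The feature map, the synthetic "measured" training data, and the candidate set are all illustrative assumptions, not the paper's actual feature space or benchmark measurements.

```python
# Minimal sketch of selection by response time prediction. The feature map,
# the synthetic "measured" data, and the candidate set are illustrative
# assumptions, not the paper's actual feature space or benchmark results.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def features(composition):
    """Map a composition (list of (kernel, flops, bytes) triples) to salient
    properties: kernel count, total flops, data volume, arithmetic intensity."""
    flops = sum(f for _, f, _ in composition)
    data = sum(b for _, _, b in composition)
    return [len(composition), flops, data, flops / max(data, 1)]

# Synthetic stand-in for a training set of previously timed compositions.
rng = np.random.default_rng(0)
kernels = ["gemm", "trsm", "syrk"]
train = [[(rng.choice(kernels), rng.integers(10**6, 10**9), rng.integers(10**4, 10**7))
          for _ in range(rng.integers(1, 5))] for _ in range(200)]
X = np.array([features(c) for c in train])
# Fake response times: flop- and memory-proportional terms plus noise.
y = X[:, 1] * 1e-9 + X[:, 2] * 1e-8 + rng.normal(0.0, 1e-3, len(train))

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Selection: featurize every enumerated candidate, predict its response
# time, and pick the composition with the lowest prediction.
candidates = train[:5]  # stand-in for an enumerated candidate set
pred = model.predict(np.array([features(c) for c in candidates]))
best = candidates[int(np.argmin(pred))]
print(f"selected a {len(best)}-kernel composition, "
      f"predicted time {pred.min():.3f}s")
```

In this setup the regressor is trained offline on timed executions; at selection time each enumerated candidate only needs to be featurized and scored, so the per-query cost is dominated by enumeration rather than by timing every composition, consistent with the enumeration overhead the paper identifies as the limiting factor.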
