An Exploration into the Effectiveness of Prefetching on Program Performance with the Help of an Autotuning Model

This thesis examines how hardware prefetching affects the performance of a collection of programs and how learning algorithms can predict the combination of hardware prefetchers that gives the best performance. Modern processors are equipped with several hardware prefetchers, each of which implements a different prefetching algorithm. My goal was to select the best combination of these prefetchers for a given program, because no single combination yields the best performance across all programs. Learning models that make predictions from program behavior depend on effective program characterization. This thesis uses hardware performance events together with a pruning algorithm to create a concise and expressive feature set, which is then used in three different learning models. These steps are tied together in an autotuning framework that can, on average, achieve up to 96% of the speedup attainable by varying the combination of prefetchers in effect. The framework is built with open source tools and frameworks, which makes it easy to use, extend, and port to other architectures.
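
The pipeline described above (characterize a program with hardware event counts, prune the features, then predict a prefetcher combination) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the thesis: the event names, the sample values, the variance-based pruning rule, the encoding of a prefetcher combination as a 4-bit mask, and the nearest-neighbor predictor, which stands in for the three learning models that the abstract does not name.

```python
from statistics import pvariance

# Hypothetical event names; a real feature set would come from the
# hardware performance events available on the target processor.
EVENTS = ["l2_miss", "l1d_miss", "branch_miss", "const_event"]

# Hypothetical training data: (event rates, best prefetcher bitmask).
# Each bit of the mask is assumed to enable one of four prefetchers.
TRAIN = [
    ({"l2_miss": 12.0, "l1d_miss": 40.0, "branch_miss": 1.0, "const_event": 5.0}, 0b1111),
    ({"l2_miss": 11.0, "l1d_miss": 38.0, "branch_miss": 6.0, "const_event": 5.0}, 0b1111),
    ({"l2_miss": 1.0, "l1d_miss": 5.0, "branch_miss": 5.0, "const_event": 5.0}, 0b0000),
    ({"l2_miss": 2.0, "l1d_miss": 6.0, "branch_miss": 2.0, "const_event": 5.0}, 0b0011),
]


def prune_features(train, events, min_variance=1e-9):
    """Keep only events whose value varies across the training programs.

    This variance filter is a stand-in for the thesis's pruning algorithm,
    whose details are not given in the abstract.
    """
    return [
        e for e in events
        if pvariance([sample[e] for sample, _ in train]) > min_variance
    ]


def predict_combination(train, features, query):
    """Return the bitmask of the training program closest to `query`
    (1-nearest neighbor, squared Euclidean distance over `features`)."""
    def distance(row):
        sample, _ = row
        return sum((sample[f] - query[f]) ** 2 for f in features)

    _, best_mask = min(train, key=distance)
    return best_mask


if __name__ == "__main__":
    features = prune_features(TRAIN, EVENTS)
    unseen = {"l2_miss": 10.0, "l1d_miss": 39.0, "branch_miss": 2.0, "const_event": 5.0}
    mask = predict_combination(TRAIN, features, unseen)
    print("features kept:", features)
    print("predicted prefetcher mask: {:04b}".format(mask))
```

In this sketch the constant event is pruned because it cannot separate one program from another, and the unseen program inherits the prefetcher setting of its most similar training program. A real framework would also normalize the event counts and validate its predictions against measured run times.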
