pTask: A smart prefetching scheme for OS intensive applications

Instruction prefetching is a standard approach to improving the performance of operating system (OS) intensive workloads such as web servers, file servers, and database servers. Sophisticated instruction prefetching techniques such as PIF [12] and RDIP [17] record the execution history of a program in dedicated hardware structures and use this information to prefetch when a known execution pattern repeats. The storage overheads of these additional hardware structures are prohibitively high (64-200 KB per core), which makes such schemes difficult to deploy in real systems. We propose a solution that tackles this problem with minimal hardware modifications. We observe that the execution of server applications keeps switching between tasks such as the application, system call handlers, and interrupt handlers. Each task has a distinct instruction footprint, and consecutive tasks are separated by a special OS event. We propose a technique that captures the instruction stream in the vicinity of such OS events; the captured information is then compressed significantly and stored in a process's virtual address space. Special OS routines then use this information to prefetch instructions for the OS and the application code. Using modest hardware support (4 registers per core), we report an increase in instruction throughput of 2-14% (mean: 7%) over state-of-the-art instruction prefetching techniques for a suite of 8 popular OS-intensive applications.
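The record-compress-replay idea in the abstract can be illustrated with a minimal software sketch. Nothing below is taken from the paper: the class `FootprintTable`, the `window` parameter, the event identifiers, the 64-byte line size, and the region-bitmask compression are all hypothetical choices made only to show how a per-event instruction footprint might be captured, stored compactly, and later turned into prefetch candidates.

```python
from collections import defaultdict

LINE_BYTES = 64      # assumed cache-line size (not from the paper)
REGION_LINES = 64    # assumed lines per region, so one 64-bit mask per region


def compress(line_addrs):
    """Pack cache-line numbers into {region_base: bitmask} entries."""
    regions = defaultdict(int)
    for line in line_addrs:
        base, offset = divmod(line, REGION_LINES)
        regions[base] |= 1 << offset
    return dict(regions)


def decompress(regions):
    """Expand {region_base: bitmask} back into a sorted list of line numbers."""
    lines = []
    for base, mask in regions.items():
        for offset in range(REGION_LINES):
            if (mask >> offset) & 1:
                lines.append(base * REGION_LINES + offset)
    return sorted(lines)


class FootprintTable:
    """Hypothetical per-process table mapping an OS event to a footprint."""

    def __init__(self, window=4):
        # Number of instruction addresses captured after each OS event.
        self.window = window
        self.table = {}

    def record(self, event_id, pcs):
        """Capture the lines touched by the first `window` PCs after an event."""
        lines = {pc // LINE_BYTES for pc in pcs[: self.window]}
        self.table[event_id] = compress(lines)

    def prefetch_candidates(self, event_id):
        """Return cache-line numbers to prefetch when the event recurs."""
        return decompress(self.table.get(event_id, {}))
```

In this sketch, a handler would call `record` the first time an event such as a particular system call is observed, and `prefetch_candidates` on later occurrences; an event that has never been recorded simply yields no candidates.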

[1]  Michael Stumm,et al.  FlexSC: Flexible System Call Scheduling with Exception-Less System Calls , 2010, OSDI.

[2]  Prathmesh Kallurkar,et al.  Architectural Support for Handling Jitter in Shared Memory Based Parallel Applications , 2014, IEEE Transactions on Parallel and Distributed Systems.

[3]  Babak Falsafi,et al.  SHIFT: Shared history instruction fetch for lean-core server processors , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[4]  Jignesh M. Patel,et al.  Call graph prefetching for database applications , 2003, TOCS.

[5]  Dirk Grunwald,et al.  Prefetching Using Markov Predictors , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[6]  Gary S. Tyson,et al.  Branch history guided instruction prefetching , 2001, Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture.

[7]  Anoop Gupta,et al.  The SPLASH-2 programs: characterization and methodological considerations , 1995, ISCA.

[8]  John L. Henning.  SPEC CPU2006 benchmark descriptions , 2006, CARN.

[9]  Anant Agarwal,et al.  Factored operating systems (fos): the case for a scalable operating system for multicores , 2009, OPSR.

[10]  Thomas F. Wenisch,et al.  RDIP: Return-address-stack Directed Instruction Prefetching , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[11]  David W. Nellans,et al.  Interference Aware Cache Designs for Operating System Execution , 2009.

[12]  Koushik Chakraborty,et al.  Computation spreading: employing hardware migration to specialize CMP cores on-the-fly , 2006, ASPLOS XII.

[13]  Mikko H. Lipasti,et al.  Redeeming IPC as a performance metric for multithreaded programs , 2003, 2003 12th International Conference on Parallel Architectures and Compilation Techniques.

[14]  Anant Agarwal,et al.  Vote the OS off your Core , 2011.

[15]  Nitin Gupta,et al.  TriKon: A hypervisor aware manycore processor , 2014, 2014 21st International Conference on High Performance Computing (HiPC).

[16]  Kai Li,et al.  The PARSEC benchmark suite: Characterization and architectural implications , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[17]  Prathmesh Kallurkar,et al.  Tejas: A java based versatile micro-architectural simulator , 2015, 2015 25th International Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS).

[18]  Thomas F. Wenisch,et al.  Temporal instruction fetch streaming , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[19]  Kathryn S. McKinley,et al.  Guided region prefetching: a cooperative hardware/software approach , 2003, ISCA '03.

[20]  James E. Smith,et al.  Data Cache Prefetching Using a Global History Buffer , 2005, IEEE Micro.

[21]  Anastasia Ailamaki,et al.  SLICC: Self-Assembly of Instruction Cache Collectives for OLTP Workloads , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.

[22]  Babak Falsafi,et al.  Proactive instruction fetch , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[23]  Scott Pakin,et al.  The Case of the Missing Supercomputer Performance: Achieving Optimal Performance on the 8,192 Processors of ASCI Q , 2003, SC.

[24]  Yang Zhang,et al.  Corey: An Operating System for Many Cores , 2008, OSDI.

[25]  Prathmesh Kallurkar,et al.  Tejas Simulator: Validation against Hardware , 2015, ArXiv.

[26]  Fabrice Bellard,et al.  QEMU, a Fast and Portable Dynamic Translator , 2005, USENIX ATC, FREENIX Track.

[27]  Todd C. Mowry,et al.  Cooperative prefetching: compiler and hardware support for effective instruction prefetching in modern processors , 1998, Proceedings. 31st Annual ACM/IEEE International Symposium on Microarchitecture.