SchedTask: A Hardware-Assisted Task Scheduler

Workloads such as web servers and database servers typically switch back and forth between different tasks, such as user applications, system call handlers, and interrupt handlers. The combined instruction footprint of these tasks typically exceeds the capacity of the i-cache (16-32 KB). This causes many i-cache misses and thereby reduces the application's performance. Hence, we propose SchedTask, a hardware-assisted task scheduler that improves the performance of such workloads by executing tasks with similar instruction footprints on the same core. We start by decomposing the combined execution of the OS and the applications into sequences of instructions called SuperFunctions. We propose a scheme that uses Bloom filters to determine the amount of overlap between the instruction footprints of different SuperFunctions. We then use a hierarchical scheduler to execute SuperFunctions with similar instruction footprints on the same core. For a suite of 8 popular OS-intensive workloads, we report an increase in the application's performance of up to 29 percentage points (mean: 11.4 percentage points) over state-of-the-art scheduling techniques.

CCS CONCEPTS • Software and its engineering → Scheduling; Virtual memory; • Computer systems organization → Multicore architectures; Cloud computing;
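The Bloom-filter overlap estimate mentioned in the abstract can be illustrated with a minimal software sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the `BloomFilter` class, the 1024-bit filter size, the three hash functions, the 64-byte cache line, and the `overlap_score` metric (shared set bits divided by the set bits of the smaller filter). The actual SchedTask mechanism is implemented in hardware and may differ in all of these choices.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter; sizes are assumptions, not the paper's."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int serves as the bit vector

    def _positions(self, item):
        # Derive num_hashes bit positions from salted BLAKE2b digests.
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(
                f"{i}:{item}".encode(), digest_size=8
            ).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def popcount(self):
        return bin(self.bits).count("1")


def footprint(addresses, line_size=64):
    """Summarize instruction addresses as a filter over cache-line numbers."""
    bf = BloomFilter()
    for addr in addresses:
        bf.add(addr // line_size)
    return bf


def overlap_score(a, b):
    """Hypothetical similarity metric: shared set bits over the smaller filter.

    Returns a value in [0, 1]; false positives can inflate it slightly.
    """
    smaller = min(a.popcount(), b.popcount())
    if smaller == 0:
        return 0.0
    return bin(a.bits & b.bits).count("1") / smaller


# Hypothetical SuperFunction footprints (instruction address ranges).
syscall_read = footprint(range(0x1000, 0x3000, 4))
syscall_write = footprint(range(0x2000, 0x4000, 4))
print(overlap_score(syscall_read, syscall_write))
```

A scheduler could compare such scores pairwise and place the SuperFunctions with the highest mutual score on the same core. That grouping policy is likewise only a sketch of the idea, not the hierarchical scheduler the paper describes.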
