iHarmonizer: Improving the Disk Efficiency of I/O-intensive Multithreaded Codes

Challenged by serious power and thermal constraints and limited by available instruction-level parallelism, processor designs have evolved toward multi-core architectures. These architectures, many augmented with native simultaneous multithreading, are driving software developers to write multithreaded programs that exploit thread-level parallelism. While multithreading is well known to raise concerns about data dependences and CPU load balance, it is less well known that uncertainty in the relative progress of thread execution can make the pattern of I/O requests issued by different threads effectively random, significantly degrading hard-disk efficiency. This effect can severely offset the performance gains of parallel execution, especially for I/O-intensive programs. Retaining the benefits of multithreading without losing I/O efficiency is therefore an urgent and challenging problem. We propose a user-level scheme, iHarmonizer, to streamline the servicing of I/O requests from multiple threads in OpenMP programs. Specifically, the compiler inserts code into OpenMP programs so that data-usage information is transmitted at run time to a supporting run-time library, which prefetches data in a disk-friendly way and coordinates threads' execution according to the availability of their requested data. Transparent to the programmer, iHarmonizer makes a multithreaded program I/O-efficient while preserving the benefits of parallelism. Our experiments show that iHarmonizer significantly speeds up a representative set of I/O-intensive scientific benchmarks.
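
To make the described mechanism concrete, the sketch below shows the kind of code the compiler might insert into an OpenMP loop: each thread announces the file region it will read next and then waits until the run-time library has staged it, so the prefetcher can fetch regions in disk-friendly order regardless of which thread reaches its iteration first. The calls ihz_announce_region() and ihz_wait_for_data() are hypothetical names used only for illustration; they stand in for the compiler-inserted hints and coordination points, not iHarmonizer's actual API.

```c
/* Minimal sketch (C + OpenMP), assuming hypothetical runtime entry points. */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>

#define CHUNK (1L << 20)            /* bytes of file data per loop iteration */

/* Hypothetical runtime-library stubs: announce future data usage so the
   library can prefetch it in disk-friendly order, then block until ready. */
static void ihz_announce_region(int fd, off_t off, size_t len) { (void)fd; (void)off; (void)len; }
static void ihz_wait_for_data(int fd, off_t off, size_t len)   { (void)fd; (void)off; (void)len; }

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    off_t size = lseek(fd, 0, SEEK_END);

    long long checksum = 0;
    #pragma omp parallel for schedule(static) reduction(+:checksum)
    for (off_t off = 0; off < size; off += CHUNK) {
        size_t len = (size_t)((size - off < CHUNK) ? (size - off) : CHUNK);
        unsigned char *buf = malloc(len);

        ihz_announce_region(fd, off, len);   /* compiler-inserted data-usage hint   */
        ihz_wait_for_data(fd, off, len);     /* compiler-inserted coordination point */

        ssize_t got = pread(fd, buf, len, off);  /* ideally served from the cache */
        for (ssize_t i = 0; i < got; i++)
            checksum += buf[i];
        free(buf);
    }
    printf("checksum = %lld\n", checksum);
    close(fd);
    return 0;
}
```

In this sketch the announce/wait pair is what keeps threads' progress aligned with the prefetcher's sequential sweep over the file; without it, statically scheduled threads would issue reads at widely separated offsets and the disk would see an effectively random request stream.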
