Improving I/O Performance of Applications through Compiler-Directed Code Restructuring

The ever-increasing complexity of large-scale applications, together with the continuous growth in the sizes of the data they process, makes maximizing the performance of such applications very challenging. In particular, many demanding applications from the domains of astrophysics, medicine, biology, computational chemistry, and materials science are extremely data intensive. Such applications typically use a disk system to store and later retrieve their large data sets, and consequently, their disk performance is a critical concern. Unfortunately, while disk density has improved significantly over the last couple of decades, disk access latencies have not. As a result, I/O is increasingly becoming a bottleneck for data-intensive applications and must be addressed at the software level if we want to extract maximum performance from modern computer architectures. This paper presents a compiler-directed code restructuring scheme for improving the I/O performance of data-intensive scientific applications. The proposed approach improves I/O performance by reducing the number of disk accesses through a new concept called disk reuse maximization. In this context, disk reuse refers to reusing the data on a given set of disks as much as possible before moving to other disks. Our compiler-based approach restructures application code, with the help of a polyhedral tool, such that disk reuse is maximized to the extent allowed by the intrinsic data dependencies in the application code. The proposed optimization can be applied to each loop nest individually or to the entire application code. The experiments show that the average I/O improvements brought by the loop-nest-based version of our approach are 9.0% and 2.7% over the original application codes and the codes optimized using conventional schemes, respectively. Further, the average improvements obtained when our approach is applied to the entire application code are 15.0% and 13.5% over the original application codes and the codes optimized using conventional schemes, respectively. This paper also discusses how careful file layout selection helps to improve our performance gains, and how our proposed approach can be extended to work with parallel applications.
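The intuition behind disk reuse maximization can be illustrated with a toy sketch. The sketch below is not the paper's actual compiler transformation (which operates on polyhedral iteration spaces subject to data dependencies); it merely shows, under assumed names (`NUM_DISKS`, `block_to_disk`, round-robin striping), how reordering loop iterations so that accesses to blocks on the same disk are clustered reduces the number of disk switches:

```python
# Toy illustration of disk reuse maximization (illustrative only, not the
# paper's algorithm): iterations that each touch one data block are
# reordered so that accesses to blocks residing on the same disk are
# clustered, minimizing transitions between disks.

NUM_DISKS = 4
NUM_BLOCKS = 16

def block_to_disk(block):
    # Assumed layout: round-robin striping of blocks across disks.
    return block % NUM_DISKS

def disk_switches(schedule):
    # Count how often consecutive iterations move to a different disk.
    disks = [block_to_disk(b) for b in schedule]
    return sum(1 for a, b in zip(disks, disks[1:]) if a != b)

# Original iteration order: blocks visited sequentially, so striping
# forces a disk switch on nearly every access.
original = list(range(NUM_BLOCKS))

# Restructured order: iterations grouped by the disk their block lives
# on, mimicking the effect of the loop transformation (legal only when
# data dependencies permit the reordering).
restructured = sorted(original, key=block_to_disk)

print(disk_switches(original))      # 15 switches
print(disk_switches(restructured))  # 3 switches
```

In the real scheme, the legality of such a reordering is checked against the loop nest's data dependencies, and the mapping from array regions to disks comes from the file layout rather than a fixed modulo function.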
