Runtime and Programming Support for Memory Adaptation in Scientific Applications via Local Disk and Remote Memory

The ever-increasing memory demands of many scientific applications and the complexity of today's shared computational resources still require the occasional use of virtual memory, network memory, or even out-of-core implementations, with well-known drawbacks in performance and usability. In Mills et al. (Adapting to memory pressure from within scientific applications on multiprogrammed COWs. In: International Parallel and Distributed Processing Symposium, IPDPS, Santa Fe, NM, 2004), we introduced a basic framework for a runtime, user-level library, MMlib, in which DRAM is treated as a dynamic-size cache for large memory objects residing on local disk. Application developers can specify and access these objects through MMlib, enabling their applications to execute optimally under variable memory availability, using as much DRAM as fluctuating memory levels allow. In this paper, we first extend our earlier MMlib prototype from a proof of concept to a usable, robust, and flexible library. We present a general framework that enables fully customizable memory malleability in a wide variety of scientific applications. We provide several necessary enhancements to the environment-sensing capabilities of MMlib, and introduce a remote memory capability based on MPI communication of cached memory blocks between 'compute nodes' and designated memory servers. The increasing speed of interconnection networks makes a remote memory approach attractive, especially at the large granularity present in large scientific applications. We show experimental results from three important scientific applications that require the general MMlib framework. The memory-adaptive versions perform nearly optimally under constant memory pressure and execute harmoniously with other applications competing for memory, without thrashing the memory system. Under constant memory pressure, we observe execution time improvements by factors of three to five over relying solely on the virtual memory system. With remote memory employed, these factors are even larger and significantly better than those of other, system-level remote memory implementations.
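To make the remote-memory mechanism concrete, the following is a minimal sketch, not MMlib's actual interface: every name in it (block_put, block_get, memory_server, the TAG_* message tags, and the block and store sizes) is a hypothetical placeholder. It only illustrates the general idea stated above: whole cached memory blocks are shipped with standard point-to-point MPI calls between compute ranks and a designated memory-server rank; the same eviction path could instead write blocks to local disk.

/*
 * Minimal sketch of the remote-memory idea described in the abstract.
 * This is NOT the MMlib API: block_put, block_get, memory_server, the
 * TAG_* values, and the sizes below are hypothetical placeholders
 * chosen only to illustrate shipping whole cached blocks between
 * compute ranks and a memory-server rank.
 * Run with at least two MPI ranks, e.g. mpiexec -n 2 ./a.out
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK_DOUBLES (1 << 20)   /* one 8 MB block of doubles          */
#define MAX_BLOCKS    64          /* capacity of the toy memory server  */

enum { TAG_PUT = 1, TAG_GET = 2, TAG_DATA = 3, TAG_STOP = 4 };

/* Compute side: evict a DRAM-resident block to the memory server. */
static void block_put(int server, int id, double *buf)
{
    MPI_Send(&id, 1, MPI_INT, server, TAG_PUT, MPI_COMM_WORLD);
    MPI_Send(buf, BLOCK_DOUBLES, MPI_DOUBLE, server, TAG_DATA, MPI_COMM_WORLD);
}

/* Compute side: fetch a previously evicted block back into DRAM. */
static void block_get(int server, int id, double *buf)
{
    MPI_Send(&id, 1, MPI_INT, server, TAG_GET, MPI_COMM_WORLD);
    MPI_Recv(buf, BLOCK_DOUBLES, MPI_DOUBLE, server, TAG_DATA,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

/* Memory-server loop: store and return whole blocks keyed by id. */
static void memory_server(void)
{
    double *store[MAX_BLOCKS] = { 0 };
    for (;;) {
        int id;
        MPI_Status st;
        MPI_Recv(&id, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &st);
        if (st.MPI_TAG == TAG_STOP)
            break;
        if (st.MPI_TAG == TAG_PUT) {
            if (store[id] == NULL)
                store[id] = malloc(BLOCK_DOUBLES * sizeof(double));
            MPI_Recv(store[id], BLOCK_DOUBLES, MPI_DOUBLE, st.MPI_SOURCE,
                     TAG_DATA, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {  /* TAG_GET */
            MPI_Send(store[id], BLOCK_DOUBLES, MPI_DOUBLE, st.MPI_SOURCE,
                     TAG_DATA, MPI_COMM_WORLD);
        }
    }
    for (int i = 0; i < MAX_BLOCKS; i++)
        free(store[i]);
}

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int server = size - 1;        /* the last rank acts as memory server */

    if (rank == server) {
        memory_server();
    } else if (rank == 0) {
        /* Toy access pattern: evict one block, then "fault" it back in. */
        double *block = malloc(BLOCK_DOUBLES * sizeof(double));
        for (int i = 0; i < BLOCK_DOUBLES; i++)
            block[i] = (double)i;
        block_put(server, 0, block);                      /* evict        */
        memset(block, 0, BLOCK_DOUBLES * sizeof(double));
        block_get(server, 0, block);                      /* fetch back   */
        printf("block[42] = %.1f (expected 42.0)\n", block[42]);
        free(block);

        int dummy = 0;                          /* shut the server down */
        MPI_Send(&dummy, 1, MPI_INT, server, TAG_STOP, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}

In this sketch, plain two-sided sends keep the server passive and portable; an actual adaptive library would additionally track how many blocks may stay resident under the current memory availability, apply an eviction policy (for example, LRU over the resident blocks), and fall back to local disk when no memory server is available.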

[1] Sathish S. Vadhiyar et al. A performance oriented migration framework for the grid. 3rd IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 2003), 2003.

[2] Robert J. Harrison et al. An out-of-core implementation of the COLUMBUS massively-parallel multireference configuration interaction program. Proceedings of the IEEE/ACM SC98 Conference, 1998.

[3] Jeffrey Scott Vitter et al. A theoretical framework for memory-adaptive algorithms. 40th Annual Symposium on Foundations of Computer Science, 1999.

[4] Harvey Gould et al. An Introduction to Computer Simulation Methods: Applications to Physical Systems, 2006.

[5] Jarek Nieplocha et al. Exploiting processor groups to extend scalability of the GA shared memory programming model. CF '05, 2005.

[6] Miron Livny et al. Memory-Adaptive External Sorting. VLDB, 1993.

[7] Dror G. Feitelson et al. Gang scheduling with memory considerations. Proceedings of the 14th International Parallel and Distributed Processing Symposium (IPDPS 2000), 2000.

[8] Dimitrios S. Nikolopoulos et al. Adaptive Scheduling under Memory Pressure on Multiprogrammed Clusters. 2nd IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGRID'02), 2002.

[9] M. Schkolnick et al. 9th International Conference on Very Large Data Bases, 1983.

[10] Mary K. Vernon et al. Characteristics of a Large Shared Memory Production Workload. JSSPP, 2001.

[11] Todd C. Mowry et al. Taming the memory hogs: using compiler-inserted releases to manage physical memory intelligently. OSDI, 2000.

[12] Fangzhe Chang et al. User-level resource-constrained sandboxing, 2000.

[13] Rajkumar Buyya et al. High Performance Cluster Computing, 1999.

[14] Hyun-Wook Jin et al. Designing Efficient Cooperative Caching Schemes for Multi-Tier Data-Centers over RDMA-enabled Networks. Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGRID'06), 2006.

[15] Liviu Iftode et al. Home-based shared virtual memory, 1998.

[16] Pamela L. Eddy. College of William and Mary, 2004.

[17] Kai Li et al. Diskless Checkpointing. IEEE Transactions on Parallel and Distributed Systems, 1998.

[18] Sanjeev Setia et al. Dodo: a user-level system for exploiting idle memory in workstation clusters. Proceedings of the Eighth International Symposium on High Performance Distributed Computing, 1999.

[19] Dimitrios S. Nikolopoulos et al. Adapting to memory pressure from within scientific applications on multiprogrammed COWs. 18th International Parallel and Distributed Processing Symposium (IPDPS 2004), 2004.

[20] Richard T. Mills et al. Dynamic adaptation to CPU and memory load in scientific applications, 2004.

[21] Anna R. Karlin et al. Implementing cooperative prefetching and caching in a globally-managed memory system. SIGMETRICS '98/PERFORMANCE '98, 1998.

[22] Anna R. Karlin et al. Implementing global memory management in a workstation cluster. SOSP, 1995.

[23] Robert L. Henderson et al. Job Scheduling Under the Portable Batch System. JSSPP, 1995.

[24] Andrea C. Arpaci-Dusseau et al. Implicit coscheduling: coordinated scheduling with implicit information in distributed systems. ACM Transactions on Computer Systems (TOCS), 2001.

[25] Larry Rudolph et al. Evaluation of Design Choices for Gang Scheduling Using Distributed Hierarchical Control. Journal of Parallel and Distributed Computing, 1996.

[26] Sanjeev Setia et al. Availability and utility of idle memory in workstation clusters. SIGMETRICS '99, 1999.

[27] Amnon Barak et al. Memory ushering in a scalable computing cluster. Microprocessors and Microsystems, 1998.

[28] Evangelos P. Markatos et al. Implementation of a Reliable Remote Memory Pager. USENIX Annual Technical Conference, 1996.

[29] Dimitrios S. Nikolopoulos. Malleable memory mapping: user-level control of memory bounds for effective program adaptation. Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS), 2003.

[30] Yousef Saad et al. Parallel methods and tools for predicting material properties. Computing in Science & Engineering, 2000.

[31] Joel H. Saltz et al. The utility of exploiting idle workstations for parallel computation. SIGMETRICS '97, 1997.

[32] Ian T. Foster et al. Condor-G: A Computation Management Agent for Multi-Institutional Grids. Cluster Computing, 2004.

[33] Evgenia Smirni et al. Algorithmic modifications to the Jacobi-Davidson parallel eigensolver to dynamically balance external CPU and memory load. ICS '01, 2001.

[34] Scott Pakin et al. Dynamic Coscheduling on Workstation Clusters. JSSPP, 1998.

[35] Francine Berman et al. A Decoupled Scheduling Approach for the GrADS Program Development Environment. ACM/IEEE SC 2002 Conference (SC'02), 2002.

[36] Yunhao Liu et al. Parallel network RAM: effectively utilizing global cluster memory for large data-intensive parallel programs, 2004.

[37] L. Iftode et al. Memory servers for multicomputers. Digest of Papers, Compcon Spring, 1993.

[38] Wu-chun Feng et al. Time-Sharing Parallel Jobs in the Presence of Multiple Resource Requirements. JSSPP, 2000.