A Cluster Computer Performance Predictor for Memory Scheduling

Remote Memory Access (RMA) hardware allows a given motherboard in a cluster to directly access the memory installed in a remote motherboard of the same cluster. In recent works, this capability has been used to extend the addressable memory space of selected motherboards, which enables a better balance of main memory resources among cluster applications. This approach is much more cost-effective than implementing a full-fledged shared memory system. In this context, the memory scheduler is in charge of finding a distribution of local and remote memory that maximizes performance and guarantees a minimum QoS among the applications. Because changing the memory distribution is a slow process involving several motherboards, the memory scheduler needs to make sure that the target distribution provides better performance than the current one. In this paper, a performance predictor is designed to find the best memory distribution for a given set of applications executing in a cluster motherboard. The predictor uses simple hardware counters to estimate the expected impact of the different memory distributions on performance. The hardware counters provide the predictor with information about the time spent in the processor, in memory accesses, and in the network. The performance model used by the predictor has been validated in a detailed microarchitectural simulator using real benchmarks. Results show that the prediction never deviates by more than 5% from the real results, and the deviation is below 0.5% in most cases.
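
The abstract does not give the performance model itself, so the following is only a minimal sketch of how a counter-based predictor of this kind could be organized. It assumes a simple additive model (compute cycles plus local and remote memory stall cycles, with network time folded into the remote access latency), which ignores any overlap between computation and memory access as well as contention. All names (`CounterSample`, `predicted_cycles`, `best_distribution`, `should_migrate`) and the `min_gain` threshold are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class CounterSample:
    """Hypothetical per-application counter readings for one sampling interval."""
    cpu_cycles: float      # cycles spent computing, excluding memory stalls
    mem_accesses: float    # main-memory accesses issued in the interval
    local_latency: float   # average cycles per local memory access
    remote_latency: float  # average cycles per remote access, network included


def predicted_cycles(sample: CounterSample, remote_fraction: float) -> float:
    """Estimate execution cycles if `remote_fraction` of accesses go remote.

    Assumed additive model: compute + local stalls + remote stalls.
    """
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    local = sample.mem_accesses * (1.0 - remote_fraction) * sample.local_latency
    remote = sample.mem_accesses * remote_fraction * sample.remote_latency
    return sample.cpu_cycles + local + remote


def total_cycles(samples: Sequence[CounterSample],
                 distribution: Sequence[float]) -> float:
    """Sum the predicted cycles of all applications under one distribution."""
    return sum(predicted_cycles(s, f) for s, f in zip(samples, distribution))


def best_distribution(samples: Sequence[CounterSample],
                      candidates: Sequence[Tuple[float, ...]]) -> Tuple[float, ...]:
    """Pick the candidate (one remote fraction per application) with the
    lowest total predicted cycles."""
    return min(candidates, key=lambda d: total_cycles(samples, d))


def should_migrate(samples: Sequence[CounterSample],
                   current: Sequence[float],
                   target: Sequence[float],
                   min_gain: float = 0.05) -> bool:
    """Change the distribution only if the predicted relative gain exceeds
    `min_gain`, since redistributing memory is itself a slow operation."""
    now = total_cycles(samples, current)
    then = total_cycles(samples, target)
    return (now - then) / now > min_gain
```

As a usage example, take two applications with equal compute time where the first issues ten times more memory accesses than the second. Under this model, the sketch prefers the candidate that keeps the memory-intensive application local and pushes half of the other application's accesses to remote memory:

```python
apps = [CounterSample(1000, 100, 10, 50), CounterSample(1000, 10, 10, 50)]
candidates = [(0.5, 0.0), (0.0, 0.5)]
choice = best_distribution(apps, candidates)
```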
