Paravirtualization effect on single- and multi-threaded memory-intensive linear algebra software

Previous studies have shown that paravirtualization imposes minimal performance overhead on High Performance Computing (HPC) workloads while offering numerous benefits to the field. In this study, we investigate the impact of paravirtualization on the performance of automatically tuned software systems. We compare peak performance, performance degradation under constrained memory, performance degradation in multi-threaded applications, and inter-VM shared memory performance. For this comparison, we examine how well ATLAS, a quintessential example of an autotuning software system, tunes the BLAS library routines for paravirtualized systems. Our results show that the combination of ATLAS and Xen paravirtualization delivers native execution performance and nearly identical memory hierarchy performance profiles in both single- and multi-threaded scenarios. Furthermore, we show that memory can be shared among OS instances at native speeds. These results expose new benefits for memory-intensive applications, arising from the ability to slim down the guest OS without affecting system performance. In addition, our findings support a novel and attractive deployment scenario for computational science and engineering codes on virtual clusters and computational clouds.
