Management of an academic HPC cluster: The UL experience

The intensive growth of processing power, data storage and transmission capabilities has revolutionized many aspects of science. These resources are essential to achieve high-quality results in many application areas. In this context, the University of Luxembourg (UL) has operated since 2007 a High Performance Computing (HPC) facility, together with the related storage, run by a very small team. Bridging computing and storage is a requirement of the UL service, for reasons that are both legal (certain data may not be moved) and performance-related. Nowadays, people from the three faculties and the two Interdisciplinary centers of the UL are users of this facility. More specifically, key research priorities such as Systems Biomedicine (at the LCSB) and Security, Reliability & Trust (at the SnT) require access to such an HPC facility in order to operate in an adequate environment. The management of HPC solutions is a complex enterprise and a constant area for discussion and improvement. The UL HPC facility and the services deployed on top of it form a computing system that is complex to manage by its sheer scale: at the time of writing, it consists of 150 servers, 368 nodes (3880 computing cores) and 1996 TB of shared storage, all configured, monitored and operated by only three persons using advanced IT automation solutions based on Puppet [1], FAI [2] and Capistrano [3]. This paper covers all aspects of the management of such a complex infrastructure, whether technical or administrative. Most design choices and implemented approaches have been motivated by several years of experience in addressing research needs, mainly in the HPC area but also in complementary services (typically Web-based). In this context, we have tried to answer many technological issues in a flexible and convenient way. This experience report may be of interest to other research centers and universities, in either the public or the private sector, looking for good if not best practices in cluster architecture and management.
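
The abstract notes that the whole facility is kept consistent by a three-person team relying on IT automation (Puppet, FAI, Capistrano). As a purely illustrative sketch of the underlying fan-out pattern such tools provide, and not the actual UL HPC tooling, the following Python snippet runs one idempotent check over SSH on a list of nodes in parallel; the hostnames, the command and the concurrency level are hypothetical.

```python
#!/usr/bin/env python3
"""Illustrative sketch: run one command on many cluster nodes in parallel,
a simplified analogue of what Puppet/Capistrano-style automation provides."""

import subprocess
from concurrent.futures import ThreadPoolExecutor

# Hypothetical list of compute nodes reachable over SSH.
NODES = [f"node-{i:03d}.hpc.example.org" for i in range(1, 5)]

# Hypothetical idempotent check: report the installed Puppet agent version.
COMMAND = "puppet --version"


def run_on_node(node: str) -> tuple:
    """Run COMMAND on a single node via SSH and return (node, rc, output)."""
    proc = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", node, COMMAND],
        capture_output=True,
        text=True,
        timeout=30,
    )
    return node, proc.returncode, (proc.stdout or proc.stderr).strip()


if __name__ == "__main__":
    # Fan the command out to all nodes concurrently, as a deployment tool would.
    with ThreadPoolExecutor(max_workers=16) as pool:
        for node, rc, output in pool.map(run_on_node, NODES):
            status = "OK" if rc == 0 else f"FAILED (rc={rc})"
            print(f"{node:<40} {status:<15} {output}")
```

In practice the tools cited in the abstract go further than this sketch: Puppet and FAI describe the desired state of a node declaratively and converge it, rather than pushing ad-hoc commands.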

[1] Georges Da Costa et al. 2005 IEEE International Symposium on Cluster Computing and the Grid (CCGRID), 2005.

[2] Andy Georges et al. "EasyBuild: Building Software with Ease." 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, 2012.

[3] Garrick Staples et al. "TORQUE Resource Manager." SC, 2006.

[4] Andrew J. Hutton et al. "Lustre: Building a File System for 1,000-node Clusters." 2003.

[5] Peter W. Osel et al. "Abstract Yourself With Modules." LISA, 1996.