HPC Environment Management: New Challenges in the Petaflop Era

High performance computing (HPC) is becoming increasingly popular. The largest supercomputers in the world now have hundreds of thousands of processors and are consequently more prone to software and hardware failures. Managers of HPC centers also have to deal with multiple clusters from different vendors, each with its own architecture. However, because there are not enough HPC experts to manage all the new supercomputers, non-experts are expected to manage these large clusters. In this paper we study the new challenges of managing HPC environments that contain clusters of different sizes and architectures. We review the available tools and present LEMMing [1], an easy-to-use open source tool developed to support high performance computing centers. LEMMing integrates machine resources and the available management and monitoring tools into a single point of management.

[1] Wolfgang Barth, et al. Nagios: System and Network Monitoring, 2006.

[2] David E. Culler, et al. The ganglia distributed monitoring system: design, implementation, and experience, 2004, Parallel Comput.

[3] Wolfgang Gentzsch, et al. Sun Grid Engine: towards creating a compute power grid, 2001, Proceedings First IEEE/ACM International Symposium on Cluster Computing and the Grid.

[4] Michael Lang, et al. Entering the petaflop era: The architecture and performance of Roadrunner, 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.

[5] Marcel Gagné. Zimbra collaboration suite, Version 4.5, 2007.

[6] Philip M. Papadopoulos, et al. NPACI: rocks: tools and techniques for easily deploying manageable Linux clusters, 2001, Proceedings 42nd IEEE Symposium on Foundations of Computer Science.

[7] Linda Dailey Paulson, et al. Building Rich Web Applications with Ajax, 2005, Computer.

[8] George Bosilca, et al. Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation, 2004, PVM/MPI.

[9] Henri Chen, et al. ZK Step-By-Step: Ajax without JavaScript Framework, 2007.

[10] Clarence A. Ellis, et al. Groupware: some issues and experiences, 1991, CACM.

[11] L.D. Paulson. Will hard drives finally stop shrinking?, 2005, Computer.

[12] Garrick Staples, et al. TORQUE resource manager, 2006, SC.

[13] William Gropp, et al. MPICH2: A New Start for MPI Implementations, 2002, PVM/MPI.

[14] Thomas Naughton, et al. Open Source Cluster Application Resources (OSCAR): design, implementation and interest for the (computer) scientific community, 2003.

[15] Norman P. Jouppi, et al. CACTI 3.0: an integrated cache timing, power, and area model, 2001.