Lessons learned while operating two large SCI clusters

The availability of commodity high performance components for workstations and networks made it possible to build up large, PC based compute clusters at modest costs. These clusters seem to be a realistic alternative to proprietary, massively parallel systems with respect to the price/performance ratio. However, from the administration point of view, those systems are still often solely a collection of autonomous nodes, connected by a fast short area network. Therefore, aiming at providing the best possible performance in daily work to all users, a lot of work has to be done before obtaining the expected result. The paper describes the problem areas we had to cope with during the integration of two large SCI clusters (one with 64 and one with 192 processors) in the environment of the Paderborn Center for Parallel Computing.