The Impact of Noise on the Scaling of Collectives: A Theoretical Approach

The performance of parallel applications running on large clusters is known to degrade due to the interference of kernel and daemon activities on individual nodes, often referred to as noise. In this paper, we focus on an important class of parallel applications, which repeatedly perform computation, followed by a collective operation such as a barrier. We model this theoretically and demonstrate, in a rigorous way, the effect of noise on the scalability of such applications. We study three natural and important classes of noise distributions: The exponential distribution, the heavy-tailed distribution, and the Bernoulli distribution. We show that the systems scale well in the presence of exponential noise, but the performance goes down drastically in the presence of heavy-tailed or Bernoulli noise.

[1]  Anoop Gupta,et al.  The impact of operating system scheduling policies and synchronization methods of performance of parallel applications , 1991, SIGMETRICS '91.

[2]  Van Jacobson,et al.  The synchronization of periodic routing messages , 1994, TNET.

[3]  Yutaka Ishikawa,et al.  Highly Efficient Gang Scheduling Implementation , 1998, Proceedings of the IEEE/ACM SC98 Conference.

[4]  Liana L. Fong,et al.  A Gang-Scheduling System for ASCI Blue-Pacific , 1999, HPCN Europe.

[5]  Lada A. Adamic,et al.  Power-Law Distribution of the World Wide Web , 2000, Science.

[6]  Scott Pakin,et al.  STORM: Lightning-Fast Resource Management , 2002, ACM/IEEE SC 2002 Conference (SC'02).

[7]  Fabrizio Petrini,et al.  Scalable collective communication on the ASCI Q machine , 2003, 11th Symposium on High Performance Interconnects, 2003. Proceedings..

[8]  Dror G. Feitelson,et al.  Flexible coscheduling: mitigating load imbalance and improving utilization of heterogeneous resources , 2003, Proceedings International Parallel and Distributed Processing Symposium.

[9]  Chita R. Das,et al.  Co-ordinated coscheduling in time-sharing clusters through a generic framework , 2003, 2003 Proceedings IEEE International Conference on Cluster Computing.

[10]  Terry Jones,et al.  Improving the Scalability of Parallel Jobs by adding Parallel Awareness to the Operating System , 2003, ACM/IEEE SC 2003 Conference (SC'03).

[11]  Terry Jones,et al.  Impacts of Operating Systems on the Scalability of Parallel Applications , 2003 .

[12]  D.K. Panda,et al.  Scalable NIC-based Reduction on Large-scale Clusters , 2003, ACM/IEEE SC 2003 Conference (SC'03).

[13]  William T. C. Kramer,et al.  Performance Variability of Highly Parallel Architectures , 2003, International Conference on Computational Science.

[14]  Scott Pakin,et al.  The Case of the Missing Supercomputer Performance: Achieving Optimal Performance on the 8, 192 Processors of ASCI Q , 2003, SC.

[15]  Scott Pakin,et al.  Identifying and Eliminating the Performance Variability on the ASCI Q Machine , 2003 .

[16]  R. Gioiosa,et al.  Analysis of system overhead on parallel computers , 2004, Proceedings of the Fourth IEEE International Symposium on Signal Processing and Information Technology, 2004..