Instability in parallel job scheduling simulation: the role of workload flurries

The performance of computer systems depends, among other things, on the workload. This motivates the use of real workloads (as recorded in activity logs) to drive simulations of new designs. Unfortunately, real workloads may contain various anomalies that contaminate the data. A previously unrecognized type of anomaly is the workload flurry: a rare surge of repetitive activity, caused by a single user, that dominates the workload for a relatively short period. We find that long workloads often include at least one such event. We show that in the context of parallel job scheduling these events can have a significant effect on performance evaluation results; for example, a very small perturbation of the simulation conditions might lead to a large and disproportionate change in the outcome. This instability arises because the jobs in a flurry are affected in unison, a consequence of the flurry's repetitive nature. We therefore advocate that flurries be filtered out before the workload is used, in order to achieve stable and more reliable evaluation results (analogously to the removal of outliers in statistical analysis). At the same time, we note that more research is needed on the possible effects of flurries.
