A similarity measure for time, frequency, and dependencies in large-scale workloads

Performance evaluations of large-scale systems require representative workloads whose characteristics are certifiably similar or dissimilar. To quantify this similarity, we describe a novel measure comprising two efficient methods that are suitable for large-scale workloads. The first method uses the discrete wavelet transform to assess the periodic time and frequency characteristics of a workload. The second method evaluates dependencies among descriptive attributes via association rule learning. Both methods are evaluated to find the limits of their similarity spaces. Additionally, the wavelet method is compared against existing similarity methods and tested for noise robustness and random bias. An empirical study using workloads from seven operational large-scale systems evaluates the measure's accuracy. The results show that our measure is highly resistant to noise, is well suited to large-scale workloads, covers 87% of the possible similarity space, and improves accuracy by 24.5% and standard deviation by 10.8% compared with existing work.
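
To make the wavelet idea concrete, the following is a minimal sketch of comparing two time series in the Haar wavelet domain. It is not the measure described in the abstract: the function names `haar_dwt` and `wavelet_distance`, the choice of the Haar wavelet, and the use of Euclidean distance on the coefficients are all assumptions made for illustration only.

```python
import math


def haar_dwt(signal):
    """Full orthonormal Haar decomposition (illustrative helper, hypothetical name).

    Returns [overall approximation, coarsest detail, ..., finest details].
    The input length must be a power of two.
    """
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a non-zero power of two")
    approx = [float(v) for v in signal]
    details = []
    root2 = math.sqrt(2.0)
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        detail = [(a - b) / root2 for a, b in pairs]
        approx = [(a + b) / root2 for a, b in pairs]
        # Prepend so that coarser (lower-frequency) coefficients come first.
        details = detail + details
    return approx + details


def wavelet_distance(x, y, keep=None):
    """Euclidean distance between Haar coefficients (assumed comparison rule).

    If `keep` is given, only the first `keep` (coarsest) coefficients are used,
    which yields a cheaper lower bound on the full distance.
    """
    if len(x) != len(y):
        raise ValueError("series must have equal length")
    cx, cy = haar_dwt(x), haar_dwt(y)
    if keep is not None:
        cx, cy = cx[:keep], cy[:keep]
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(cx, cy)))


if __name__ == "__main__":
    # Hypothetical hourly job-arrival counts for two workloads.
    load_a = [4, 6, 10, 12, 8, 6, 5, 3]
    load_b = [5, 6, 9, 13, 7, 6, 4, 3]
    print(wavelet_distance(load_a, load_b))          # all coefficients
    print(wavelet_distance(load_a, load_b, keep=2))  # coarse view only
```

Because this Haar transform is orthonormal, the distance over all coefficients equals the ordinary Euclidean distance between the series, and truncating to the coarsest coefficients can only reduce it. That property is what makes a coarse, low-cost comparison usable as a first-pass filter on long traces.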
