Cross-Correlation as Tool to Determine the Similarity of Series of Measurements for Big-Data Analysis Tasks

One aspect of the so-called Big Data challenge is the rising quantity of data in almost all scientific, social, governmental and commercial disciplines. As a result, many analysis techniques are under development that replace manual processes with automatic or semi-automatic algorithms. This means that the knowledge of data analysts has to be transferred to algorithms which can be executed simultaneously on many data sets. In this way, the growing amount of data can be analysed at a constant level of quality and in less time. Although the number of existing algorithms is enormous, a ready-to-use solution does not exist for every problem. We found a lack of options especially for analysing and comparing series of measurements, e.g. for analysing activity tracker data or for monitoring service execution infrastructures. We therefore explain the basics of an algorithm that uses the cross-correlation function to determine a meaningful similarity value for two or more series of measurements. We used the new method to analyse and categorise job-centric monitoring data.
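As an illustration of the general idea, the following minimal sketch computes a similarity value for two series of measurements from their normalised cross-correlation. It is not the algorithm of the paper, whose details are not given here. It assumes that both series are mean-centred, that the correlation is scaled by the product of the series' norms so that the result lies in [-1, 1], and that the maximum over all lags is taken as the similarity. The function names `center` and `cross_correlation_similarity` are hypothetical.

```python
import math


def center(series):
    """Return the series with its mean subtracted."""
    mean = sum(series) / len(series)
    return [value - mean for value in series]


def cross_correlation_similarity(x, y):
    """Return (similarity, lag) for two series of measurements.

    Assumption: similarity is the maximum of the normalised
    cross-correlation over all lags. A positive lag means that
    y is delayed relative to x by that many samples.
    """
    xc, yc = center(x), center(y)
    norm = math.sqrt(sum(a * a for a in xc) * sum(b * b for b in yc))
    if norm == 0.0:
        # At least one series is constant; correlation is undefined,
        # so this sketch reports zero similarity.
        return 0.0, 0

    best_value, best_lag = -math.inf, 0
    for lag in range(-(len(xc) - 1), len(yc)):
        # Sum the products of all overlapping samples at this lag.
        total = sum(
            xc[i] * yc[i + lag]
            for i in range(len(xc))
            if 0 <= i + lag < len(yc)
        )
        value = total / norm
        if value > best_value:
            best_value, best_lag = value, lag
    return best_value, best_lag


if __name__ == "__main__":
    a = [0, 0, 1, 2, 1, 0, 0, 0]
    b = [0, 0, 0, 0, 1, 2, 1, 0]  # same peak, two samples later
    print(cross_correlation_similarity(a, b))
```

Taking the maximum over all lags makes the value insensitive to a time shift between otherwise similar series. The direct summation costs O(n·m) operations, so for long series an FFT-based cross-correlation would be the usual replacement.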
