A Dissimilarity Measure for Comparing Subsets of Data: Application to Multivariate Time Series

Similarity is a central concept in data mining. Many techniques, such as clustering and classification, use similarity or distance measures to compare subsets of multivariate data. However, most of these measures are designed only to find the distance between a pair of records or attributes within a data set, not to compare whole data sets against one another. In this paper we present a novel dissimilarity measure, based on principal component analysis, for comparing entire data sets, and time series data sets in particular. Our measure accounts for the correlation structure of the data and can be tuned by the user to incorporate domain knowledge. It is useful for tasks such as change point detection, anomaly detection, and clustering, in fields such as intrusion detection, clinical trial data analysis, and stock analysis.
