Pattern change discovery between high-dimensional data sets

This paper investigates the general problem of pattern change discovery between high-dimensional data sets. Existing methods either focus mainly on magnitude change detection in low-dimensional data sets or operate under supervised frameworks. In this paper, the notion of principal angles between subspaces is introduced to measure the subspace difference between two high-dimensional data sets. Principal angles have the property of isolating subspace change from magnitude change. To address the challenge of computing the principal angles directly, we use matrix factorization as a statistical framework and develop the principle of dominant subspace mapping, which transfers principal-angle-based detection to a matrix factorization problem. We show how matrix factorization can be naturally embedded into the likelihood ratio test based on linear models. The proposed method is unsupervised and addresses the statistical significance of pattern changes between high-dimensional data sets. We demonstrate the power and effectiveness of the method on several real-world applications.
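As a small illustration of the principal-angle notion used above, the following sketch computes the principal angles between the column spaces of two matrices with the standard QR-plus-SVD recipe. It is not the paper's dominant subspace mapping or its likelihood ratio test; it only shows, on small synthetic data, why the angles are insensitive to magnitude change. The function name `principal_angles`, the random data, and the assumption that both matrices have full column rank are illustrative choices, not taken from the paper.

```python
import numpy as np

def principal_angles(A, B):
    """Principal angles (radians, ascending) between the column spaces of A and B.

    Assumes A and B have the same number of rows and full column rank.
    """
    # Orthonormal bases for the two column spaces.
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    # Singular values of Qa^T Qb are the cosines of the principal angles.
    cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    # Clip to guard against round-off pushing values just outside [-1, 1].
    return np.arccos(np.clip(cosines, -1.0, 1.0))

rng = np.random.default_rng(0)

# Hypothetical data: 50 features, 3 basis directions.
X = rng.standard_normal((50, 3))

# Pure magnitude change: same subspace, so all angles are numerically zero.
angles_scaled = principal_angles(X, 10.0 * X)

# A different random subspace: the angles are clearly nonzero.
Y = rng.standard_normal((50, 3))
angles_other = principal_angles(X, Y)

print("scaled copy:", angles_scaled)
print("different subspace:", angles_other)
```

Scaling `X` leaves its column space unchanged, so the first set of angles is zero up to floating-point error, while an unrelated subspace yields large angles. SciPy offers a more numerically careful routine for small angles, `scipy.linalg.subspace_angles`, which could be used instead of this hand-written version.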
