Formalising the subjective interestingness of a linear projection of a data set: two examples

The generic framework for formalising the subjective interestingness of patterns presented in [2] has already been applied to a number of data mining problems, including itemset (tile) mining [3, 8, 9], multi-relational pattern mining [18, 19, 20], clustering [10], and bi-clustering [12, 11]. It has also been pointed out, without detail, that Principal Component Analysis (PCA) [7] can be derived from this framework [2]. This short note describes work in progress that aims to show in greater detail how this can be done. It also shows how the framework leads to a robust variant of PCA when it is used to formalise the subjective interestingness of a data projection for a user who expects outliers to be present in the data.

[1] K. Zografos, et al. On maximum entropy characterization of Pearson's type II and VII multivariate distributions, 1999.

[2] Tijl De Bie, et al. Mining Interesting Patterns in Multi-relational Data with N-ary Relationships, 2013, Discovery Science.

[3] Tijl De Bie, et al. Interesting pattern mining in multi-relational data, 2013, Data Mining and Knowledge Discovery.

[4] Jason Morphett, et al. An integrated algorithm of incremental and robust PCA, 2003, Proceedings 2003 International Conference on Image Processing (Cat. No.03CH37429).

[5] P. Rousseeuw, et al. A fast algorithm for the minimum covariance determinant estimator, 1999.

[6] Yi Ma, et al. Robust principal component analysis?, 2009, JACM.

[7] Tijl De Bie, et al. An Information-Theoretic Approach to Finding Informative Noisy Tiles in Binary Databases, 2010, SDM.

[8] Tijl De Bie, et al. Maximum entropy models and subjective interestingness: an application to tiles in binary databases, 2010, Data Mining and Knowledge Discovery.

[9] Robin Sibson, et al. What is projection pursuit, 1987.

[10] Tijl De Bie, et al. Maximum Entropy Modelling for Assessing Results on Real-Valued Data, 2011, 2011 IEEE 11th International Conference on Data Mining.

[11] Y. L. Tong, et al. The Multivariate t Distribution, 1990.

[12] Ian T. Jolliffe, et al. Principal Component Analysis, 2002, International Encyclopedia of Statistical Science.

[13] Tijl De Bie, et al. Interesting Multi-relational Patterns, 2011, 2011 IEEE 11th International Conference on Data Mining.

[14] P. Rousseeuw, et al. Unmasking Multivariate Outliers and Leverage Points, 1990.

[15] Tijl De Bie, et al. Subjectively interesting alternative clusterings, 2013, Machine Learning.

[16] Katrien van Driessen, et al. A Fast Algorithm for the Minimum Covariance Determinant Estimator, 1999, Technometrics.

[17] Constantine Caramanis, et al. Robust PCA via Outlier Pursuit, 2010, IEEE Transactions on Information Theory.

[18] Mia Hubert, et al. ROBPCA: A New Approach to Robust Principal Component Analysis, 2005, Technometrics.

[19] John W. Tukey, et al. A Projection Pursuit Algorithm for Exploratory Data Analysis, 1974, IEEE Transactions on Computers.

[20] Tijl De Bie, et al. Formalizing Complex Prior Information to Quantify Subjective Interestingness of Frequent Pattern Sets, 2012, IDA.

[21] A. McNeil. Multivariate t Distributions and Their Applications, 2006.

[22] Tijl De Bie, et al. An information theoretic framework for data mining, 2011, KDD.

[23] Tijl De Bie, et al. Maximum Entropy Models for Iteratively Identifying Subjectively Interesting Structure in Real-Valued Data, 2013, ECML/PKDD.