Efficient discovery of contrast subspaces for object explanation and characterization

We tackle the novel problem of mining contrast subspaces. Given a set of multidimensional objects in two classes $C_+$ and $C_-$ and a query object $o$, we want to find the top-$k$ subspaces that maximize the ratio of likelihood of $o$ in $C_+$ against that in $C_-$. Such subspaces are very useful for characterizing an object and explaining how it differs between two classes. We demonstrate that this problem has important applications, and, at the same time, is very challenging, being MAX SNP-hard. We present CSMiner, a mining method that uses kernel density estimation in conjunction with various pruning techniques. We experimentally investigate the performance of CSMiner on a range of data sets, evaluating its efficiency, effectiveness, and stability and demonstrating it is substantially faster than a baseline method.
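To make the problem statement concrete, the following is a minimal brute-force sketch of the likelihood-contrast idea. It is not CSMiner: it enumerates every subspace and applies none of the pruning techniques mentioned above. The function names (`kde_likelihood`, `top_k_contrast_subspaces`), the Gaussian product kernel, the Silverman-style rule-of-thumb bandwidth, and the small `eps` guard against division by zero are all assumptions made for illustration.

```python
from itertools import combinations

import numpy as np


def kde_likelihood(o, data, dims):
    """Gaussian-kernel density estimate of `o` in `data`, restricted to `dims`.

    The per-dimension bandwidth follows a Silverman-style rule of thumb,
    which is an assumption of this sketch.
    """
    cols = list(dims)
    x = data[:, cols]
    q = o[cols]
    n, d = x.shape
    sigma = x.std(axis=0, ddof=1)
    h = sigma * (4.0 / ((d + 2) * n)) ** (1.0 / (d + 4))
    h = np.maximum(h, 1e-12)  # avoid zero bandwidth on constant columns
    z = (x - q) / h
    kernel_sum = np.exp(-0.5 * np.sum(z * z, axis=1)).sum()
    norm = n * np.prod(h) * (2.0 * np.pi) ** (d / 2.0)
    return kernel_sum / norm


def top_k_contrast_subspaces(o, pos, neg, k=3, max_dim=None, eps=1e-12):
    """Return the k subspaces with the largest likelihood ratio for `o`.

    Exhaustive enumeration: cost grows exponentially with the number of
    dimensions, so this is only usable on small examples.
    """
    n_dims = pos.shape[1]
    max_dim = n_dims if max_dim is None else max_dim
    scored = []
    for size in range(1, max_dim + 1):
        for dims in combinations(range(n_dims), size):
            ratio = kde_likelihood(o, pos, dims) / (
                kde_likelihood(o, neg, dims) + eps
            )
            scored.append((ratio, dims))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]
```

Calling `top_k_contrast_subspaces(o, pos, neg, k=3)` with NumPy arrays `pos` and `neg` of shape `(n, d)` and a query vector `o` of length `d` returns a list of `(ratio, dims)` pairs, highest ratio first. The exhaustive loop visits all $2^d - 1$ non-empty subspaces, which illustrates why a pruning-based method is needed in practice.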
