GENCCS: A Correlated Group Difference Approach to Contrast Set Mining

Contrast set mining is a data mining task that aims to discern differences among groups. These groups can be patients, organizations, molecules, or even timelines, and are defined by a selected property that distinguishes one from another. A contrast set is a conjunction of attribute-value pairs whose distribution differs significantly across groups. The search for contrast sets can be prohibitively expensive on relatively large datasets because every combination of attribute-value pairs must be examined, so the search space can grow exponentially. In this paper, we introduce the notion of a correlated group difference (CGD) and propose a contrast set mining technique that uses mutual information and all-confidence to select the most highly correlated attribute-value pairs from which to mine CGDs. Our experiments on real datasets demonstrate the efficiency of our approach and the interestingness of the CGDs discovered.
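
To make the two correlation measures named above concrete, the following is a minimal sketch using their standard textbook definitions: mutual information between two discrete attributes, and all-confidence of a set of attribute-value pairs (joint support divided by the largest single-pair support). It is not the GENCCS implementation. The function names, the representation of a dataset as a list of dictionaries, and the `group_supports` helper are assumptions made for illustration only.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information I(X;Y) in bits between two equal-length
    sequences of discrete values (standard definition)."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), count in pxy.items():
        p_joint = count / n
        # p(x,y) / (p(x) p(y)) rewritten with raw counts
        mi += p_joint * math.log2(count * n / (px[x] * py[y]))
    return mi


def support_count(rows, pairs):
    """Number of rows that contain every (attribute, value) pair."""
    return sum(all(row.get(a) == v for a, v in pairs) for row in rows)


def all_confidence(rows, pairs):
    """All-confidence: joint support over the largest single-pair support."""
    max_single = max(support_count(rows, [p]) for p in pairs)
    return support_count(rows, pairs) / max_single if max_single else 0.0


def group_supports(rows, pairs, group_attr):
    """Hypothetical helper: relative support of `pairs` within each group,
    where the group is given by the value of `group_attr`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[group_attr], []).append(row)
    return {g: support_count(members, pairs) / len(members)
            for g, members in groups.items()}


# Toy data (invented for illustration, not taken from the paper).
rows = [
    {"group": "A", "smoker": "yes", "bp": "high"},
    {"group": "A", "smoker": "yes", "bp": "high"},
    {"group": "A", "smoker": "no", "bp": "low"},
    {"group": "B", "smoker": "yes", "bp": "low"},
    {"group": "B", "smoker": "no", "bp": "low"},
    {"group": "B", "smoker": "no", "bp": "high"},
]
candidate = [("smoker", "yes"), ("bp", "high")]
print(all_confidence(rows, candidate))
print(group_supports(rows, candidate, "group"))
```

In this sketch, a candidate whose all-confidence is low would be discarded before its per-group supports are compared, which is one plausible way a correlation filter could shrink the search space. The thresholds and the exact order of the checks in the paper's method are not given in the abstract.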
