Detecting Group Differences: Mining Contrast Sets

A fundamental task in data analysis is understanding the differences between several contrasting groups. These groups can represent different classes of objects, such as male and female students, or the same group over time, e.g., freshman students from 1993 through 1998. We present the problem of mining contrast sets: conjunctions of attributes and values that differ meaningfully in their distribution across groups. We provide a search algorithm for mining contrast sets with pruning rules that drastically reduce the computational complexity. Once the contrast sets are found, we post-process the results to present a subset that is surprising to the user given what we have already shown. We explicitly control the probability of Type I error (false positives) and guarantee a maximum error rate for the entire analysis by using Bonferroni corrections.
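As a rough illustration of the idea, the sketch below tests whether a single candidate contrast set is distributed differently across two groups, then applies a Bonferroni threshold over several candidates. It is a minimal sketch under stated assumptions, not the paper's algorithm: the function names (`count_matches`, `chi2_2x2`, `bonferroni_significant`), the dictionary-per-row data layout, the restriction to two groups, and the plain `alpha / m` correction are all hypothetical choices made for the example, and the search and pruning steps are omitted.

```python
import math

def count_matches(rows, contrast_set):
    """Count rows (dicts) matching every attribute=value pair in contrast_set."""
    return sum(
        all(row.get(attr) == val for attr, val in contrast_set.items())
        for row in rows
    )

def chi2_2x2(hit_a, n_a, hit_b, n_b):
    """Pearson chi-square statistic and p-value (1 degree of freedom)
    for a 2x2 table: group (A, B) by contrast set (match, no match)."""
    observed = [[hit_a, n_a - hit_a], [hit_b, n_b - hit_b]]
    row_totals = [n_a, n_b]
    col_totals = [hit_a + hit_b, (n_a - hit_a) + (n_b - hit_b)]
    total = n_a + n_b
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            if expected == 0:
                # Degenerate table: no evidence of a difference.
                return 0.0, 1.0
            stat += (observed[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom, the chi-square survival function
    # equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

def bonferroni_significant(p_values, alpha=0.05):
    """Flag each test as significant at the Bonferroni threshold alpha / m."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

# Toy data (made up for illustration): two groups of students.
group_a = [{"major": "cs", "year": 1}] * 30 + [{"major": "bio", "year": 1}] * 70
group_b = [{"major": "cs", "year": 1}] * 10 + [{"major": "bio", "year": 1}] * 90
candidate = {"major": "cs", "year": 1}

stat, p = chi2_2x2(
    count_matches(group_a, candidate), len(group_a),
    count_matches(group_b, candidate), len(group_b),
)
print(f"chi2 = {stat:.2f}, p = {p:.6f}")
```

Dividing `alpha` by the number of tests keeps the overall probability of any false positive at or below `alpha`, which is the kind of guarantee the abstract describes, at the cost of a stricter threshold for each individual test.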
