A Hierarchical Approach to Anomalous Subgroup Discovery

Understanding the peculiar and anomalous behavior of machine learning models on specific data subgroups is a fundamental building block of model performance and fairness evaluation. Analyzing these subgroups can provide useful insights into a model's inner workings and highlight potentially discriminatory behavior. Current approaches to subgroup exploration ignore the presence of hierarchies in the data and can only be applied to discretized attributes. The discretization required for continuous attributes may significantly affect which relevant subgroups are identified. We propose a hierarchical subgroup exploration technique that identifies anomalous subgroup behavior at multiple granularity levels, along with a technique for the hierarchical discretization of data attributes. The hierarchical discretization produces, for each continuous attribute, a hierarchy of intervals. The subsequent hierarchical exploration can exploit data hierarchies, selecting for each attribute the optimal granularity to identify subgroups that are both anomalous and large enough to be statistically and practically significant. We show that, compared to non-hierarchical approaches, our hierarchical approach is more powerful in identifying anomalous subgroups and more stable with respect to discretization and exploration parameters.
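To make the notion of a "hierarchy of intervals" concrete, the following is a minimal sketch of how one continuous attribute could be discretized top-down into nested intervals. It is an illustration under stated assumptions, not the method proposed in the paper: the median split is a stand-in for whatever split criterion the authors use, and the names `Interval`, `build_hierarchy`, `levels`, and `min_support` are hypothetical. The only property it is meant to show is that each interval is refined into sub-intervals only while every sub-interval keeps enough elements.

```python
from dataclasses import dataclass, field
from statistics import median
from typing import Iterator, List, Sequence, Tuple


@dataclass
class Interval:
    """One node of the interval hierarchy for a single attribute (hypothetical type)."""
    low: float
    high: float
    count: int
    children: List["Interval"] = field(default_factory=list)


def build_hierarchy(values: Sequence[float], min_support: int) -> Interval:
    """Recursively split a non-empty list of values at its median.

    Assumption: a median split stands in for the paper's split criterion.
    A node is split only if both halves keep at least `min_support` values.
    """
    ordered = sorted(values)
    node = Interval(ordered[0], ordered[-1], len(ordered))
    cut = median(ordered)
    left = [v for v in ordered if v <= cut]
    right = [v for v in ordered if v > cut]
    if len(left) >= min_support and len(right) >= min_support:
        node.children = [
            build_hierarchy(left, min_support),
            build_hierarchy(right, min_support),
        ]
    return node


def levels(node: Interval, depth: int = 0) -> Iterator[Tuple[int, Interval]]:
    """Yield (depth, interval) pairs in pre-order, coarsest interval first."""
    yield depth, node
    for child in node.children:
        yield from levels(child, depth + 1)


if __name__ == "__main__":
    ages = list(range(1, 17))  # toy attribute values, not data from the paper
    root = build_hierarchy(ages, min_support=4)
    for depth, interval in levels(root):
        print("  " * depth + f"[{interval.low}, {interval.high}] n={interval.count}")
```

An exploration step could then choose, per attribute, any level of this tree: coarse intervals give large subgroups, while finer ones isolate more specific behavior at the cost of fewer elements.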
