Representation Matters: Assessing the Importance of Subgroup Allocations in Training Data

Collecting more diverse and representative training data is often touted as a remedy for the disparate performance of machine learning predictors across subpopulations. However, a precise framework for understanding how dataset properties like diversity affect learning outcomes is largely lacking. By casting data collection as part of the learning process, we demonstrate that diverse representation in training data is key not only to increasing subgroup performance, but also to achieving population-level objectives. Our analysis and experiments describe how dataset composition influences performance and provide constructive results for using trends in existing data, alongside domain knowledge, to help guide intentional, objective-aware dataset design.
