FACET: Fairness in Computer Vision Evaluation Benchmark

Computer vision models have known performance disparities across attributes such as gender and skin tone. This means that during tasks such as classification and detection, model performance differs for certain classes depending on the demographics of the people in the image. These disparities have been shown to exist, but until now there has not been a unified approach to measure these differences for common use cases of computer vision models. We present a new benchmark named FACET (FAirness in Computer Vision EvaluaTion), a large, publicly available evaluation set of 32k images for some of the most common vision tasks: image classification, object detection, and segmentation. For every image in FACET, we hired expert reviewers to manually annotate person-related attributes such as perceived skin tone and hair type, to draw bounding boxes, and to label fine-grained person-related classes such as disc jockey or guitarist. In addition, we use FACET to benchmark state-of-the-art vision models and present a deeper understanding of potential performance disparities and challenges across sensitive demographic attributes. With the exhaustive annotations collected, we probe models using single demographic attributes as well as multiple attributes using an intersectional approach (e.g., hair color and perceived skin tone). Our results show that classification, detection, segmentation, and visual grounding models exhibit performance disparities across demographic attributes and intersections of attributes. These harms suggest that not all people represented in datasets receive fair and equitable treatment in these vision tasks. We hope current and future results using our benchmark will contribute to fairer, more robust vision models. FACET is publicly available at https://facet.metademolab.com/
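As a rough illustration of the kind of per-attribute and intersectional probing described above, the sketch below computes classification recall within each annotated group and reports the largest gap between groups as a simple disparity measure. This is a minimal sketch only: the record schema (`label`, `pred`, attribute fields), the attribute values, and the helper functions are hypothetical and are not FACET's released evaluation code.

```python
# Hypothetical sketch, not FACET's released evaluation code: compute recall
# within each annotated attribute group and report the largest pairwise gap.
from collections import defaultdict

def recall_by_group(records, group_keys):
    """records: dicts with 'label', 'pred', and attribute fields (assumed schema).
    group_keys: attribute names to group by; passing more than one key gives an
    intersectional grouping, e.g. ("hair_color", "skin_tone")."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        group = tuple(r[k] for k in group_keys)
        totals[group] += 1
        if r["pred"] == r["label"]:
            hits[group] += 1
    return {g: hits[g] / totals[g] for g in totals}

def max_recall_gap(per_group_recall):
    """Largest difference in recall between any two groups; 0 means parity on this metric."""
    values = list(per_group_recall.values())
    return max(values) - min(values)

if __name__ == "__main__":
    # Toy records with made-up attribute values, for illustration only.
    toy = [
        {"skin_tone": "lighter", "hair_color": "blond", "label": "guitarist", "pred": "guitarist"},
        {"skin_tone": "lighter", "hair_color": "black", "label": "guitarist", "pred": "guitarist"},
        {"skin_tone": "darker",  "hair_color": "black", "label": "guitarist", "pred": "guitarist"},
        {"skin_tone": "darker",  "hair_color": "black", "label": "guitarist", "pred": "nurse"},
    ]
    single = recall_by_group(toy, ("skin_tone",))
    intersectional = recall_by_group(toy, ("hair_color", "skin_tone"))
    print(single, max_recall_gap(single))
    print(intersectional, max_recall_gap(intersectional))
```

A recall gap of 0 on a given grouping indicates parity on this particular metric; larger gaps flag groups that warrant closer inspection.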
