Underspecification in Scene Description-to-Depiction Tasks

Questions of implicitness, ambiguity, and underspecification are crucial for understanding the task validity and ethical concerns of multimodal image+text systems, yet they have received little attention to date. This position paper maps out a conceptual framework to address this gap, focusing on systems that generate images depicting scenes from scene descriptions. In doing so, we account for how texts and images convey meaning differently. We outline a set of core challenges concerning textual and visual ambiguity, as well as risks that may be amplified by ambiguous and underspecified elements. We propose and discuss strategies for addressing these challenges, including generating visually ambiguous images and generating a set of diverse images.
