Implications of the Convergence of Language and Vision Model Geometries

Large-scale pretrained language models (LMs) are said to "lack the ability to connect [their] utterances to the world" (Bender and Koller, 2020). If so, we would expect LM representations to be unrelated to representations in computer vision models. To investigate this, we present an empirical evaluation across three different LMs (BERT, GPT-2, and OPT) and three computer vision models (VMs): ResNet, SegFormer, and MAE. Our experiments show that LMs converge towards representations that are partially isomorphic to those of VMs, with dispersion and polysemy both factoring into the alignability of vision and language spaces. We discuss the implications of this finding.
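
The abstract does not spell out how alignability is measured, but given the orthogonal Procrustes reference [43], a minimal sketch of one standard way to quantify partial isomorphism between two representation spaces might look as follows. The function name `alignability`, the seed/held-out split, and precision@1 as the retrieval metric are illustrative assumptions, not the paper's exact protocol.

```python
# Hypothetical sketch: fit an orthogonal map from LM space to VM space on a
# seed dictionary of concepts, then test nearest-neighbour retrieval on
# held-out concepts (in the spirit of Schönemann's 1966 Procrustes solution [43]).
import numpy as np
from scipy.linalg import orthogonal_procrustes

def alignability(lm_vecs: np.ndarray, vm_vecs: np.ndarray, n_train: int = 1000) -> float:
    """lm_vecs, vm_vecs: row-aligned matrices (concept i -> row i) with the
    same dimensionality. Returns precision@1 on the held-out concepts."""
    # Length-normalise so cosine similarity reduces to a dot product.
    lm = lm_vecs / np.linalg.norm(lm_vecs, axis=1, keepdims=True)
    vm = vm_vecs / np.linalg.norm(vm_vecs, axis=1, keepdims=True)

    # Fit the orthogonal map W that best rotates seed LM vectors onto VM vectors.
    W, _ = orthogonal_procrustes(lm[:n_train], vm[:n_train])

    # On held-out concepts, check whether the mapped LM vector retrieves the
    # matching VM vector as its nearest neighbour.
    mapped = lm[n_train:] @ W
    sims = mapped @ vm[n_train:].T          # cosine similarity matrix
    predicted = sims.argmax(axis=1)
    return float((predicted == np.arange(len(predicted))).mean())
```

Higher precision@1 under such a map would indicate that the two spaces are (partially) isomorphic up to rotation; factors like dispersion and polysemy would be expected to modulate this score.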

[1] Manu Srinath Halvagal et al. The combination of Hebbian and predictive plasticity learns invariant object representations in deep sensory networks, 2023, bioRxiv.

[2] Alexander G. Huth et al. Predictive Coding or Just Feature Discovery? An Alternative Account of Why Language Models Fit Brain Data, 2022, Neurobiology of Language.

[3] Ellie Pavlick et al. Linearly Mapping from Image to Text Space, 2022, ICLR.

[4] S. Piantadosi et al. Meaning without reference in large language models, 2022, ArXiv.

[5] J. King et al. Brains and algorithms partially converge in natural language processing, 2022, Communications Biology.

[6] Ross B. Girshick et al. Masked Autoencoders Are Scalable Vision Learners, 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[7] Shachar Mirkin et al. Emergent Structures and Training Dynamics in Large Language Models, 2022, BIGSCIENCE.

[8] Alexandre Gramfort et al. Long-range and hierarchical language predictions in brains and algorithms, 2021, ArXiv.

[9] Anders Søgaard et al. Can Language Models Encode Perceptual Structure Without Grounding? A Case Study in Color, 2021, CoNLL.

[10] Anders Sandholm et al. Analogy Training Multilingual Encoders, 2021, AAAI.

[11] Magnus Sahlgren et al. The Singleton Fallacy: Why Current Critiques of Language Models Miss the Point, 2021, Frontiers in Artificial Intelligence.

[12] Charles Foster et al. The Pile: An 800GB Dataset of Diverse Text for Language Modeling, 2020, ArXiv.

[13] Mary Williamson et al. Recipes for Building an Open-Domain Chatbot, 2020, EACL.

[14] B. Lake et al. Self-supervised learning through the eyes of a child, 2020, NeurIPS.

[15] Emily M. Bender et al. Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data, 2020, ACL.

[16] Omer Levy et al. Emergent linguistic structure in artificial neural networks trained by self-supervision, 2020, Proceedings of the National Academy of Sciences.

[17] Jeremy Blackburn et al. The Pushshift Reddit Dataset, 2020, ICWSM.

[18] Natalia Gimelshein et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library, 2019, NeurIPS.

[19] Ming-Wei Chang et al. Well-Read Students Learn Better: On the Importance of Pre-training Compact Models, 2019.

[20] Omer Levy et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019, ArXiv.

[21] Gosse Minnema et al. From Brain Space to Distributional Space: The Perilous Journeys of fMRI Decoding, 2019, ACL.

[22] Ilya Sutskever et al. Language Models are Unsupervised Multitask Learners, 2019.

[23] Ming-Wei Chang et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[24] Jonas Kubilius et al. Brain-Score: Which Artificial Neural Network for Object Recognition is most Brain-Like?, 2018, bioRxiv.

[25] Anders Søgaard et al. Why is unsupervised alignment of English embeddings from different algorithms so hard?, 2018, EMNLP.

[26] Quoc V. Le et al. A Simple Method for Commonsense Reasoning, 2018, ArXiv.

[27] Eneko Agirre et al. A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings, 2018, ACL.

[28] Anders Søgaard et al. On the Limitations of Unsupervised Bilingual Dictionary Induction, 2018, ACL.

[29] Lior Wolf et al. An Iterative Closest Point Method for Unsupervised Word Translation, 2018, ArXiv.

[30] Anders Søgaard et al. Limitations of Cross-Lingual Learning from Image Search, 2017, Rep4NLP@ACL.

[31] Marie-Francine Moens et al. Multi-Modal Representations for Improved Bilingual Lexicon Learning, 2016, ACL.

[32] Jian Sun et al. Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33] Stephen Clark et al. Visual Bilingual Lexicon Induction with Transferred ConvNet Features, 2015, EMNLP.

[34] Michael S. Bernstein et al. ImageNet Large Scale Visual Recognition Challenge, 2014, International Journal of Computer Vision.

[35] Léon Bottou et al. Learning Image Embeddings using Convolutional Neural Networks for Improved Multi-Modal Semantics, 2014, EMNLP.

[36] Angeliki Lazaridou et al. Is this a wampimuk? Cross-modal mapping between distributional semantics and the visual world, 2014, ACL.

[37] Simone Paolo Ponzetto et al. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network, 2012, Artificial Intelligence.

[38] Benjamin Van Durme et al. Learning Bilingual Lexicons Using the Visual Similarity of Labeled Web Images, 2011, IJCAI.

[39] Alexandros Nanopoulos et al. Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data, 2010, Journal of Machine Learning Research.

[40] S. Harnad. Symbol grounding problem, 1990, Scholarpedia.

[41] P. Lodge et al. Stepping Back Inside Leibniz's Mill, 1998.

[42] John R. Searle. Minds, brains, and programs, 1980, Behavioral and Brain Sciences.

[43] P. Schönemann. A generalized solution of the orthogonal Procrustes problem, 1966.