PatchGame: Learning to Signal Mid-level Patches in Referential Games

We study a referential game (a type of signaling game) in which two agents communicate through a discrete bottleneck to achieve a common goal. In our game, the speaker's goal is to compose a message, or symbolic representation, of the “important” image patches, while the listener's task is to match the speaker’s message to a different view of the same image. We show that the two agents can indeed develop a communication protocol without explicit or implicit supervision. We further investigate the learned protocol and demonstrate two applications: speeding up recent Vision Transformers by using only the important patches, and pre-training for downstream recognition tasks (e.g., classification).
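To make the data flow of the game concrete, the following is a minimal toy sketch and not the method of the paper. It assumes hand-written stand-ins for every learned component: the squared norm of a patch replaces a learned importance score, a fixed nearest-neighbor codebook replaces the learned discrete bottleneck, and symbol overlap replaces the listener's learned matching. All names (`speak`, `listen`, `second_view`, `codebook`) and the toy data are hypothetical.

```python
import random


def nearest_symbol(patch, codebook):
    """Discrete bottleneck: index of the closest codebook vector."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(patch, codebook[i]))


def speak(patches, codebook, k):
    """Speaker: keep the k highest-scoring patches and emit one symbol each."""
    # Assumption: squared norm stands in for a learned importance score.
    ranked = sorted(patches, key=lambda p: sum(x * x for x in p), reverse=True)
    return [nearest_symbol(p, codebook) for p in ranked[:k]]


def listen(message, candidates, codebook):
    """Listener: pick the candidate view whose patch symbols best cover the message."""
    def overlap(view):
        symbols = {nearest_symbol(p, codebook) for p in view}
        return sum(1 for s in message if s in symbols)
    return max(range(len(candidates)), key=lambda i: overlap(candidates[i]))


def second_view(patches, rng, noise=0.01):
    """Hypothetical augmentation: jitter each patch slightly and shuffle the order."""
    view = [[x + rng.uniform(-noise, noise) for x in p] for p in patches]
    rng.shuffle(view)
    return view


rng = random.Random(0)

# Toy 2-D "patch features"; symbol 0 plays the role of uninformative background.
codebook = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
images = [
    [[2.0, 0.0], [1.9, 0.1], [0.0, 0.0], [0.1, 0.0]],
    [[0.0, 2.0], [0.1, 1.9], [0.0, 0.0], [0.0, 0.1]],
    [[2.0, 2.0], [1.9, 1.9], [0.0, 0.0], [0.1, 0.1]],
]

# The listener only ever sees a different view of each image.
views = [second_view(img, rng) for img in images]

# For each image, the speaker sends a 2-symbol message and the listener guesses.
guesses = [listen(speak(img, codebook, k=2), views, codebook) for img in images]
print(guesses)  # [0, 1, 2] for this toy data
```

In this sketch the message carries only the symbols of the two highest-scoring patches, so the listener succeeds without seeing the background patches. In the actual game, the scoring, the symbol assignment, and the matching would all be learned jointly rather than fixed by hand.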
