Abstract

One of the most ambitious goals of AI is to develop agents that are able to communicate with humans. While many existing systems are already capable of producing human-like utterances, they often focus on learning the structural properties of language and miss the utilitarian and functional aspects of communication, i.e., that humans use words to coordinate with others and make things happen in the world. In this work, we investigate if and how we could use multi-agent interactions (between an agent and a user simulator) as a building block for learning natural language use, and how to harness the structural knowledge of language that is easily extractable from large collections of text using language models.

1 Introduction

One of the most ambitious goals of AI is to develop intelligent agents that are able to communicate with humans. Communication and interaction should therefore be at the core of these agents' language learning process. However, traditional machine learning approaches to language learning [14, 17, 18] are dissociated from communication: they are based on static, passive, and mainly supervised (or self-supervised) regimes that focus on learning the structural properties of language from corpora. While this is a great way to learn general statistical single-modality associations between symbols (e.g., the fact that adjectives come before nouns and after determiners), or even multi-modal associations between symbols and things in the world (e.g., the fact that the word cat refers to the furry animal with four legs), it misses the functional aspects of communication, i.e., that humans use words to coordinate with others and make things happen in the real world [1, 3, 20].

One way to place communication at the core of learning is to cast functional language learning (i.e., learning to communicate grounded in a goal) as a supervised learning task and collect language data grounded in that particular goal. However, this would require collecting data for every potential language use that we would want our agent to be able to communicate about. Motivated by this, previous research [12, 11] has focused on letting a communication protocol emerge in a completely utilitarian framework, implemented within a multi-agent setup where agents learn to communicate in order to maximize a task reward. While this purely utilitarian framework results in agents that successfully learn to solve the task by creating a communication protocol, the emergent protocols bear (at best) very little resemblance to natural language, casting doubt on this type of functional learning as a viable alternative route to language learning. Thus, it becomes clear that neither framework on its own is completely adequate for learning language use.
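To make the purely utilitarian setup concrete, the sketch below shows a minimal referential game in which a speaker and a listener are trained from task reward alone using REINFORCE [12]. This is our own toy illustration rather than the setup used in this work; the single-symbol channel, architectures, and sizes are all simplifying assumptions.

```python
import torch
import torch.nn as nn

N_OBJECTS, VOCAB, BATCH = 10, 8, 32

speaker = nn.Linear(N_OBJECTS, VOCAB)        # target object -> logits over symbols
listener = nn.Bilinear(VOCAB, N_OBJECTS, 1)  # (symbol, candidate) -> compatibility
opt = torch.optim.Adam([*speaker.parameters(), *listener.parameters()], lr=1e-2)
objects = torch.eye(N_OBJECTS)               # toy "world": one-hot objects

for step in range(1000):
    target_ids = torch.randint(N_OBJECTS, (BATCH,))
    targets = objects[target_ids]

    # Speaker samples a discrete symbol for each target (the "message").
    sym_dist = torch.distributions.Categorical(logits=speaker(targets))
    symbols = sym_dist.sample()
    sym_onehot = torch.eye(VOCAB)[symbols]

    # Listener scores every candidate object against the symbol and points
    # at the one it believes the speaker meant.
    scores = torch.stack(
        [listener(sym_onehot, objects[i].repeat(BATCH, 1)).squeeze(-1)
         for i in range(N_OBJECTS)], dim=-1)
    choice_dist = torch.distributions.Categorical(logits=scores)
    choices = choice_dist.sample()

    # The only learning signal: 1 if the listener picked the target, else 0.
    # (For clarity there is no reward baseline; real implementations add one.)
    reward = (choices == target_ids).float()
    loss = -(reward * (sym_dist.log_prob(symbols)
                       + choice_dist.log_prob(choices))).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```

Note that nothing in this objective ties the emergent symbols to any English word; the agents are free to settle on an arbitrary code, which is precisely why such protocols bear little resemblance to natural language.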
Instead, in this work we propose to decompose the problem of learning language use into two components: learning "what" to say based on a given situation, and learning "how" to say it. The "what" is, for us, the essence of communication that underlies our intentions; it is chosen by maximizing a given utility, which can be anything, making it a functional, utility-driven process. The "how", on the other hand, is the surface realization of our intentions, i.e., the words we use to communicate the "what" successfully. Since our goal is to communicate with humans, particular constraints govern the form of the "how" so that it is understandable by humans, i.e., the structural properties of natural language relating, among others, to grammaticality and fluency.

This factorization into content planning (here, the "what") and surface realization (here, the "how"), which yields meaning representations amenable to reinforcement learning, moves away from end-to-end neural generation systems and is in line with more traditional views of natural language generation [16]. Under this factorization, generic language data no longer have to serve as the gold standard for functional language learning (which, as explained above, is problematic); instead, they can be used effectively as a prior model of language, encapsulating the intrinsic structural knowledge of language. In other words, language data are used only for the "how". Multi-agent interactions that provide task rewards for the task of interest, in turn, are used only for the functional learning of language use. This combination of functional and structural learning guarantees that, in theory, the communication emerging from multi-agent interactions will be grounded in natural language semantics, bringing us closer to learning natural language. In this work, we present preliminary results of implementing this factorization of language use into "what" and "how", along with effective ways to combine functional (i.e., learning in the context of communicating with another agent so as to achieve a particular goal) and structural (i.e., traditional supervised learning of language) language learning.
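As a minimal illustration of how the two learning signals can be combined at generation time, the sketch below scores candidate surface realizations with a functional term plus a structural term. Both scoring functions are hypothetical stand-ins of our own (a keyword check in place of a trained listener's success probability, and a crude fluency proxy in place of log-probability under a pretrained language model), and the candidate set and trade-off weight lam are toy assumptions, not the system described in this work.

```python
import math

def task_utility(candidate: str, goal: str) -> float:
    """Stand-in for the functional "what" term: how likely the utterance is
    to make the right thing happen. Here a crude keyword check; in a real
    system, a trained listener model."""
    return 1.0 if goal in candidate else 0.0

FUNCTION_WORDS = {"the", "is", "to", "a", "of", "on"}

def lm_log_prob(candidate: str) -> float:
    """Stand-in for the structural "how" term: log p(candidate) under a
    language model trained on generic text. Here a toy proxy that rewards
    English function words as a signal of grammatical surface form."""
    words = candidate.lower().split()
    hits = sum(w in FUNCTION_WORDS for w in words)
    return math.log((1 + hits) / (1 + len(words)))

def choose_utterance(candidates, goal, lam=1.0):
    """Pick the surface form that maximizes task utility under a
    natural-language prior: functional term + lam * structural term."""
    return max(candidates, key=lambda c: task_utility(c, goal) + lam * lm_log_prob(c))

candidates = [
    "blue square left",                # protocol-like: useful but not fluent
    "the blue square is on the left",  # fluent and useful
    "there is a square",               # fluent but misses the goal
]
print(choose_utterance(candidates, goal="blue"))
# -> "the blue square is on the left"
```

A reranking rule like this is only one way to realize the factorization; the same decomposition supports training the "what" module with task rewards from multi-agent interaction while keeping the language-model prior fixed.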
References

[1] J. L. Austin. How to Do Things with Words. 1962.
[2] Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, 1997.
[3] Daniel Fried et al. Speaker-Follower Models for Vision-and-Language Navigation. NeurIPS, 2018.
[4] Angeliki Lazaridou et al. Emergence of Linguistic Communication from Referential Games with Symbolic and Pixel Input. ICLR, 2018.
[5] Serhii Havrylov and Ivan Titov. Emergence of Language with Multi-agent Games: Learning to Communicate with Sequences of Symbols. NIPS, 2017.
[6] Ehud Reiter and Robert Dale. Building applied natural language generation systems. Natural Language Engineering, 1997.
[7] C. Lawrence Zitnick and Devi Parikh. Bringing Semantics into Focus Using Visual Abstraction. CVPR, 2013.
[8] Laura Graesser et al. Emergent Linguistic Phenomena in Multi-Agent Communication Games. EMNLP, 2019.
[9] Kaiming He et al. Deep Residual Learning for Image Recognition. CVPR, 2016.
[10] Herbert H. Clark. Using Language. 1996.
[11] Will Monroe and Christopher Potts. Learning in the Rational Speech Acts Model. arXiv, 2015.
[12] Ronald J. Williams. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 1992.
[13] Ludwig Wittgenstein. Philosophical Investigations. 1953.
[14] Tomáš Mikolov et al. Recurrent neural network based language model. INTERSPEECH, 2010.
[15] Diane Bouchacourt and Marco Baroni. How agents see things: On visual representations in an emergent language game. EMNLP, 2018.
[16] Oriol Vinyals and Quoc V. Le. A Neural Conversational Model. arXiv, 2015.
[17] Tsung-Yi Lin et al. Microsoft COCO: Common Objects in Context. ECCV, 2014.
[18] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to Sequence Learning with Neural Networks. NIPS, 2014.
[19] Reuben Cohn-Gordon et al. Pragmatically Informative Image Captioning with Character-Level Inference. NAACL, 2018.
[20] Angeliki Lazaridou, Alexander Peysakhovich, and Marco Baroni. Multi-Agent Cooperation and the Emergence of (Natural) Language. ICLR, 2017.
[21] Katrina Evtimova et al. Emergent Communication in a Multi-Modal, Multi-Step Referential Game. ICLR, 2018.