Multi-Modal Question-Answering: Questions without Keyboards

This paper describes our work to let players in a virtual world pose questions without relying on textual input. Our approach is to create enhanced virtual photographs by annotating them with semantic information drawn from the 3D environment's scene graph. The player can then use these annotated photos to interact with inhabitants of the world through automatically generated queries that are guaranteed to be relevant, grammatical, and unambiguous. While the range of queries is more limited than a text-input system would permit, in the gaming environment we are exploring these limitations are offset by the practical concerns that make text input inappropriate.
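The pipeline sketched above can be illustrated in miniature. The sketch below is a minimal, hypothetical illustration (none of these names come from the paper): a region of a virtual photograph is tagged with a scene-graph node, and fixed question templates are instantiated from that node, which is what makes the resulting queries relevant, grammatical, and unambiguous by construction.

```python
from dataclasses import dataclass

@dataclass
class SceneNode:
    # Hypothetical scene-graph entry for an object visible in the photo.
    name: str      # display name, e.g. "dragon gate"
    category: str  # semantic type, e.g. "landmark"

@dataclass
class PhotoAnnotation:
    # Links a 2D region of the virtual photograph to a scene-graph node.
    bbox: tuple[int, int, int, int]  # (x, y, w, h) in image coordinates
    node: SceneNode

def generate_queries(annotation: PhotoAnnotation) -> list[str]:
    """Instantiate fixed question templates from the annotated object.

    Relevance comes from the object appearing in the photo, grammaticality
    from the templates, and unambiguity from naming the exact entity.
    """
    n = annotation.node
    return [
        f"What is the {n.name}?",
        f"Where can I find another {n.category} like the {n.name}?",
    ]

# Example: the player photographed a landmark and taps its region.
tag = PhotoAnnotation(bbox=(120, 40, 200, 160),
                      node=SceneNode(name="dragon gate", category="landmark"))
for q in generate_queries(tag):
    print(q)
```

A real system would select among many templates per object category; the point of the sketch is only that query generation is a lookup plus template fill, requiring no keyboard input from the player.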