Pictionary-Style Word Guessing on Hand-Drawn Object Sketches: Dataset, Analysis and Deep Network Models

The ability of intelligent agents to play games in a human-like fashion is popularly considered a benchmark of progress in Artificial Intelligence. In our work, we introduce the first computational model aimed at Pictionary, the popular word-guessing social game. We first introduce Sketch-QA, a guessing task styled after Pictionary that uses incrementally accumulated sketch stroke sequences as visual data. Sketch-QA involves asking a fixed question (“What object is being drawn?”) and gathering open-ended guess-words from human guessers. We analyze the resulting dataset and report our findings. To mimic Pictionary-style guessing, we propose a deep neural model that generates guess-words in response to temporally evolving human-drawn object sketches. The model also makes human-like mistakes while guessing, which makes its behavior more closely resemble that of a human guesser. We evaluate our model on the large-scale guess-word dataset generated via the Sketch-QA task and compare it with various baselines. We also conduct a Visual Turing Test to obtain human impressions of the guess-words generated by humans and by our model. Experimental results demonstrate the promise of our approach for Pictionary and similarly themed games.
