Scones: towards conversational authoring of sketches

Iteratively refining and critiquing sketches are crucial steps in developing effective designs. We introduce Scones, a mixed-initiative, machine-learning-driven system that enables users to iteratively author sketches from text instructions. Scones uses deep learning to iteratively generate scenes of sketched objects composed according to semantic specifications expressed in natural language. It exceeds state-of-the-art performance on a text-based scene modification task and introduces a mask-conditioned sketching model that can generate sketches with poses specified by high-level scene information. In an exploratory user evaluation, participants reported enjoying an iterative drawing task with Scones and suggested additional features for further applications. We believe Scones is an early step towards automated, intelligent systems that support human-in-the-loop applications for communicating ideas through sketching in art and design.
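
The abstract describes a two-stage pipeline: a scene composition step that turns a natural-language instruction into high-level object specifications, and a mask-conditioned sketching model that renders each object as strokes with a pose implied by that scene information. The snippet below is a minimal, hypothetical Python sketch of that interface only; the names (SceneObject, compose_scene, sketch_object), the toy placement rule, and the placeholder stroke output are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the two-stage pipeline described in the abstract.
# All names and logic here are illustrative stand-ins, not the actual Scones API.
from dataclasses import dataclass
from typing import List, Tuple

Stroke = List[Tuple[float, float, int]]  # (dx, dy, pen_lifted) offsets


@dataclass
class SceneObject:
    label: str                     # e.g. "sun", "tree"
    position: Tuple[float, float]  # normalized scene coordinates
    size: float                    # relative scale
    flipped: bool = False          # coarse pose information


def compose_scene(scene: List[SceneObject], instruction: str) -> List[SceneObject]:
    """Stage 1 (stand-in): update object placements from a text instruction.

    In the described system this is a learned model evaluated on a text-based
    scene modification task; here a toy rule only illustrates the interface.
    """
    if "sun" in instruction.lower() and not any(o.label == "sun" for o in scene):
        scene = scene + [SceneObject("sun", position=(0.85, 0.1), size=0.2)]
    return scene


def sketch_object(obj: SceneObject, mask_size: int = 64) -> Stroke:
    """Stage 2 (stand-in): generate strokes conditioned on the object's mask/pose.

    A mask-conditioned sketching model would decode strokes from a rasterized
    object mask; this placeholder emits a square outline scaled to the object.
    """
    s = obj.size * mask_size
    return [(s, 0.0, 0), (0.0, s, 0), (-s, 0.0, 0), (0.0, -s, 1)]


if __name__ == "__main__":
    scene = compose_scene([], "add a sun in the top-right corner")
    for obj in scene:
        print(obj.label, sketch_object(obj))
```

In an interactive loop, each user instruction would be fed through `compose_scene` to revise the object specifications before re-rendering strokes, which mirrors the iterative authoring workflow the abstract describes.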
