FS-COCO: Towards Understanding of Freehand Sketches of Common Objects in Context

We advance sketch research to scenes with the first dataset of freehand scene sketches, FS-COCO. With practical applications in mind, we collect sketches that convey scene content well but can be drawn within a few minutes by a person of any sketching skill level. Our dataset comprises 10,000 freehand scene vector sketches with per-point space-time information, drawn by 100 non-expert individuals, offering both object- and scene-level abstraction. Each sketch is augmented with its text description. Using our dataset, we study for the first time the problem of fine-grained image retrieval from freehand scene sketches and sketch captions. We draw insights on: (i) scene salience encoded in sketches through the temporal order of strokes; (ii) performance comparison of image retrieval from a scene sketch and from an image caption; (iii) complementarity of information in sketches and image captions, as well as the potential benefit of combining the two modalities. In addition, we extend a popular LSTM-based vector sketch encoder to handle sketches of greater complexity than was supported by previous work. Namely, we propose a hierarchical sketch decoder, which we leverage in a sketch-specific "pre-text" task. Our dataset enables, for the first time, research on freehand scene sketch understanding and its practical applications.
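To make "per-point space-time information" and "temporal order of strokes" concrete, the following is a minimal sketch of how such data might be held in memory and ordered by drawing time. The `(x, y, t)` point layout, the `Stroke` class, and the helpers `strokes_in_drawing_order` and `first_strokes` are all hypothetical names chosen for illustration; they are not the dataset's actual file format or API, which the abstract does not specify.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical point layout: (x, y, t) = position plus timestamp.
Point = Tuple[float, float, float]


@dataclass
class Stroke:
    """One pen-down to pen-up segment of a vector sketch."""
    points: List[Point]

    @property
    def start_time(self) -> float:
        # Earliest timestamp among the stroke's points.
        return min(t for _, _, t in self.points)


def strokes_in_drawing_order(strokes: List[Stroke]) -> List[Stroke]:
    """Return the strokes sorted by the time each one was started."""
    return sorted(strokes, key=lambda s: s.start_time)


def first_strokes(strokes: List[Stroke], fraction: float) -> List[Stroke]:
    """Keep the earliest `fraction` of strokes (at least one if any exist)."""
    ordered = strokes_in_drawing_order(strokes)
    keep = max(1, math.ceil(len(ordered) * fraction))
    return ordered[:keep]


if __name__ == "__main__":
    sketch = [
        Stroke([(0.0, 0.0, 5.0), (1.0, 1.0, 6.0)]),
        Stroke([(2.0, 2.0, 0.0), (3.0, 3.0, 1.0)]),
        Stroke([(4.0, 4.0, 2.5)]),
        Stroke([(5.0, 5.0, 9.0)]),
    ]
    for stroke in first_strokes(sketch, 0.5):
        print(stroke.start_time, len(stroke.points))
```

Truncating a sketch to its earliest strokes in this way is one possible probe of how much scene content the first part of a drawing carries; it is an assumption about how temporal order could be used, not a description of the authors' analysis.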
