Paper information - FS-COCO: Towards Understanding of Freehand Sketches of Common Objects in Context

We advance sketch research to scenes with the first dataset of freehand scene sketches, FS-COCO. With practical applications in mind, we collect sketches that convey scene content well but can be drawn within a few minutes by a person of any sketching skill level. Our dataset comprises 10,000 freehand scene vector sketches with per-point space-time information, drawn by 100 non-expert individuals and offering both object- and scene-level abstraction. Each sketch is augmented with its text description. Using our dataset, we study for the first time the problem of fine-grained image retrieval from freehand scene sketches and sketch captions. We draw insights on: (i) scene salience encoded in sketches through the temporal order of strokes; (ii) performance comparison of image retrieval from a scene sketch and from an image caption; (iii) complementarity of information in sketches and image captions, as well as the potential benefit of combining the two modalities. In addition, we extend a popular LSTM-based vector sketch encoder to handle sketches of greater complexity than was supported by previous work. Namely, we propose a hierarchical sketch decoder, which we leverage in a sketch-specific "pre-text" task. Our dataset enables for the first time research on freehand scene sketch understanding and its practical applications.
