PinterNet: A thematic label curation tool for large image datasets

Recent progress in big data and in computer vision with deep learning models has attracted considerable attention. Deep learning has been applied to tasks such as image classification, object detection, image segmentation, image captioning, and visual question answering, all of which rely on large collections of annotated images. This calls for more curated large image datasets with clearer descriptions, cleaner contents, and more diverse usability. However, curating and labeling such datasets can be labor-intensive. In this paper, we present PinterNet, an algorithm for automatic curation and label generation from noisy textual descriptions, and we publish a large image dataset containing over 110K images automatically labeled with their themes. Our dataset is hierarchical in nature: it has high-level category information, which we refer to as verticals, with fine-grained thematic labels at the lower level. This motivates a new type of hierarchical theme classification problem that is closer to human cognition and has business value. We provide benchmark performance for this novel task and new data using deep learning models based on the AlexNet architecture with different pre-training schemes.
