Mining a Deep And-OR Object Semantics from Web Images via Cost-Sensitive Question-Answer-Based Active Annotations

This paper presents a cost-sensitive active Question-Answering (QA) framework for learning a nine-layer And-Or graph (AOG) from web images. The AOG explicitly represents object categories, poses/viewpoints, parts, and detailed structures within the parts in a compositional hierarchy. The QA framework is designed to minimize an overall risk, which trades off the loss against the query cost. The loss is defined for nodes in all layers of the AOG and includes a generative loss (measuring the likelihood of the images) and a discriminative loss (measuring the fit to human answers). The cost comprises both the human labor of answering questions and the computational cost of model learning. The cost-sensitive QA framework iteratively selects different storylines of questions to update different nodes in the AOG. Experiments showed that our method required much less human supervision (e.g., labeling parts on 3 to 10 training objects for each category) and achieved better performance than baseline methods.
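The risk trade-off described above can be illustrated with a minimal sketch. It is not the paper's algorithm: the `Storyline` class, its fields, the `cost_weight` parameter, the budget, and the greedy rule are all assumptions made for illustration. The sketch only shows the general idea of picking a question storyline when its estimated loss reduction outweighs its combined human and computational cost.

```python
from dataclasses import dataclass


@dataclass
class Storyline:
    """Hypothetical candidate sequence of questions targeting one AOG node."""
    name: str
    expected_loss_drop: float  # assumed estimate of the loss reduction
    human_cost: float          # assumed labor cost of answering the questions
    compute_cost: float        # assumed cost of re-learning the model

    @property
    def total_cost(self) -> float:
        return self.human_cost + self.compute_cost


def net_gain(s: Storyline, cost_weight: float = 1.0) -> float:
    """Estimated risk reduction: loss drop minus weighted query cost."""
    return s.expected_loss_drop - cost_weight * s.total_cost


def select_storylines(candidates, budget, cost_weight=1.0):
    """Greedily pick affordable storylines with the largest positive net gain."""
    chosen = []
    remaining = list(candidates)
    spent = 0.0
    while remaining:
        affordable = [s for s in remaining if spent + s.total_cost <= budget]
        if not affordable:
            break
        best = max(affordable, key=lambda s: net_gain(s, cost_weight))
        if net_gain(best, cost_weight) <= 0:
            break  # asking more questions would no longer reduce the risk
        chosen.append(best)
        spent += best.total_cost
        remaining.remove(best)
    return chosen


if __name__ == "__main__":
    candidates = [
        Storyline("label_pose", 5.0, 1.0, 0.5),
        Storyline("label_part", 4.0, 2.0, 0.5),
        Storyline("check_structure", 1.0, 1.0, 0.5),
    ]
    for s in select_storylines(candidates, budget=10.0):
        print(s.name, net_gain(s))
```

In this toy setup the third storyline is never selected because its assumed cost exceeds its assumed loss reduction. In practice the loss reduction would have to be estimated from the model, which this sketch does not attempt.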
