Grammatically Recognizing Images with Tree Convolution

Similar to language, understanding an image can be considered as a hierarchical decomposition process from scenes to objects, parts, pixels, and the corresponding spatial/contextual relations. However, the existing convolutional networks concentrate on stacking redundant convolutional layers with a large number of kernels in a hierarchical organization to implicitly approximate this decomposition. This may limit the network to learn the semantic information conveyed in the internal feature maps that may reveal minor yet crucial differences for visual understanding. Attempting to tackle this problem, this paper proposes a simple yet effective tree convolution (TreeConv) operation for deep neural networks. Specifically, inspired by the image grammar techniques[73] that serve as a unified framework of object representation, learning, and recognition, our TreeConv designs a generative image grammar, i.e., tree generation rule, to parse the hierarchy of internal feature maps by generating tree structures and implicitly learning the specific visual grammars for each object category. Extensive experiments on a variety of benchmarks, i.e., classification (ImageNet / CIFAR), detection & segmentation (COCO 2017), and person re-identification (CUHK03), demonstrate the superiority of our TreeConv in both boosting the accuracy and reducing the computational cost. The source code will be available at: https://github.com/wanggrun/TreeConv.

[1]  Rogério Schmidt Feris,et al.  Big-Little Net: An Efficient Multi-Scale Feature Representation for Visual and Speech Recognition , 2018, ICLR.

[2]  Thomas S. Huang,et al.  Free-Form Image Inpainting With Gated Convolution , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[3]  Yiming Yang,et al.  DARTS: Differentiable Architecture Search , 2018, ICLR.

[4]  Vijay Vasudevan,et al.  Learning Transferable Architectures for Scalable Image Recognition , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[5]  Tao Wang,et al.  Convolutional Neural Networks over Tree Structures for Programming Language Processing , 2014, AAAI.

[6]  Shuiwang Ji,et al.  Graph Representation Learning via Hard and Channel-Wise Attention Networks , 2019, KDD.

[7]  Tony Lindeberg,et al.  Scale-Space Theory in Computer Vision , 1993, Lecture Notes in Computer Science.

[8]  Jianhuang Lai,et al.  Smoothing Adversarial Domain Attack and P-Memory Reconsolidation for Cross-Domain Person Re-Identification , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Song-Chun Zhu,et al.  Image Parsing with Stochastic Scene Grammar , 2011, NIPS.

[10]  Jun Chang,et al.  DAML: Dual Attention Mutual Learning between Ratings and Reviews for Item Recommendation , 2019, KDD.

[11]  Kun Gai,et al.  Learning Tree-based Deep Model for Recommender Systems , 2018, KDD.

[12]  Jian-Huang Lai,et al.  Spatial-Temporal Person Re-identification , 2018, AAAI.

[13]  Jiahuan Zhou,et al.  Towards a Unified Compositional Model for Visual Pattern Modeling , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[14]  Ross B. Girshick,et al.  Mask R-CNN , 2017, 1703.06870.

[15]  Liang Lin,et al.  Adaptively Connected Neural Networks , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Geoffrey K. Pullum,et al.  Generalized Phrase Structure Grammar , 1985 .

[17]  Xuanhui Wang,et al.  Combining Decision Trees and Neural Networks for Learning-to-Rank in Personal Search , 2019, KDD.

[18]  Liang Lin,et al.  Human Re-identification by Matching Compositional Template with Cluster Sampling , 2013, 2013 IEEE International Conference on Computer Vision.

[19]  Abhinav Gupta,et al.  Non-local Neural Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[20]  Shuicheng Yan,et al.  Drop an Octave: Reducing Spatial Redundancy in Convolutional Neural Networks With Octave Convolution , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[21]  Yifan Sun,et al.  SVDNet for Pedestrian Retrieval , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[22]  Qi Tian,et al.  Scalable Person Re-identification: A Benchmark , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[23]  Xiong Chen,et al.  Learning Discriminative Features with Multiple Granularities for Person Re-Identification , 2018, ACM Multimedia.

[24]  Stephen Lin,et al.  Local Relation Networks for Image Recognition , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[25]  Liang Zheng,et al.  Unsupervised Person Re-identification: Clustering and Fine-tuning , 2017 .

[26]  Kilian Q. Weinberger,et al.  Densely Connected Convolutional Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[28]  Meng Wang,et al.  Hierarchical Scene Parsing by Weakly Supervised Learning with Image Descriptions , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[29]  Feng Han,et al.  Bottom-Up/Top-Down Image Parsing with Attribute Grammar , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[30]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[31]  Aaron C. Courville,et al.  Neural Language Modeling by Jointly Learning Syntax and Lexicon , 2017, ICLR.

[32]  Aaron C. Courville,et al.  Ordered Neurons: Integrating Tree Structures into Recurrent Neural Networks , 2018, ICLR.

[33]  Li Fei-Fei,et al.  Progressive Neural Architecture Search , 2017, ECCV.

[34]  Quoc V. Le,et al.  EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks , 2019, ICML.

[35]  Quoc V. Le,et al.  Efficient Neural Architecture Search via Parameter Sharing , 2018, ICML.

[36]  Bolei Zhou,et al.  Learning Deep Features for Discriminative Localization , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Mark Sandler,et al.  MobileNetV2: Inverted Residuals and Linear Bottlenecks , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[38]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Bernhard Schölkopf,et al.  Discovering Causal Signals in Images , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  Wang Ling,et al.  Memory Architectures in Recurrent Neural Network Language Models , 2018, ICLR.

[41]  Tianfu Wu,et al.  AOGNets: Compositional Grammatical Architectures for Deep Learning , 2017, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[42]  Emmanuel Dupoux,et al.  Assessing the Ability of LSTMs to Learn Syntax-Sensitive Dependencies , 2016, TACL.

[43]  Enhua Wu,et al.  Squeeze-and-Excitation Networks , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[44]  Rui Zhang,et al.  Deep Structured Scene Parsing by Learning with Image Descriptions , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[45]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[46]  Liang Zheng,et al.  Re-ranking Person Re-identification with k-Reciprocal Encoding , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[47]  Han Zhang,et al.  Self-Attention Generative Adversarial Networks , 2018, ICML.

[48]  Hengjie Song,et al.  AKUPM: Attention-Enhanced Knowledge-Aware User Preference Model for Recommendation , 2019, KDD.

[49]  Yixin Cao,et al.  KGAT: Knowledge Graph Attention Network for Recommendation , 2019, KDD.

[50]  Yang Liu,et al.  Multi-view People Tracking via Hierarchical Trajectory Composition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[51]  Shaogang Gong,et al.  Person Re-identification by Deep Learning Multi-scale Representations , 2017, 2017 IEEE International Conference on Computer Vision Workshops (ICCVW).

[52]  Rui Yu,et al.  Divide and Fuse: A Re-ranking Approach for Person Re-identification , 2017, BMVC.

[53]  Yi Li,et al.  Deformable Convolutional Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[54]  Bhaskara Marthi,et al.  A generative vision model that trains with high data efficiency and breaks text-based CAPTCHAs , 2017, Science.

[55]  D. G. Hays Dependency Theory: A Formalism and Some Observations , 1964 .

[56]  Bo Chen,et al.  MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications , 2017, ArXiv.

[57]  Alok Aggarwal,et al.  Regularized Evolution for Image Classifier Architecture Search , 2018, AAAI.

[58]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[59]  Xiaogang Wang,et al.  DeepReID: Deep Filter Pairing Neural Network for Person Re-identification , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[60]  Shengcai Liao,et al.  Person re-identification by Local Maximal Occurrence representation and metric learning , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[61]  Jianhuang Lai,et al.  P2SNet: Can an Image Match a Video for Person Re-Identification in an End-to-End Way? , 2018, IEEE Transactions on Circuits and Systems for Video Technology.

[62]  Yi Yang,et al.  Random Erasing Data Augmentation , 2017, AAAI.

[63]  Xiangyu Zhang,et al.  ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[64]  Stephen Lin,et al.  Deformable ConvNets V2: More Deformable, Better Results , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[65]  Jianhuang Lai,et al.  Weakly Supervised Person Re-ID: Differentiable Graphical Learning and a New Benchmark , 2019, IEEE Transactions on Neural Networks and Learning Systems.

[66]  Priyadarshini Panda,et al.  Tree-CNN: A hierarchical Deep Convolutional Neural Network for incremental learning , 2018, Neural Networks.

[67]  Ying Wu,et al.  Deeply Learned Compositional Models for Human Pose Estimation , 2018, ECCV.

[68]  Stella X. Yu,et al.  Multigrid Neural Architectures , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[69]  Narayan Bhamidipati,et al.  Understanding Consumer Journey using Attention based Recurrent Neural Networks , 2019, KDD.

[70]  Xiaonan Luo,et al.  Learning a Wavelet-like Auto-Encoder to Accelerate Deep Neural Networks , 2017, AAAI.

[71]  François Chollet,et al.  Xception: Deep Learning with Depthwise Separable Convolutions , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).