Visual question answering model for fruit tree disease decision-making based on multimodal deep learning

Visual Question Answering (VQA) about diseases is an essential feature of intelligent management in smart agriculture. Current deep learning research on fruit tree diseases relies mainly on single-source data, such as visible images or spectral data, and yields classification and identification results that cannot be used directly in practical agricultural decision-making. In this study, a VQA model for fruit tree diseases based on multimodal feature fusion was designed. By fusing images with Q&A knowledge of disease management, the model answers questions posed about fruit tree disease images by locating the relevant diseased image regions and returning a decision-making answer. The main contributions of this study were as follows: (1) a multimodal bilinear factorized pooling model using Tucker decomposition was proposed to fuse the image features with the question features; (2) a deep modular co-attention architecture was explored to learn image attention and question attention simultaneously, obtaining richer graphical features and interactivity. The experiments showed that the proposed unified model, which combines the bilinear model and co-attentive learning in a new network architecture, obtained 86.36% accuracy in decision-making under the condition of limited data (8,450 images and 4,560k Q&A pairs of data), outperforming existing multimodal methods. Data augmentation was applied to the training set to avoid overfitting, and ten runs of 10-fold cross-validation were used to report unbiased performance estimates. The proposed multimodal fusion model achieved user-friendly interaction together with fine-grained identification and decision-making performance. Thus, the model can be widely deployed in intelligent agriculture.
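The Tucker-based bilinear fusion named in contribution (1) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `tucker_fusion`, the weight names, and all dimensions are hypothetical, the weights are random rather than trained, and the nonlinearities, dropout, and attention layers a real model would use are omitted. It shows only the core idea that a full question-image-answer interaction tensor is replaced by two input projections, a small core tensor, and an output projection.

```python
import numpy as np

def tucker_fusion(q, v, W_q, W_v, core, W_o):
    """Fuse a question vector q and an image vector v through a
    Tucker-factorized bilinear map (illustrative sketch only)."""
    q_proj = q @ W_q  # project question features, shape (t_q,)
    v_proj = v @ W_v  # project image features, shape (t_v,)
    # Bilinear interaction through the small core tensor:
    # z[k] = sum_i sum_j q_proj[i] * v_proj[j] * core[i, j, k]
    z = np.einsum("i,j,ijk->k", q_proj, v_proj, core)
    return z @ W_o  # map to answer scores, shape (n_answers,)

# Hypothetical sizes chosen only for illustration.
d_q, d_v = 32, 64          # question / image feature dimensions
t_q, t_v, t_o = 8, 8, 6    # core tensor dimensions
n_answers = 5              # number of candidate answers

rng = np.random.default_rng(0)
W_q = rng.normal(size=(d_q, t_q))
W_v = rng.normal(size=(d_v, t_v))
core = rng.normal(size=(t_q, t_v, t_o))
W_o = rng.normal(size=(t_o, n_answers))

q = rng.normal(size=d_q)   # stand-in for an encoded question
v = rng.normal(size=d_v)   # stand-in for an encoded image region
scores = tucker_fusion(q, v, W_q, W_v, core, W_o)
print(scores.shape)
```

When the core dimensions are small relative to the input dimensions, the factored form stores far fewer parameters than the full bilinear tensor of shape (d_q, d_v, n_answers) while computing the same kind of pairwise interaction between question and image features.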
