Multimodal Data Processing Framework for Smart City: A Positional-Attention Based Deep Learning Approach

In the past few years, edge computing has brought tremendous convenience to the development of smart cities by offloading computation from the cloud to edge nodes. However, a series of problems, such as the explosive growth of smart devices and limited spectrum resources, still greatly limits the application of edge computing. Different types of end devices generate and collect multimodal information, and substantial volumes of data are transmitted to upper-layer nodes. Multimodal machine learning methods can process this data at the edge nodes so that only high-level features are uploaded to the cloud, saving bandwidth. In this article, we propose a novel multimodal data processing framework based on multiple attention mechanisms. Two distinct attention mechanisms capture inter- and intra-modality dependencies and align the different modalities with one another. We conduct experiments on image captioning, a core research problem in multimodal machine learning. A unified hierarchical structure extracts features from images and natural language, and matching attention aligns the visual and textual information. In addition, we propose a new attention mechanism, positional attention, which models the relationship between elements within a single sensory modality. The hierarchical structure enables parallel computation in the training phase and thus speeds up model training. Experiments and analysis demonstrate significant improvements over baselines, confirming the effectiveness of our method.
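The distinction between the two attention mechanisms can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the paper's implementation: it uses plain scaled dot-product attention (as in Transformer-style models), and the feature dimensions, sequence lengths, and variable names (`text`, `image`) are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys, values):
    """Scaled dot-product attention: weight each value by query-key similarity."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)    # (n_q, n_k) similarity matrix
    return softmax(scores, axis=-1) @ values  # (n_q, d) weighted sum of values

rng = np.random.default_rng(0)
text = rng.standard_normal((5, 8))    # 5 word features, dim 8 (hypothetical)
image = rng.standard_normal((3, 8))   # 3 region features, dim 8 (hypothetical)

# Intra-modality (positional-style) attention: each word attends to all words
# of the same sequence, capturing dependencies within one modality.
intra = attention(text, text, text)

# Inter-modality (matching-style) attention: each word attends to the image
# regions, aligning textual queries with visual keys/values.
inter = attention(text, image, image)

print(intra.shape, inter.shape)
```

Both calls return one attended feature per text position; only the source of the keys and values changes, which is what makes the same mechanism serve both alignment roles.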