Multi-Modal Chatbot in Intelligent Manufacturing

Artificial intelligence (AI) is now widely used across industries. In this work, we focus on what AI can do in manufacturing, in the form of a chatbot. We designed a chatbot that helps users complete an assembly task simulating those found in manufacturing settings: users assemble a Meccanoid robot in multiple stages with the help of an interactive dialogue system. By classifying user intent, the chatbot provides answers or instructions whenever the user encounters problems during assembly. Our goal is for the system to capture users' needs by detecting their intent and thereby provide relevant, helpful information. In a multi-step task, however, intent classification cannot rely on the user's question utterance as the only input, since questions raised at different steps may share the same intent yet require different responses. In this paper, we propose two methods to address this problem. The first captures not only textual features but also visual features through a YOLO-based Masker with CNN (YMC) model; the second uses an autoencoder to encode multi-modal features for user intent classification. Experiments on different datasets show that incorporating visual information significantly improves the chatbot's performance.
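The abstract does not specify implementation details, so the following is only an illustrative sketch of the multi-modal idea: concatenate a textual utterance embedding with visual features of the current assembly step, compress them through an encoder, and classify intent over the shared latent code. All dimensions, weights, and feature sources here (e.g. a BERT-style text embedding, YOLO-derived visual features) are hypothetical stand-ins, not the authors' actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(text_feat, visual_feat, W_enc, b_enc):
    """Concatenate both modalities and project them into a shared latent space."""
    x = np.concatenate([text_feat, visual_feat])
    return np.tanh(W_enc @ x + b_enc)

def classify(latent, W_cls, b_cls):
    """Softmax intent distribution over the latent code."""
    logits = W_cls @ latent + b_cls
    e = np.exp(logits - logits.max())  # shift for numerical stability
    return e / e.sum()

# Hypothetical sizes: 768-d text, 256-d visual, 64-d latent, 10 intent classes.
D_TEXT, D_VIS, D_LAT, N_INTENT = 768, 256, 64, 10
W_enc = rng.normal(0.0, 0.02, (D_LAT, D_TEXT + D_VIS))
b_enc = np.zeros(D_LAT)
W_cls = rng.normal(0.0, 0.02, (N_INTENT, D_LAT))
b_cls = np.zeros(N_INTENT)

text_feat = rng.normal(size=D_TEXT)   # stand-in for an utterance embedding
visual_feat = rng.normal(size=D_VIS)  # stand-in for step-specific visual features
probs = classify(encode(text_feat, visual_feat, W_enc, b_enc), W_cls, b_cls)
```

In a trained system the encoder weights would come from autoencoder reconstruction training and the classifier from supervised intent labels; the point of the sketch is only that the same question text paired with different visual contexts yields different inputs, and hence can yield different intent predictions.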
