Towards Robust Human-Robot Collaborative Manufacturing: Multimodal Fusion

Intuitive and robust multimodal robot control is key to human–robot collaboration (HRC) in manufacturing systems. Multimodal robot control methods introduced in previous studies allow human operators to control robots intuitively, without writing brand-specific code. However, most of these methods are unreliable because feature representations are not shared across modalities. To address this problem, this paper proposes a deep learning-based multimodal fusion architecture for robust multimodal HRC manufacturing systems. The proposed architecture covers three modalities: speech command, hand motion, and body motion. Three unimodal models are first trained to extract features, and these features are then fused for representation sharing. Experiments show that the proposed multimodal fusion model outperforms the three unimodal models. The results indicate strong potential for applying the proposed multimodal fusion architecture to robust HRC manufacturing systems.
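The fusion step described above can be illustrated with a minimal sketch. The abstract does not state how the features are fused, so this example assumes simple feature-level fusion by concatenation followed by a single linear layer and a softmax. The function name `fuse_and_classify`, the feature sizes, the number of robot commands, and the random weights are all hypothetical placeholders, not values from the paper.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_classify(speech_feat, hand_feat, body_feat, weights, bias):
    """Concatenate three unimodal feature vectors and classify the result.

    Assumption: fusion is plain concatenation; the paper may use a
    different fusion scheme.
    """
    fused = list(speech_feat) + list(hand_feat) + list(body_feat)
    if any(len(row) != len(fused) for row in weights):
        raise ValueError("each weight row must match the fused feature size")
    logits = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


# Hypothetical feature vectors standing in for the outputs of the three
# pre-trained unimodal models (speech command, hand motion, body motion).
rng = random.Random(0)
speech_feat = [rng.random() for _ in range(4)]
hand_feat = [rng.random() for _ in range(3)]
body_feat = [rng.random() for _ in range(5)]

num_commands = 6  # hypothetical number of robot commands
fused_dim = len(speech_feat) + len(hand_feat) + len(body_feat)

# Untrained placeholder parameters; a real system would learn these.
weights = [
    [rng.uniform(-0.1, 0.1) for _ in range(fused_dim)]
    for _ in range(num_commands)
]
bias = [0.0] * num_commands

probs = fuse_and_classify(speech_feat, hand_feat, body_feat, weights, bias)
predicted = max(range(num_commands), key=probs.__getitem__)
print("command probabilities:", [round(p, 3) for p in probs])
print("predicted command index:", predicted)
```

Because the classifier sees all three feature vectors at once, it can weigh evidence across modalities, which is the intuition behind sharing representations rather than relying on each unimodal model separately.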
