A framework for the fusion of visual and tactile modalities for improving robot perception

Robots should ideally perceive objects through human-like multi-modal sensing such as vision, touch, smell, and hearing. However, each sensing modality represents its features differently, and the feature extraction methods also differ between modalities. Visual features capture spatial properties and are static, whereas tactile features capture temporal patterns and are dynamic. This makes it difficult to fuse the data at the feature level for robot perception. In this study, we propose a framework for fusing visual and tactile features. It comprises feature extraction, feature vector normalization and generation based on bag-of-systems (BoS), coding by robust multi-modal joint sparse representation (RM-JSR), and classification, and thereby addresses the problem of fusing diverse modal data at the feature level. Finally, comparative experiments are carried out to demonstrate the performance of the framework.

Innovation: We propose a visual-tactile information fusion framework and a robust multi-modal joint sparse representation coding method. Together they address the feature-level fusion difficulty that arises because the visual (static) and tactile (dynamic) cross-modal information used in robot perception has feature spaces of different dimensions. The framework comprises visual and tactile feature extraction, normalization of feature vectors of different dimensions with a "bag-of-words" algorithm, robust multi-modal joint sparse representation coding, and classification with the visual-tactile fusion algorithm.
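The normalization step can be illustrated with a small sketch. The abstract does not give implementation details, so the code below is only an assumption-laden illustration: it swaps in a plain Euclidean k-means codebook (a generic bag-of-words encoder) in place of the bag-of-systems model, and the function names `build_codebook` and `encode_histogram` are hypothetical. It shows how descriptor sets of different sizes and dimensions can each be mapped to a fixed-length histogram, which is the property a later joint coding stage would rely on.

```python
import numpy as np

def build_codebook(descriptors, k, iters=20, seed=0):
    """Plain k-means over an (n, d) descriptor array; returns (k, d) codewords.

    Assumption: Euclidean k-means stands in for the BoS codebook here.
    """
    rng = np.random.default_rng(seed)
    start = rng.choice(len(descriptors), size=k, replace=False)
    codebook = descriptors[start].astype(float)
    for _ in range(iters):
        # Distance from every descriptor to every codeword: shape (n, k).
        dists = np.linalg.norm(
            descriptors[:, None, :] - codebook[None, :, :], axis=2
        )
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = descriptors[labels == j]
            if len(members):  # leave an empty cluster's codeword unchanged
                codebook[j] = members.mean(axis=0)
    return codebook

def encode_histogram(descriptors, codebook):
    """Map a variable-size descriptor set to a length-k normalized histogram."""
    dists = np.linalg.norm(
        descriptors[:, None, :] - codebook[None, :, :], axis=2
    )
    labels = dists.argmin(axis=1)
    hist = np.bincount(labels, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Synthetic stand-ins: 50 visual descriptors (8-D), 30 tactile ones (4-D).
    visual = rng.normal(size=(50, 8))
    tactile = rng.normal(size=(30, 4))
    k = 5
    visual_hist = encode_histogram(visual, build_codebook(visual, k))
    tactile_hist = encode_histogram(tactile, build_codebook(tactile, k))
    # Both modalities now have the same fixed length k.
    print(visual_hist.shape, tactile_hist.shape)
```

In this sketch each modality gets its own codebook, so the two histograms share a length even though the raw descriptors do not share a dimension. The sparse coding and classification stages are not shown.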
