Natural Language Controlled Real-Time Object Recognition Framework for Household Robot

Assistive robot systems are designed to help individuals with physical disabilities perform routine activities of daily living. Such systems are increasingly capable of performing a variety of tasks that benefit the human user, which creates the need to specify which particular task the user wants the system to perform. User interfaces such as joysticks can provide fine-grained control commands, but issuing them is time-consuming and both mentally and physically taxing. This article describes an ongoing project that aims to give people with physical disabilities a way to control a robot arm using voice commands. The goal is to let the user control the system in natural language, i.e., without learning a special robot-control vocabulary. The article presents the design and evaluation of a real-time framework that combines speech recognition, camera-based object detection, and an inference module that matches the potentially ambiguous speech-recognition results against the object-detection outputs to generate a unique control input for a computer-vision-based robot arm. This integration of natural language and object detection reduces the ambiguity in specifying tasks, one of the major bottlenecks in voice-based user interfaces. A modified version of the deep-learning-based object detection network YOLO (You Only Look Once) is used to identify all potential objects of interest in the environment. Evaluation of this integrated voice- and object-recognition-based user interface indicates that tasks can be specified accurately in a range of settings.
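The inference step described above can be illustrated with a minimal sketch: given a set of ranked speech-recognition hypotheses and a set of YOLO-style detections, pick the single object whose label appears in a hypothesis, weighting by both confidence scores. The function name, the data shapes, and the scoring rule here are illustrative assumptions, not the framework's actual implementation.

```python
# Hypothetical sketch of matching ambiguous ASR hypotheses against
# object-detector outputs to resolve a unique target object.

def resolve_target(speech_hypotheses, detections):
    """Return the detected object label that best matches the spoken command.

    speech_hypotheses: list of (transcript, confidence) pairs from the ASR.
    detections: list of (label, confidence) pairs from the object detector.
    """
    best_label, best_score = None, 0.0
    for transcript, asr_conf in speech_hypotheses:
        words = set(transcript.lower().split())
        for label, det_conf in detections:
            if label.lower() in words:       # a spoken word names a detected object
                score = asr_conf * det_conf  # combine both confidence scores
                if score > best_score:
                    best_label, best_score = label, score
    return best_label

# A misheard top hypothesis ("cop") matches no detected object, so the
# lower-ranked hypothesis combined with a high-confidence detection wins.
hyps = [("pick up the cop", 0.60), ("pick up the cup", 0.35)]
dets = [("cup", 0.92), ("bottle", 0.88)]
print(resolve_target(hyps, dets))  # → cup
```

The design choice mirrors the abstract's point: neither modality alone is unambiguous, but intersecting the two candidate sets usually leaves a single actionable target.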
