The thesis is focused on a specific kind of system, the Linked Multi-component Robotic System (L-MCRS), consisting of a collection of autonomous robots attached to a one-dimensional object: a passive, flexible and/or elastic element that constrains the dynamics of the autonomous robots in a non-linear fashion. Therefore, modeling and prediction of the system dynamics need to take into account the linking element as well as the mobile autonomous robots. In fact, the kind of practical task best suited for this kind of system is related to the manipulation and transport of the one-dimensional object. The paradigmatic example is the transportation or deployment of a hose for fluid disposal.

The present dissertation follows a line of research of the group that has laid some background supporting the present work. First, some proof-of-concept physical systems have been built and tested where the expected effect of the linking element is demonstrated. The hose sometimes hinders the motion of the robots, sometimes introduces drifts, and sometimes drags lagging robots. Some of these systems are commented on in Chapter 2 of this dissertation. Second, a theoretical framework for the accurate modeling and simulation of this kind of system was provided. Geometrically Exact Dynamic Splines (GEDS) allow modeling the hose and the forces playing inside it in response to the external forces exerted by the robots and the environment. In this dissertation, the GEDS model has been adapted to be embedded in the computational experimentation required by the Reinforcement Learning (RL) approach.

Although the physical model demonstrations provide some evidence of the linking element effect, simulation provides a repeatable and fully controlled experimental setting yielding additional evidence for the intuition that L-MCRS belong to a category of systems different from the disconnected collection of robots (D-MCRS). The reasoning is that a control scheme derived with a D-MCRS in mind would not be able to deal with an L-MCRS identical in every respect except for the existence of the linking element; that is, the distinction between system categories lies in their controllability. The experimental setup in Chapter 3 was the formulation of a minimalistic L-MCRS model where the linking element is a compressible spring that exerts some force only when the segment between two robots extends beyond a limit size. A distributed control system was defined for a path-following task with a mixed formation, where each robot unit's control was designed to follow a reference position by a Proportional-Integral controller. Keeping the formation was the role of a distributed control process, where the rear robot position corresponds to the coordination variable. The robots performed a consensus-based asynchronous distributed estimation of the coordination variable, allowing for the successful completion of the task when no linking element was present. The introduction of the linking element produced easily observable interactions between units, rendering the control system ineffective for solving the path-following task. The experiment demonstrates that L-MCRS are a specific category of systems from the point of view of controllability.
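As a rough illustration of this control scheme, the following Python sketch pairs a Proportional-Integral tracking controller with an asynchronous consensus update of the coordination variable (the rear robot position). It is a minimal sketch under assumed gains and a hypothetical mixing weight; the function names and values are illustrative and not taken from the dissertation's implementation.

```python
import numpy as np

def pi_control(error, integral, kp=1.0, ki=0.1, dt=0.05):
    """Proportional-Integral tracking of a reference position.
    Returns the control command and the updated error integral."""
    integral = integral + error * dt
    return kp * error + ki * integral, integral

def consensus_step(estimates, received, weight=0.5):
    """One asynchronous consensus update of the coordination variable.

    estimates : dict robot_id -> local estimate of the rear robot position
    received  : dict robot_id -> list of estimates heard from neighbors
                (possibly empty, since communication is asynchronous)
    weight    : mixing factor toward the neighborhood average (illustrative)
    """
    updated = {}
    for i, x_i in estimates.items():
        x_i = np.asarray(x_i, dtype=float)
        msgs = received.get(i, [])
        if msgs:
            # Pull the local estimate toward the average of heard estimates;
            # repeated rounds drive all robots to a common value (consensus).
            updated[i] = x_i + weight * (np.mean(msgs, axis=0) - x_i)
        else:
            updated[i] = x_i  # nothing heard this round: keep local estimate
    return updated
```

Each robot would feed the agreed coordination variable into the computation of its own reference position and then track that reference with the PI loop above.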
Reinforcement Learning (RL) allows autonomous learning of control systems. The main aim of the Thesis is to show that RL can provide a solution to the autonomous control design problem for L-MCRS. First, we have identified a suitable problem as a prototypic instance of the L-MCRS control problem. Such a problem is the deployment of a hose to make the tip reach a desired position. The state variables of the Markov Decision Process (MDP) and the reward system are the sensitive elements of the definition of the Q-learning system. The definition of the state variables includes the decision about the discretization resolution of the configuration space where the hose is moving. We have found that the discretization resolution can have a strong effect on the computational cost of the process and on its success rate. Low resolutions imply smaller state spaces and higher success rates, because rough approximations to the solution are better tolerated. High resolutions imply greater computational complexity and lower success rates, because the exploration time grows exponentially with system size. Nevertheless, we reach very high success rates in some instances of the learning experiments.

The state variables are determined by the abilities of the agent. The basic ability is to sense its current position, which is allowed in all systems. Next is the ability to perceive the hose and determine whether it has become an obstacle. This ability is the minimal perception required, and it may correspond to very simple sensors in real-life experiments. The system is able to provide good results with this minimal perception ability, under specific reward systems. The ability to determine whether the hose is inside a specific region of the configuration space is a further sophistication of the perception ability of the system. An additional degree of perception is the ability to sense the position of two specific points of the hose, allowing the agent to hold an implicit model of the hose to reason about. Finally, the ability to predict the danger of an undesired termination state one step ahead is the last perception stage reached in our modeling, providing the best results, as expected.

In all cases, the reward policy has a bigger impact on the learning performance than the definition of the state variables. Basically, the reward policies give positive reward for reaching the goal position, negative reward for reaching a failed state, and diverse ways to value the inconclusive states. When there is only positive reward for reaching the goal state, the results are good, meaning that negative reinforcement is not as influential as expected from an intuitive point of view. Simplistic ways to give value to inconclusive states, such as a zero value or a value proportional to the distance of the tip to the desired position, give good learning performance.

We have tested single-robot and two-robot configurations with similar results, the two-robot system improving in some cases on the single-robot configuration. For the two-robot configuration we have tested single-robot reward policies applied to the robot at the tip of the hose, the other robot remaining rewardless, with reasonable learning results, suggesting that teaching the "guiding robot" may be enough for the task. Besides, we have tested two-robot-specific reward systems that improve on the single-robot reward systems.

Finally, learning time is highly dependent on the simulation time employed to reproduce the experiences on the real system. We have tested the improvement introduced by storing the visited state transitions and their corresponding observed rewards in a variation of Q-learning called TRQ-learning. We find improved results with TRQ-learning due to the reduced need for exploration and faster computation.
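To make the learning machinery just summarized concrete, here is a minimal Python sketch combining an illustrative reward system (positive reward at the goal, negative reward at a failed state, distance-proportional value for inconclusive states) with a Q-learner that caches visited transitions and rewards in the spirit of TRQ-learning. Class names, gains, and thresholds are assumptions for illustration, not the dissertation's code.

```python
import math
import random
from collections import defaultdict

def reward(tip_pos, goal_pos, failed, goal_tol=0.1):
    """Illustrative reward system for the hose deployment task."""
    if failed:
        return -1.0                       # undesired termination state
    dist = math.dist(tip_pos, goal_pos)
    if dist < goal_tol:
        return 1.0                        # tip reached the desired position
    return -0.01 * dist                   # inconclusive state: distance shaping

class TRQLearner:
    """Q-learning with a transition/reward cache, sketching the idea behind
    TRQ-learning: stored experiences are replayed instead of re-simulating."""

    def __init__(self, actions, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.Q = defaultdict(float)       # (state, action) -> value estimate
        self.memory = {}                  # (state, action) -> (next_state, reward)
        self.actions = list(actions)
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def act(self, state):
        if random.random() < self.epsilon:          # epsilon-greedy exploration
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.Q[(state, a)])

    def update(self, state, action, r, next_state):
        self.memory[(state, action)] = (next_state, r)  # cache the transition
        best_next = max(self.Q[(next_state, a)] for a in self.actions)
        td_error = r + self.gamma * best_next - self.Q[(state, action)]
        self.Q[(state, action)] += self.alpha * td_error

    def replay(self, n=100):
        """Reuse cached transitions, avoiding costly hose (GEDS) simulations."""
        sample = random.sample(list(self.memory.items()), min(n, len(self.memory)))
        for (s, a), (s2, r) in sample:
            self.update(s, a, r, s2)
```

If the simulated hose dynamics are deterministic, replaying a cached transition is as informative as re-running the simulator, which is where the computational saving would come from.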
As lines of future work, we find highly interesting the research on methodological improvements in the definition of the RL algorithms, allowing faster and more successful learning processes. The hierarchical decomposition of the system into different layers of abstraction can allow the progressive refinement of learning results until reaching the final stage of the most realistic modeling and simulation. Learning in the simple models can be fast, and the refinement learning can be much faster than the brute-force approach on the whole model. Such approaches would need innovative ways to define the equivalence between models and how the transition between levels of abstraction could be made.

We are also interested in bringing the results of the learning on the simulated model into real-life systems, and in exploring more realistic systems closer to industrial applications, such as hose deployment from a compact interleaved state. The physical system design problems are challenging and have been only scratched by research groups interested in similar problems, such as the GII from the Universidad de A Coruña. Innovative hose graspers, minimal mobile robot configurations that may even be folded with the hose in the resting state, and power transmission systems are extremely appealing lines of research.