TY - CONF
T1 - TOWARDS CONTINUOUS CONTROL FOR MOBILE ROBOT NAVIGATION: A REINFORCEMENT LEARNING AND SLAM BASED APPROACH
AU - Khaled, Mustafa
AU - Sirmaçek, Beril
AU - Stramigioli, Stefano
AU - Poel, Mannes
PY - 2019/6
Y1 - 2019/6
N2 - Autonomous navigation of robots in unknown environments from their current position to a desired target without colliding with obstacles represents an important aspect in the field of mobile robots. In the literature, traditional methods exist for the case where complete knowledge of the environment is available. However, this is not the case in real-life applications, where complete knowledge about stochastic environments can hardly be obtained and the positions of the obstacles are unknown. Our main goal is to navigate a skid-steering mobile robot (SSMR) with non-holonomic constraints in an unknown environment to its desired target in minimum time while avoiding collisions with obstacles. In the context of autonomous cart navigation, autonomous navigation of mobile robots in unstructured environments can be formulated as a reinforcement learning (RL) problem. Reinforcement learning is a paradigm in which the agent (robot) learns its optimal path by interacting with the environment and receiving a reward based on its actions. Based on this action-reward scenario, the optimal action for each state can be discovered by maximizing a predefined accumulated reward that reflects the quality of the trajectory taken by the robot. Reinforcement learning can be categorized into two main classes of methods (Kober et al., 2013), namely value-based methods (Q-learning, Deep Q-learning) and policy-based methods (REINFORCE with policy gradient). (i) Value-based methods: In these methods, the value function that maps each state-action pair to a value is learned. With these methods, the best action to take in each state, the action with the highest value, can be found. This works well in the case of a finite set of actions. (ii) Policy-based methods: In these methods, instead of learning value functions that give an indication of the expected total reward for each state-action pair, the policy function that maps states to actions is optimized directly (Arulkumaran et al., 2017). Policy search methods are more effective in high-dimensional action spaces, or when using continuous actions as in the case of mobile robot navigation (Deisenroth et al., 2013). However, since both of these methods have their own drawbacks, there is a hybrid method called actor-critic that employs both value functions and policy search (Arulkumaran et al., 2017). In this method, the actor adjusts the policy parameters by policy gradient ascent, whereas the critic estimates the action-value function using a policy evaluation algorithm such as temporal difference (TD) learning. This can be done through two neural networks running in parallel. Herein, we introduce a reinforcement learning based navigation system that enhances the navigation capabilities of a skid-steering mobile robot. This goal is achieved by using the asynchronous deep deterministic policy gradient (ADDPG) through an actor-critic algorithm (Mnih et al., 2016). Two deep neural networks are introduced in this case. The first one (actor network) transfers the input vector that represents the states of the robot into linear and angular velocity commands that represent the actions, whereas the second neural network (critic network) is responsible for predicting the Q-value of the state-action pairs. Deterministic policies are utilized since it has been proposed that they outperform their stochastic counterparts in high-dimensional action spaces (Silver et al., 2014).
On the other hand, Simultaneous Localization and Mapping (SLAM) techniques are to be integrated with reinforcement learning in an attempt to improve the learning rate by providing a more accurate estimation of the robot's states. SLAM has been an active topic for many years (Cadena et al., 2016) because it provides two fundamental components to robotics: where I am, and what I see. Herein, we introduce how the navigation problem of non-holonomic mobile robots can be formulated as a reinforcement learning problem that can be solved using the ADDPG actor-critic algorithm. Besides, we share experimental results on whether SLAM helps reinforcement learning at all, by comparing the learning rate with and without SLAM (i.e., when a partial map of the environment is given versus when no map is available). For the implementation of the parallel running neural networks, we perform our initial experiments using the GPU cluster of the University of Twente. However, when it comes to real-life implementations and demonstrations on real robots, the hardware becomes the key component that determines whether the goal can be achieved or not. Therefore, we implement the trained networks on an NVIDIA Jetson TX2 in order to show real-life implementation possibilities when the algorithms need to run on a navigating robot (Jetson TX2 Module, online documentation, n.d.). In this sense, the NVIDIA Jetson TX2 offers the opportunity to run even multiple complex neural networks in real time on a credit-card-sized processing unit which can easily be carried by a small (less than 20 cm length) robot. We present our experiments on sensor selection and the OpenAI package in the ROS-Gazebo platform as a possible simulation environment. Last but not least, we discuss the real-life implementation scenarios using our NVIDIA Jetson TX2 and how realistic the simulation results could be when the algorithms are implemented on the real robot. The early experiments indicate that, for an optimal design of a SLAM and reinforcement learning based system to truly work in real-world applications in order to reach a goal in an unknown environment, the hardware setup and the algorithms must be designed collaboratively.
M3 - Paper
T2 - ISPRS Workshop Indoor 3D 2019
Y2 - 11 June 2019 through 12 June 2019
ER -