Sample Efficient Learning of Path Following and Obstacle Avoidance Behavior for Quadrotors


ABSTRACT

In this paper we propose an algorithm for training neural network control policies for quadrotors. The learned control policy computes control commands directly from sensor inputs and is hence computationally efficient. An imitation learning algorithm produces a policy that reproduces the behavior of a supervisor, which provides demonstrations of path following and collision avoidance maneuvers. Due to the generalization ability of neural networks, the resulting policy performs local collision avoidance while following a global reference path. The algorithm uses a time-free model predictive path-following controller as the supervisor, which generates demonstrations by following a few example paths. This enables an easy-to-implement learning algorithm that is robust to errors in the model used by the model predictive controller. The policy is trained on the real quadrotor, which requires collision-free exploration around the example paths; an adapted version of the supervisor is used to enable this exploration. Thus, the policy can be trained from a relatively small number of examples on the real quadrotor, making the training sample efficient.
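To illustrate why such a policy is cheap to evaluate, the following minimal sketch (in PyTorch, with the layer sizes and sensor dimension chosen arbitrarily rather than taken from the paper) shows a small feed-forward network that maps a raw sensor vector to bounded control commands. Evaluating it amounts to a handful of matrix multiplications, in contrast to solving an online trajectory optimization at every control step.

    import torch
    import torch.nn as nn

    class SensorPolicy(nn.Module):
        # Small MLP mapping raw sensor readings to control commands.
        # Layer sizes and the sensor dimension are illustrative assumptions.
        def __init__(self, n_sensor=40, n_cmd=3, hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(n_sensor, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, n_cmd), nn.Tanh(),  # keep commands bounded in [-1, 1]
            )

        def forward(self, sensor_readings):
            return self.net(sensor_readings)

    # A single forward pass is only a few matrix multiplications.
    policy = SensorPolicy()
    command = policy(torch.randn(1, 40))  # dummy sensor vector of dimension n_sensor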


EXISTING SYSTEM:

Many applications of micro aerial vehicles (MAVs) require safe navigation in environments with obstacles and therefore need methods for trajectory planning and real-time collision avoidance. Several strategies exist to make this problem computationally tractable. The use of model-free controllers with path planning is computationally attractive but requires conservative flight. Model-based methods, including local receding-horizon methods such as Model Predictive Control (MPC), combinations of slow global planning with fast local avoidance, or avoidance via search over a motion primitive library, are computationally demanding but can achieve more aggressive maneuvers. A theoretical analysis of the dynamical system can provide insights in a limited number of cases, leading to faster computation times. These methods have limited scope, since each takes a specific dynamics model into account. Furthermore, these methods require estimation of obstacle positions from the sensor data.
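To make concrete why receding-horizon methods such as MPC are computationally demanding, the following toy sketch plans for a 2-D double integrator with a soft obstacle penalty; the model, cost weights, and horizon are assumptions for illustration only and do not correspond to the methods cited above. At every control step an optimization over the whole horizon must be solved, and only the first control is applied before re-planning.

    import numpy as np
    from scipy.optimize import minimize

    DT, H = 0.1, 10  # time step and horizon length (illustrative)

    def rollout(x0, u_flat):
        # Integrate position/velocity state under a sequence of accelerations.
        xs, x = [], x0.copy()
        for u in u_flat.reshape(H, 2):
            x[:2] += DT * x[2:]   # position update
            x[2:] += DT * u       # velocity update
            xs.append(x.copy())
        return np.array(xs)

    def cost(u_flat, x0, goal, obstacle):
        xs = rollout(x0, u_flat)
        track = np.sum((xs[:, :2] - goal) ** 2)                                # path-following term
        avoid = np.sum(np.exp(-np.linalg.norm(xs[:, :2] - obstacle, axis=1)))  # soft obstacle penalty
        effort = 0.01 * np.sum(u_flat ** 2)
        return track + 10.0 * avoid + effort

    x0 = np.zeros(4)  # [px, py, vx, vy]
    res = minimize(cost, np.zeros(2 * H),
                   args=(x0, np.array([5.0, 0.0]), np.array([2.5, 0.2])))
    u_now = res.x[:2]  # apply only the first control, then solve again at the next step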

PROPOSED SYSTEM:

In this paper, we address such issues with a novel imitation learning algorithm (summarized schematically below) that produces control commands directly from sensor inputs. Producing control signals directly from sensor inputs has two main benefits. First, the algorithm does not require estimation of obstacle positions. Second, function approximators such as neural networks can be much more computationally efficient than traditional planning methods while still achieving safe flight. Learning can also be combined with motion planning, for example by coupling a learned low-level controller with a path planning algorithm. This hybrid approach shows that control for quadrotor navigation can be learned, but it still requires expensive off-line collision avoidance. Our algorithm produces a controller that learns how to avoid collisions and runs in real time.

The most general approach to learning a controller, here called a control policy, is model-free reinforcement learning (RL), a class of methods that learns the control policy through interaction with the environment. However, these methods are sample inefficient, requiring a large number of trials, and can therefore only be applied in simulation. A more sample-efficient option is model-based RL, where the model parameters are learned while the control policy is optimized. In this setting, learning the model requires dangerous maneuvers, which can lead to damage to the quadrotor or the environment. A final option is to learn the policy by imitating an oracle, either a human pilot or an optimization algorithm. If the oracle can provide examples of safe maneuvers, this is the most immediate choice for learning policies on real quadrotors.

Imitating the oracle is not a trivial task, primarily because data from the ideal trajectory alone is not enough to learn a policy: it provides no examples of how to correct drift away from the ideal trajectory. In prior work, the policy learns to steer the quadrotor from human pilot demonstrations; the biggest challenge there was collecting sufficient data, since it is difficult to control the quadrotor manually. We resolve this issue by learning the policy from a trajectory optimization oracle. Imitating a trajectory optimization oracle requires generating data that can efficiently train the control policy, and two main issues arise with this approach. First, efficiently generating training data is not straightforward, because the trajectory optimization is computationally demanding. Thus, in prior work a single trajectory was used to compute control signals for states on the trajectory and states close to it; however, this approach only works in simulation with a perfectly known model. Second, the algorithm needs to work with an approximate model to be applicable on a real-world system. In prior work, a long-horizon trajectory was followed with a short-horizon Model Predictive Controller (MPC) to generate training data efficiently; since the model is not correct, this learning algorithm requires a complex adaptation strategy that guides the control policy to the desired behavior. Alternatively, one could use only a short-horizon MPC to provide samples for training the control policy, but MPC can produce suboptimal solutions that lead to deadlocks or collisions during training.
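A minimal sketch of the kind of data-aggregation (DAgger-style) training loop described above follows. The helpers mpc_supervisor and fly_one_segment are hypothetical placeholders standing in for the time-free MPCC supervisor and for a collision-free rollout near an example path (both assumed to return torch tensors), and the optimizer settings are arbitrary assumptions rather than the paper's.

    import torch
    import torch.nn as nn

    def train_policy(policy, mpc_supervisor, fly_one_segment, iterations=10, epochs=50):
        dataset = []  # aggregated (sensor input, supervisor command) pairs
        opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
        loss_fn = nn.MSELoss()

        for _ in range(iterations):
            # 1) Roll out the current policy near the example path; the supervisor
            #    labels every visited state with the command it would have applied.
            for sensor, state in fly_one_segment(policy):
                dataset.append((sensor, mpc_supervisor(state)))

            # 2) Re-fit the policy on all data collected so far (behavior cloning step).
            sensors = torch.stack([s for s, _ in dataset])
            labels = torch.stack([u for _, u in dataset])
            for _ in range(epochs):
                opt.zero_grad()
                loss = loss_fn(policy(sensors), labels)
                loss.backward()
                opt.step()
        return policy

Aggregating supervisor labels for states actually visited by the policy is what supplies examples of correcting drift away from the ideal trajectory, which pure demonstration data lacks.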

CONCLUSION:

We have proposed a method for learning neural network control policies in an imitation learning setting. The approach leverages a time-free MPCC path-following controller as a supervisor in both off-policy and on-policy learning. We experimentally verified that the approach converges to stable policies that can be rolled out successfully in unseen environments, both in simulation and in the real world. Furthermore, we demonstrated that the policies generalize well to unseen environments, and we have made an initial exploration of rolling out policies in dynamic environments.