Reinforcement Learning for Autonomous Navigation on Uncertain Terrain

Implemented a deep reinforcement learning (A3C) pipeline

Work done with Dr. M Vidyasagar FRS (IIT Hyderabad)

Problem

Online, local planning for unmanned ground vehicles (UGVs) navigating unstructured, uneven, and uncertain terrain. The problem is framed as a deep RL task: the agent must learn to navigate cluttered environments containing both static and dynamic obstacles while maintaining traversability on rough surfaces.

A3C Algorithm

The Asynchronous Advantage Actor-Critic (A3C) algorithm runs multiple parallel worker threads, each interacting with its own copy of the environment. Each worker computes local policy gradients and asynchronously updates the shared global network parameters.
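The worker loop can be sketched as follows. This is a simplified, Hogwild-style illustration with a stubbed-out gradient; in the actual pipeline each worker rolls out its own environment copy and computes real policy gradients, and the function and parameter names here are illustrative:

```python
import threading
import numpy as np

def worker(global_params, n_updates, lr=0.01, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(n_updates):
        # 1. Synchronize: copy the shared global parameters into a local model.
        local_params = global_params.copy()
        # 2. Roll out t_max steps in this worker's own environment copy and
        #    compute a local policy-gradient estimate (stubbed with noise here).
        grad = rng.normal(size=local_params.shape)
        # 3. Apply the local gradient to the shared parameters asynchronously
        #    (lock-free updates, as in the original A3C formulation).
        global_params -= lr * grad

shared = np.zeros(8)
workers = [threading.Thread(target=worker, args=(shared, 20, 0.01, s))
           for s in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```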

The policy gradient with advantage estimation:

\[ \nabla_\theta J(\theta) = \mathbb{E}\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t) \, A(s_t, a_t)\right] \]

where the advantage function is estimated via Generalized Advantage Estimation (GAE):

\[ \hat{A}_t = \sum_{l=0}^{k-1} (\gamma \lambda)^l \, \delta_{t+l}, \quad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t) \]
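The GAE sum above telescopes into a backward recursion over a rollout, which can be sketched in a few lines of NumPy (function and parameter names are illustrative):

```python
import numpy as np

def gae_advantages(rewards, values, next_value, gamma=0.99, lam=0.95):
    """A_t = sum_l (gamma*lam)^l * delta_{t+l}, computed backwards."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.append(values, next_value)   # V(s_0..s_k), bootstrap with V(s_k)
    # delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    deltas = rewards + gamma * values[1:] - values[:-1]
    advantages = np.zeros_like(deltas)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        gae = deltas[t] + gamma * lam * gae  # backward recursion
        advantages[t] = gae
    return advantages
```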

The value loss uses the squared temporal difference error, and an entropy bonus \(H(\pi_\theta)\) encourages exploration:

\[ \mathcal{L} = \mathcal{L}_{\text{policy}} + c_1 \, \mathcal{L}_{\text{value}} - c_2 \, H(\pi_\theta) \]
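A minimal sketch of the combined loss for a discrete softmax policy is shown below; the advantages are treated as constants (a stop-gradient in a real autodiff implementation), and all names and the coefficient defaults are illustrative:

```python
import numpy as np

def a3c_loss(logits, actions, advantages, returns, values, c1=0.5, c2=0.01):
    """L = L_policy + c1 * L_value - c2 * H(pi), batch-averaged."""
    # Log-probabilities of a softmax policy over discrete actions.
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    probs = np.exp(logp)
    # Policy loss: -E[log pi(a_t | s_t) * A_t], advantages held constant.
    policy_loss = -(logp[np.arange(len(actions)), actions] * advantages).mean()
    # Value loss: squared error against the bootstrapped returns.
    value_loss = ((returns - values) ** 2).mean()
    # Entropy bonus encourages exploration.
    entropy = -(probs * logp).sum(axis=1).mean()
    return policy_loss + c1 * value_loss - c2 * entropy
```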

Octomap-Based Occupancy Mapping

The environment is represented using OctoMap, a hierarchical 3D occupancy grid based on octrees. OctoMap provides efficient memory usage through its multi-resolution representation and supports probabilistic updates from noisy sensor data. The occupancy probability is updated via a log-odds formulation:

\[ L(n \mid z_{1:t}) = L(n \mid z_{1:t-1}) + L(n \mid z_t) \]

where \(L\) is the log-odds representation of occupancy probability for node \(n\) given sensor measurements \(z\).
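The additive log-odds update can be sketched per node as follows. The hit/miss values (0.7 / 0.4) and clamping thresholds (0.12 / 0.97) are the OctoMap library's default sensor-model parameters; the function names are illustrative:

```python
import math

def logit(p):
    """Log-odds of a probability p."""
    return math.log(p / (1.0 - p))

def update_node(l_prev, hit, l_occ=logit(0.7), l_free=logit(0.4),
                l_min=logit(0.12), l_max=logit(0.97)):
    """L(n | z_{1:t}) = L(n | z_{1:t-1}) + L(n | z_t), clamped so a node
    can still change state after a long run of consistent measurements."""
    l_new = l_prev + (l_occ if hit else l_free)
    return max(l_min, min(l_max, l_new))

def occupancy_prob(l):
    """Recover the occupancy probability from log-odds."""
    return 1.0 / (1.0 + math.exp(-l))
```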

Progressive Obstacle Curriculum

  • Obstacles are introduced progressively during training, encouraging early exploration before the environment becomes cluttered
  • Obstacle density is scheduled: starting sparse and increasing to the target density over the first 50% of training
  • Both static obstacles (walls, rocks) and dynamic obstacles (moving objects) are incorporated
  • User-defined obstacle placement allows testing specific navigation scenarios
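The density schedule in the second bullet can be implemented as a simple linear ramp; the exact ramp shape used in the project is not specified, so this is one plausible sketch:

```python
def obstacle_density(step, total_steps, target_density, start_density=0.0):
    """Ramp linearly from start_density to target_density over the first
    half of training, then hold at the target."""
    ramp_steps = max(total_steps // 2, 1)
    frac = min(step / ramp_steps, 1.0)
    return start_density + frac * (target_density - start_density)
```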

Results

The progressive curriculum approach resulted in a 14% increase in success rate for the Pioneer robot navigating uneven terrain in Gazebo simulation, compared to training with a fixed obstacle density from the start.

Figure: The Pioneer robot navigating an all-terrain environmental scene in Gazebo simulation.