LFA-NPG: Computationally Efficient RL

Published July 30, 2021

Nationally Recognized by Regeneron Science Talent Search 2024

Featured on NPR & KCRA

My paper “LFA-NPG: Computationally Efficient Reinforcement Learning” presents the mathematical theory and evaluation methodology of LFA-NPG, a computationally efficient Reinforcement Learning algorithm that I designed. LFA-NPG is a policy-space algorithm built on a Natural Actor-Critic optimization method that uses linear-function-approximation-based value estimation. In sufficiently low-dimensional use cases, LFA-NPG outperforms the leading policy-space RL algorithms TRPO and PPO.

Abstract:

Neural Network based approximations of the Value function make up the core of leading Policy Based methods such as Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO). While this adds significant value when dealing with very complex environments, I note that in environments with sufficiently low-dimensional state and action spaces, a computationally expensive Neural Network architecture offers only marginal improvement over simpler Value approximation methods. I present an implementation of Natural Actor-Critic algorithms with actor updates performed through Natural Policy Gradient methods. This paper proposes that Natural Policy Gradient (NPG) methods using Linear Function Approximation for value estimation may surpass the performance and speed of Neural Network based models such as TRPO and PPO within these environments. On the Reinforcement Learning benchmarks CartPole and Acrobot, I observe that my algorithm trains over 20% faster than these complex neural network architectures and obtains equivalent or greater reward. This leads me to recommend NPG methods with Linear Function Approximation over TRPO and PPO for both traditional and sparse-reward low-dimensional problems.
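To make the actor update concrete, the sketch below shows one natural policy gradient step in Python, assuming per-sample score-function gradients and advantage estimates have already been collected. The function name, damping term, and fixed step size are illustrative assumptions, not the paper's exact formulation.

import numpy as np

def npg_actor_update(theta, score_grads, advantages, step_size=0.1, damping=1e-3):
    """One natural policy gradient (NPG) actor update (illustrative sketch).

    theta       : policy parameters, shape (d,)
    score_grads : per-sample score functions grad_theta log pi(a|s), shape (N, d)
    advantages  : advantage estimate for each sample, shape (N,)
    """
    # Vanilla policy gradient: advantage-weighted average of the score functions.
    g = (score_grads * advantages[:, None]).mean(axis=0)

    # Empirical Fisher information matrix, damped so it stays invertible.
    F = score_grads.T @ score_grads / len(score_grads) + damping * np.eye(len(theta))

    # Natural gradient direction F^{-1} g; solve a linear system rather than invert.
    nat_grad = np.linalg.solve(F, g)

    return theta + step_size * nat_grad

The natural gradient preconditions the vanilla gradient with the inverse Fisher information matrix, which makes the update insensitive to how the policy happens to be parameterized.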

The Big Idea

Reinforcement Learning (RL) is used in a number of fields, from Natural Language Processing (à la ChatGPT) to self-driving cars. Most RL algorithms use high-dimensional neural networks to measure how well the model performs during training steps (value estimation). However, for low-dimensional use cases, these neural networks can be overkill and can consume valuable computing resources on live robotic implementations. As such, I aimed to create LFA-NPG: a reinforcement learning algorithm that uses mathematically simple linear function approximation for value estimation.
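As a rough illustration of how lightweight a linear critic can be, a value function of the form V(s) ≈ w · phi(s) can be fit with a single regularized least-squares solve. The feature map and value targets below are illustrative assumptions, not the exact construction from the paper.

import numpy as np

def fit_linear_value_function(states, value_targets, reg=1e-3):
    """Fit V(s) ~ w . phi(s) by ridge-regularized least squares (sketch).

    states        : array of shape (N, state_dim)
    value_targets : estimated return for each state, shape (N,)
    """
    # Simple feature map: the raw state plus a bias term (an illustrative choice).
    phi = np.hstack([states, np.ones((len(states), 1))])

    # Closed-form ridge regression: (phi^T phi + reg*I) w = phi^T targets.
    A = phi.T @ phi + reg * np.eye(phi.shape[1])
    b = phi.T @ value_targets
    return np.linalg.solve(A, b)

def predict_value(w, state):
    """Evaluate the fitted linear value function at a single state."""
    return w @ np.append(state, 1.0)

Compared with training a multi-layer neural network critic by gradient descent, this is a single matrix solve per batch of data, which suggests why the per-iteration cost can be much lower in small state spaces.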

On the left is the performance of successive training iterations of LFA-NPG on the CartPole and Acrobot environments (OpenAI Gym). LFA-NPG reaches a high-reward state very quickly because of its short time per iteration. It is important to note that CartPole and Acrobot are both low-dimensional (<10 dimensions), which is why the value estimation structure of LFA-NPG is able to perform well.
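To see how small these observation spaces are, the dimensionality can be checked directly in OpenAI Gym. The snippet below uses the standard CartPole-v1 and Acrobot-v1 environment IDs and the classic Gym API, which may differ slightly across Gym versions.

import gym

# CartPole: 4-dimensional observations (cart position/velocity, pole angle/angular velocity).
# Acrobot: 6-dimensional observations (sin/cos of both joint angles plus two angular velocities).
for env_id in ["CartPole-v1", "Acrobot-v1"]:
    env = gym.make(env_id)
    print(env_id,
          "observation dim:", env.observation_space.shape[0],
          "| discrete actions:", env.action_space.n)
    env.close()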

Cartpole

The performance of LFA-NPG on CartPole matches that of PPO and TRPO, approaching the optimal reward of 200 in a similar number of policy iterations. However, because each iteration trains significantly faster, LFA-NPG completes training over 20 percent faster than TRPO and PPO.


Acrobot

The Acrobot environment is a sparse-reward environment. As its reward behavior shows, LFA-NPG is actually better suited to this type of environment than TRPO and PPO, because its algebraically computed step size allows more variability than the Kullback-Leibler constraint (TRPO) or the clipped surrogate objective (PPO).
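For context on the step-size comparison, one common way an NPG step size is computed algebraically (stated here as the general technique, not necessarily the paper's exact rule) is to scale the natural gradient so that the quadratic approximation of the KL divergence of each update matches a fixed budget delta.

import numpy as np

def adaptive_npg_step_size(g, nat_grad, kl_budget=0.01):
    """Step size eta = sqrt(2*delta / (g^T F^{-1} g)) for a natural gradient update.

    g        : vanilla policy gradient, shape (d,)
    nat_grad : natural gradient direction F^{-1} g, shape (d,)
    """
    quad = float(g @ nat_grad)  # equals g^T F^{-1} g
    return np.sqrt(2.0 * kl_budget / max(quad, 1e-8))

Because a step size of this form is recomputed from the current gradient geometry at every iteration, it can vary more freely than a hard KL constraint or a fixed clipping range.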
