Paper ID | MLSP-27.1 |
Paper Title | Gaussian Process Temporal-Difference Learning with Scalability and Worst-Case Performance Guarantees |
Authors | Qin Lu, Georgios B. Giannakis, University of Minnesota, United States |
Session | MLSP-27: Reinforcement Learning 3 |
Location | Gather.Town |
Session Time | Thursday, 10 June, 13:00 - 13:45 |
Presentation Time | Thursday, 10 June, 13:00 - 13:45 |
Presentation | Poster |
Topic | Machine Learning for Signal Processing: [MLR-REI] Reinforcement learning |
Abstract | Value function approximation is a crucial module for policy evaluation in reinforcement learning when the state space is large or continuous. The present paper revisits policy evaluation via temporal-difference (TD) learning from the Gaussian process (GP) perspective. Leveraging random features to approximate the GP prior, an online scalable (OS) approach, termed OS-GPTD, is developed to estimate the value function for a given policy by observing a sequence of state-reward pairs. To benchmark the performance of OS-GPTD even in the adversarial setting, where the modeling assumptions are violated, complementary worst-case analyses are performed. The cumulative Bellman error and the long-term reward prediction error are upper bounded relative to their counterparts from a fixed value function estimator with the entire state-reward trajectory available in hindsight. Performance of the novel OS-GPTD is evaluated on two benchmark problems. |
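The abstract's recipe (a GP prior over the value function, approximated with random features and updated online from state-reward pairs) can be illustrated with a minimal sketch. The code below assumes random Fourier features for an RBF kernel and an Engel-style GPTD observation model, r_t = (phi(s_t) - gamma*phi(s_{t+1}))^T theta + noise, with rank-one Bayesian (Kalman-style) recursions on the feature weights; the class name, parameter choices, and update rule are illustrative assumptions, not the authors' exact OS-GPTD algorithm.

```python
import numpy as np

class RFGPTDSketch:
    """Illustrative random-feature GP-TD value estimator (sketch only).

    Value model: V(s) ~= phi(s)^T theta, with Gaussian prior theta ~ N(0, I).
    TD observation model: r_t = (phi(s_t) - gamma * phi(s_{t+1}))^T theta + noise.
    The Gaussian posterior over theta is updated one transition at a time.
    """

    def __init__(self, state_dim, num_features=200, lengthscale=1.0,
                 gamma=0.99, noise_var=0.1, seed=None):
        rng = np.random.default_rng(seed)
        # Random Fourier features approximating an RBF (Gaussian) kernel prior.
        self.W = rng.normal(0.0, 1.0 / lengthscale, size=(num_features, state_dim))
        self.b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)
        self.gamma = gamma
        self.noise_var = noise_var
        # Gaussian posterior over the feature weights theta.
        self.mean = np.zeros(num_features)
        self.cov = np.eye(num_features)

    def features(self, s):
        z = self.W @ np.asarray(s, dtype=float) + self.b
        return np.sqrt(2.0 / len(self.b)) * np.cos(z)

    def update(self, s, r, s_next):
        """Rank-one Bayesian update from a single (s, r, s') transition."""
        h = self.features(s) - self.gamma * self.features(s_next)
        Ch = self.cov @ h
        gain = Ch / (self.noise_var + h @ Ch)   # Kalman-style gain
        self.mean = self.mean + gain * (r - h @ self.mean)
        self.cov = self.cov - np.outer(gain, Ch)

    def value(self, s):
        """Posterior mean and variance of V(s)."""
        phi = self.features(s)
        return phi @ self.mean, phi @ self.cov @ phi
```

With D random features, each update costs O(D^2) regardless of how many transitions have been observed, which is the general sense in which random-feature approximations of the GP prior enable scalable, online operation.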