Paper ID | MLSP-27.1
Paper Title | Gaussian Process Temporal-Difference Learning with Scalability and Worst-Case Performance Guarantees
Authors | Qin Lu, Georgios B. Giannakis, University of Minnesota, United States
Session | MLSP-27: Reinforcement Learning 3
Location | Gather.Town
Session Time | Thursday, 10 June, 13:00 - 13:45
Presentation Time | Thursday, 10 June, 13:00 - 13:45
Presentation | Poster
Topic | Machine Learning for Signal Processing: [MLR-REI] Reinforcement learning
Abstract | Value function approximation is a crucial module for policy evaluation in reinforcement learning when the state space is large or continuous. The present paper revisits policy evaluation via temporal-difference (TD) learning from the Gaussian process (GP) perspective. Leveraging random features to approximate the GP prior, an online scalable (OS) approach, termed OS-GPTD, is developed to estimate the value function of a given policy by observing a sequence of state-reward pairs. To benchmark the performance of OS-GPTD even in the adversarial setting, where the modeling assumptions are violated, complementary worst-case analyses are performed. The cumulative Bellman error and the long-term reward prediction error are upper bounded relative to their counterparts from a fixed value-function estimator chosen with the entire state-reward trajectory in hindsight. The performance of the novel OS-GPTD approach is evaluated on two benchmark problems.
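The abstract describes estimating a value function online from state-reward pairs by replacing the GP prior with a random-feature approximation. The following is a minimal sketch of that general idea, assuming an RBF kernel approximated by random Fourier features, a Kalman-style recursive update of a Gaussian posterior over feature weights under a TD observation model, and a toy environment; these choices are illustrative assumptions and may differ from the exact OS-GPTD recursion derived in the paper.

```python
# Minimal sketch (not the authors' code): online value-function estimation
# from a stream of state-reward pairs, using random Fourier features to
# approximate a GP prior and a recursive Bayesian update on a TD observation
# model. Kernel, hyperparameters, and environment below are assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Random Fourier features: phi(s)^T phi(s') approximates an RBF kernel.
STATE_DIM, NUM_FEATURES, LENGTHSCALE = 1, 100, 0.5
W = rng.normal(0.0, 1.0 / LENGTHSCALE, size=(NUM_FEATURES, STATE_DIM))
B = rng.uniform(0.0, 2.0 * np.pi, size=NUM_FEATURES)

def phi(s):
    """Map a state to its D-dimensional random-feature vector."""
    return np.sqrt(2.0 / NUM_FEATURES) * np.cos(W @ np.atleast_1d(s) + B)

class RandomFeatureGPTD:
    """Gaussian posterior over weights theta, with V(s) = theta^T phi(s).

    Assumed TD observation model:
        r_t = (phi(s_t) - gamma * phi(s_{t+1}))^T theta + noise.
    Each update costs O(D^2), independent of the trajectory length.
    """

    def __init__(self, num_features, gamma=0.9, noise_var=0.1, prior_var=1.0):
        self.gamma, self.noise_var = gamma, noise_var
        self.mean = np.zeros(num_features)           # posterior mean of theta
        self.cov = prior_var * np.eye(num_features)  # posterior covariance

    def update(self, s, r, s_next):
        """Recursive Bayesian update from one (state, reward, next state)."""
        h = phi(s) - self.gamma * phi(s_next)   # TD "regressor" vector
        ch = self.cov @ h
        gain = ch / (h @ ch + self.noise_var)   # Kalman-style gain
        self.mean = self.mean + gain * (r - h @ self.mean)
        self.cov = self.cov - np.outer(gain, ch)

    def value(self, s):
        """Posterior-mean estimate of the value function at state s."""
        return phi(s) @ self.mean

# Toy usage: a fixed policy inducing uniformly random transitions on [0, 1]
# with reward r = s, so higher states should receive higher value estimates.
agent = RandomFeatureGPTD(NUM_FEATURES)
s = rng.uniform()
for _ in range(5000):
    s_next = rng.uniform()
    agent.update(s, r=s, s_next=s_next)
    s = s_next
print("V(0.2) =", agent.value(0.2), " V(0.8) =", agent.value(0.8))
```

With D random features, the per-step cost and memory of this recursion stay O(D^2) regardless of how many state-reward pairs have been processed, which is the sense in which a random-feature approximation of the GP prior makes online operation scalable.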