A Generalized Bootstrap Target for Value-Learning, Efficiently Combining Value and Feature Predictions

Abstract

Estimating value functions is a core component of reinforcement learning algorithms. Temporal difference (TD) learning algorithms use bootstrapping, i.e., they update the value function toward a learning target using value estimates at subsequent time-steps. Alternatively, the value function can be updated toward a learning target constructed by separately predicting successor features (SF), a policy-dependent model, and linearly combining them with instantaneous rewards.
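The two value-estimation routes described above can be sketched in a tabular setting. This is a minimal illustration, not the paper's algorithm: the state space, feature map `phi`, step size, and update rules below are all assumed for the example. The TD route bootstraps the value target from the next state's value estimate; the SF route learns successor features by TD on the features themselves and a reward-weight vector by regression, then forms the value as their linear combination.

```python
import numpy as np

# Hypothetical tabular setup (illustrative, not from the paper).
n_states = 5
gamma, alpha = 0.9, 0.1
phi = np.eye(n_states)                 # one-hot features: SFs equal discounted occupancies

V = np.zeros(n_states)                 # direct TD value estimates
psi = np.zeros((n_states, n_states))   # successor-feature estimates, one row per state
w = np.zeros(n_states)                 # reward weights, so r is approximated by w @ phi[s]

def td_update(s, r, s_next):
    """TD(0): bootstrap the value target from the value at the next state."""
    target = r + gamma * V[s_next]
    V[s] += alpha * (target - V[s])

def sf_update(s, r, s_next):
    """SF route: TD update on features for psi, regression on rewards for w."""
    psi[s] += alpha * (phi[s] + gamma * psi[s_next] - psi[s])
    w[:] += alpha * (r - w @ phi[s]) * phi[s]

def sf_value(s):
    """Value as a linear combination of predicted successor features and reward weights."""
    return w @ psi[s]
```

On a transition from state 0 to an unvisited state 1 with reward 1, repeated updates drive both `V[0]` and `sf_value(0)` toward 1, showing that the two targets estimate the same quantity by different decompositions.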

Publication
AAAI Conference on Artificial Intelligence