arxiv:2006.06294

Adaptive Reward-Free Exploration

Published on Jun 11, 2020

Authors:

Michal Valko

Abstract

An adaptive approach for reward-free exploration in reinforcement learning reduces MDP estimation error and improves sample-complexity bounds compared to previous methods.

AI-generated summary

Reward-free exploration is a reinforcement learning setting studied by Jin et al. (2020), who address it by running several algorithms with regret guarantees in parallel. In our work, we instead give a more natural adaptive approach for reward-free exploration which directly reduces upper bounds on the maximum MDP estimation error. We show that, interestingly, our reward-free UCRL algorithm (RF-UCRL) can be seen as a variant of an algorithm of Fiechter from 1994, originally proposed for a different objective that we call best-policy identification. We prove that RF-UCRL needs of order $(SAH^4/\varepsilon^2)(\log(1/\delta) + S)$ episodes to output, with probability $1-\delta$, an $\varepsilon$-approximation of the optimal policy for any reward function. This bound improves over existing sample-complexity bounds in both the small $\varepsilon$ and the small $\delta$ regimes. We further investigate the relative complexities of reward-free exploration and best-policy identification.
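To get a feel for how the stated bound scales, the following is a minimal sketch that evaluates the leading-order expression (S·A·H^4/ε^2)·(log(1/δ) + S). It is not the RF-UCRL algorithm itself. The function name `rf_ucrl_episode_order` is hypothetical, S, A and H are assumed to be the numbers of states, actions and the horizon, and the result omits every constant and lower-order factor hidden by "of order" in the abstract, so it is only useful for comparing regimes, not for predicting actual episode counts.

```python
import math


def rf_ucrl_episode_order(S: int, A: int, H: int, epsilon: float, delta: float) -> float:
    """Leading-order term (S*A*H^4 / epsilon^2) * (log(1/delta) + S).

    Hypothetical helper: constants and lower-order factors are omitted,
    so the value is only meaningful relative to other calls.
    """
    if S <= 0 or A <= 0 or H <= 0:
        raise ValueError("S, A and H must be positive")
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if not 0 < delta < 1:
        raise ValueError("delta must lie strictly between 0 and 1")
    return (S * A * H**4 / epsilon**2) * (math.log(1.0 / delta) + S)


if __name__ == "__main__":
    base = rf_ucrl_episode_order(S=10, A=4, H=5, epsilon=0.1, delta=0.1)
    finer = rf_ucrl_episode_order(S=10, A=4, H=5, epsilon=0.05, delta=0.1)
    # Ratio of the two leading-order terms when epsilon is halved.
    print(finer / base)
```

Under this expression, halving ε multiplies the leading term by four, while shrinking δ only adds to the log(1/δ) term, which is dominated by S until log(1/δ) becomes comparable to the number of states.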

