Exploration versus Exploitation in Reinforcement Learning: A Stochastic Control Approach

33 Pages Posted: 27 Jan 2019 Last revised: 15 Feb 2019


Haoran Wang

Columbia University - Department of Industrial Engineering and Operations Research (IEOR)

Thaleia Zariphopoulou

University of Texas at Austin - Red McCombs School of Business

Xun Yu Zhou

Columbia University - Department of Industrial Engineering and Operations Research (IEOR)

Date Written: January 15, 2019

Abstract

We consider reinforcement learning (RL) in continuous time and study the problem of achieving the best trade-off between exploration of a black-box environment and exploitation of current knowledge. We propose an entropy-regularized reward function involving the differential entropy of the distributions of actions, and we motivate and devise an exploratory formulation for the feature dynamics that captures repetitive learning under exploration. The resulting optimization problem is a revitalization of classical relaxed stochastic control. We carry out a complete analysis of the problem in the linear-quadratic (LQ) setting and deduce that the optimal control distribution for balancing exploitation and exploration is Gaussian. This in turn interprets and justifies the widely adopted Gaussian exploration in RL, beyond its simplicity for sampling. Moreover, exploitation and exploration are captured, respectively and mutually exclusively, by the mean and variance of the Gaussian distribution. We also find that a more random environment contains more learning opportunities, in the sense that less exploration is needed, other things being equal. As the weight of exploration decays to zero, we prove that the solution to the entropy-regularized LQ problem converges to that of the classical LQ problem. Finally, we characterize the cost of exploration, which is shown to be proportional to the entropy regularization weight and inversely proportional to the discount rate in the LQ case.
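The separation of exploitation (mean) and exploration (variance) can be illustrated with a one-shot toy problem. The sketch below is not the paper's continuous-time formulation: it assumes a static concave quadratic reward r(u) = -a*u^2/2 + b*u with a > 0 and an entropy weight lam, all of which are hypothetical names introduced for illustration. Restricting attention to Gaussian action distributions N(mean, var), the regularized objective E[r(u)] + lam * H is maximized at mean = b/a and var = lam/a, so the mean does not depend on the entropy weight and the variance does not depend on the reward's linear term.

```python
import math


def gaussian_entropy(var):
    """Differential entropy (in nats) of a Gaussian with variance `var`."""
    return 0.5 * math.log(2.0 * math.pi * math.e * var)


def regularized_objective(mean, var, a, b, lam):
    """E[-a*u^2/2 + b*u] + lam * entropy, for u ~ N(mean, var).

    Toy static objective (an assumption of this sketch, not the paper's
    continuous-time value function).
    """
    expected_reward = -0.5 * a * (mean ** 2 + var) + b * mean
    return expected_reward + lam * gaussian_entropy(var)


def optimal_gaussian(a, b, lam):
    """Maximizer over Gaussians, from the first-order conditions.

    d/d(mean): -a*mean + b = 0          -> mean = b / a   (exploitation)
    d/d(var):  -a/2 + lam/(2*var) = 0   -> var  = lam / a (exploration)
    """
    return b / a, lam / a


if __name__ == "__main__":
    a, b = 2.0, 1.0
    # As the entropy weight shrinks, the variance shrinks and the policy
    # concentrates on the classical maximizer b / a.
    for lam in (1.0, 0.1, 0.001):
        mean, var = optimal_gaussian(a, b, lam)
        value = regularized_objective(mean, var, a, b, lam)
        print(f"lam={lam}: mean={mean}, var={var}, objective={value:.4f}")
```

In this toy setting, letting lam go to zero collapses the Gaussian onto the unregularized maximizer b/a, which mirrors, in a much simpler form, the convergence to the classical solution described in the abstract.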

Keywords: Reinforcement Learning, Exploration, Exploitation, Entropy Regularization, Stochastic Control, Relaxed Control, Linear-Quadratic, Gaussian

JEL Classification: C02, C44, C45, C61, G11

Suggested Citation

Wang, Haoran and Zariphopoulou, Thaleia and Zhou, Xunyu, Exploration versus Exploitation in Reinforcement Learning: A Stochastic Control Approach (January 15, 2019). Available at SSRN: https://ssrn.com/abstract=3316387 or http://dx.doi.org/10.2139/ssrn.3316387

Haoran Wang (Contact Author)

Columbia University - Department of Industrial Engineering and Operations Research (IEOR)

331 S.W. Mudd Building
500 West 120th Street
New York, NY 10027
United States

Thaleia Zariphopoulou

University of Texas at Austin - Red McCombs School of Business

Austin, TX 78712
United States

Xunyu Zhou

Columbia University - Department of Industrial Engineering and Operations Research (IEOR)

331 S.W. Mudd Building
500 West 120th Street
New York, NY 10027
United States
