The Basics of Reinforcement Learning

[math] \DeclareMathOperator*{\dprime}{\prime \prime} \DeclareMathOperator{\Tr}{Tr} \DeclareMathOperator{\E}{\mathbb{E}} \DeclareMathOperator{\N}{\mathbb{N}} \DeclareMathOperator{\R}{\mathbb{R}} \DeclareMathOperator{\Sc}{\mathcal{S}} \DeclareMathOperator{\Ac}{\mathcal{A}} \DeclareMathOperator{\Pc}{\mathcal{P}} \DeclareMathOperator*{\argmin}{arg\,min} \DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator{\sx}{\underline{\sigma}_{\pmb{X}}} \DeclareMathOperator{\sqmin}{\underline{\sigma}_{\pmb{Q}}} \DeclareMathOperator{\sqmax}{\overline{\sigma}_{\pmb{Q}}} \DeclareMathOperator{\sqi}{\underline{\sigma}_{Q,\textit{i}}} \DeclareMathOperator{\sqnoti}{\underline{\sigma}_{\pmb{Q},-\textit{i}}} \DeclareMathOperator{\sqfir}{\underline{\sigma}_{\pmb{Q},1}} \DeclareMathOperator{\sqsec}{\underline{\sigma}_{\pmb{Q},2}} \DeclareMathOperator{\sru}{\underline{\sigma}_{\pmb{R}}^{u}} \DeclareMathOperator{\srv}{\underline{\sigma}_{\pmb{R}}^v} \DeclareMathOperator{\sri}{\underline{\sigma}_{R,\textit{i}}} \DeclareMathOperator{\srnoti}{\underline{\sigma}_{\pmb{R},\textit{-i}}} \DeclareMathOperator{\srfir}{\underline{\sigma}_{\pmb{R},1}} \DeclareMathOperator{\srsec}{\underline{\sigma}_{\pmb{R},2}} \DeclareMathOperator{\srmin}{\underline{\sigma}_{\pmb{R}}} \DeclareMathOperator{\srmax}{\overline{\sigma}_{\pmb{R}}} \DeclareMathOperator{\HH}{\mathcal{H}} \DeclareMathOperator{\HE}{\mathcal{H}(1/\varepsilon)} \DeclareMathOperator{\HD}{\mathcal{H}(1/\varepsilon)} \DeclareMathOperator{\HCKI}{\mathcal{H}(C(\pmb{K}^0))} \DeclareMathOperator{\HECK}{\mathcal{H}(1/\varepsilon,C(\pmb{K}))} \DeclareMathOperator{\HECKI}{\mathcal{H}(1/\varepsilon,C(\pmb{K}^0))} \DeclareMathOperator{\HC}{\mathcal{H}(1/\varepsilon,C(\pmb{K}))} \DeclareMathOperator{\HCK}{\mathcal{H}(C(\pmb{K}))} \DeclareMathOperator{\HCKR}{\mathcal{H}(1/\varepsilon,1/{\it{r}},C(\pmb{K}))} \DeclareMathOperator{\HCKR}{\mathcal{H}(1/\varepsilon,C(\pmb{K}))} \DeclareMathOperator{\HCKIR}{\mathcal{H}(1/\varepsilon,1/{\it{r}},C(\pmb{K}^0))} \DeclareMathOperator{\HCKIR}{\mathcal{H}(1/\varepsilon,C(\pmb{K}^0))} \newcommand{\mathds}{\mathbb}[/math]

Reinforcement learning is an approach to understanding and automating goal-directed learning and decision-making. It leads naturally to algorithms for such problems and is distinguished from other computational approaches by its emphasis on learning by the individual from direct interaction with {their} environment, without relying on exemplary supervision or complete models of the environment. RL uses a formal framework defining the interaction between a learning agent and {their} environment in terms of states, actions, and rewards. This framework is intended to be a simple way of representing essential features of the artificial intelligence problem. These features include a sense of cause and effect, a sense of uncertainty and non-determinism, and the existence of explicit goals. The history of RL has two main threads that were pursued independently before intertwining in modern RL. One thread concerns learning by trial and error and started in the psychology of animal learning. The other thread concerns the problem of optimal control for an evolving system and its solution using value functions and dynamic programming. Although this thread did not involve learning directly, the Bellman equation developed from this line of literature is viewed as the foundation of many important modern RL algorithms such as [math]Q[/math]-learning and Actor-Critic. Over the last few years, RL, grounded on combining classical theoretical results with deep learning and the functional approximation paradigm, has proved to be a fruitful approach to many artificial intelligence tasks from diverse domains. Breakthrough achievements include reaching human-level performance in such complex tasks as Go [1] and multi-player StarCraft II [2], as well as the development of ChatGPT [3]. The generality of the reinforcement learning framework allows its application in both discrete and continuous spaces to solve tasks in both real and simulated environments [4]. Classical RL research during the last third of the previous century developed an extensive theoretical core on which modern algorithms are based. Several algorithms, such as Temporal-Difference Learning and [math]Q[/math]-learning, were developed and are able to solve small-scale problems when either the states of the environment can be enumerated (and stored in memory) or the optimal policy lies in the space of linear or quadratic functions of the state variable. Although these restrictions are extremely limiting, these foundations of classical RL theory underlie the modern approaches which benefit from increased computational power. Combining this framework with deep learning [5] was popularized by the Deep [math]Q[/math]-learning algorithm, introduced in [6], which was able to play any of 57 Atari console games without tweaking the network architecture or algorithm hyperparameters. This novel approach was extensively researched and significantly improved over the following years [7][8][9].

Our aim in this section is to introduce the foundations of reinforcement learning. We start with the setup for MDP in Section with both an infinite time horizon and a finite time horizon, as there are financial applications of both settings in the literature. In Section we then introduce the learning procedure when the reward and transition dynamics of the MDPs are unknown to the decision-maker. In particular, we introduce various criteria to measure the performance of learning algorithms in different settings. Then we focus on two major classes of model-free reinforcement learning algorithms, value-based methods in section value based methods and policy-based methods in section policy based methods, with an infinite-time horizon. Finally, section discussion contains a general discussion of (model-based and model-free) RL with a finite-time horizon, model-based RL with an infinite-time horizon, and regularization methods for RL.

Setup: Markov Decision Processes

We first introduce MDPs with infinite time horizon. Many portfolio optimization problems, such as those arising from pension funds, investigate long-term investment strategies and hence it is natural to formulate them as a decision-making problem over an infinite time horizon. We then introduce MDPs with finite time horizon. Problems such as the optimal execution of the purchase or liquidation of a quantity of asset usually involve trading over a short time horizon and there may be a penalty at the terminal time if the targeted amount is not completely executed by then. Finally we discuss different policy classes that are commonly adopted by the decision-maker.


Infinite Time Horizon and Discounted Reward. Let us start with a discrete time MDP with an infinite time horizon and discounted reward. We have a Markov process taking values in a state space [math]\Sc[/math], where we can influence the evolution by taking an action from a set [math]\Ac[/math]. The aim is to optimize the expected discounted return from the system by choosing a policy, that is a sequence of actions through time. We formulate this mathematically by defining the value function [math]V^\star[/math] for each [math]s\in \Sc[/math] to be

[[math]] \begin{eqnarray} \label{equ: singlev} V^\star(s) = \sup_{\Pi} V^{\Pi}(s):= \sup_{\Pi}\E^{\Pi}\biggl[\sum_{t=0}^{\infty}\gamma^t r(s_t, a_t) \biggl| s_0 = s\biggl], \end{eqnarray} [[/math]]

subject to

[[math]] \begin{eqnarray} \label{equ: singledynamics} s_{t + 1} \sim P(s_t, a_t), \;\;\; a_t \sim \pi_t({s_t}). \end{eqnarray} [[/math]]

Here and throughout the paper, [math]\E^{\Pi}[/math] denotes the expectation under the policy [math]{\Pi}[/math], where the probability measure [math]\mathbb{P}[/math] describes both dynamics and the rewards in the MDP. We will write [math]\mathcal{P}(\mathcal{X})[/math] for the probability measures over a space [math]\mathcal{X}[/math]. The state space [math](\Sc, d_{\Sc})[/math] and the action space [math](\Ac, d_{\Ac})[/math] are both complete separable metric spaces, which includes the case of [math]\Sc[/math] and [math]\Ac[/math] being finite, as often seen in the RL literature; [math]\gamma[/math] [math]\in[/math] [math](0, 1)[/math] is a discount factor; [math]s_t \in \Sc[/math] and [math]a_t \in \Ac[/math] are the state and the action at time [math]t[/math]; [math]P: \Sc \times \Ac \to \Pc(\Sc)[/math] is the transition function of the underlying Markov process; the notation [math] s_{t + 1} \sim P(s_t, a_t)[/math] denotes that [math] s_{t + 1}[/math] is sampled from the distribution [math] P(s_t, a_t)\in \mathcal{P}(\mathcal{S})[/math]; the reward [math]r(s, a)[/math] is a random variable in [math]\R[/math] for each pair [math](s, a)\in \Sc \times \Ac[/math]; and the policy [math]\Pi =\{\pi_t\}_{t=0}^{\infty}[/math] is Markovian, in that it just depends on the current state, and can be either deterministic or randomized. For a deterministic policy, [math]\pi_t: \Sc \to \Ac[/math] maps the current state [math]s_t[/math] to a deterministic action. On the other hand, a randomized policy [math]\pi_t: \Sc\to \Pc(\Ac)[/math] maps the current state [math]s_t[/math] to a distribution over the action space. For a randomized policy [math]\pi_t[/math], we will use [math]\pi_t(s)\in \mathcal{P}(\mathcal{A})[/math] and [math]\pi_t(s,a)[/math] to represent, respectively, the distribution over the action space given state [math]s[/math] and the probability of taking action [math]a[/math] at state [math]s[/math]. In this case, with infinite-time horizon, we assume the reward [math]r[/math] and the transition dynamics [math]P[/math] are time homogeneous, which is a standard assumption in the MDP literature [10]. We also note that the setup is essentially the same if we consider the problem of minimizing costs rather than maximizing a reward.

The well-known Dynamic Programming Principle (DPP), that the optimal policy can be obtained by maximizing the reward from one step and then proceeding optimally from the new state, can be used to derive the following Bellman equation for the value function \eqref{equ: singlev};

[[math]] \begin{eqnarray} \label{equ:classicalV} V^\star(s) = \sup_{a \in \Ac} \Big\lbrace \E[r(s, a)] + \gamma \E_{s' \sim P(s, a)}[V^\star(s')] \Big\rbrace. \end{eqnarray} [[/math]]

We write the value function as

[[math]] V^\star(s) = \sup_{a \in \Ac} Q^\star(s, a), [[/math]]

where the [math]Q[/math]-function, one of the basic quantities used for RL, is defined to be

[[math]] \begin{eqnarray} \label{defsingleQ} Q^\star(s, a) = \E[r(s, a)] + \gamma \E_{s' \sim P(s, a)}[V^\star(s')], \end{eqnarray} [[/math]]

the expected reward from taking action [math]a[/math] at state [math]s[/math] and then following the optimal policy thereafter. There is also a Bellman equation for the [math]Q[/math]-function given by

[[math]] \begin{eqnarray} \label{equ: singleQ} Q^\star(s, a) = \E[r(s, a)] + \gamma \E_{s' \sim P(s, a)}\sup_{a' \in \Ac} Q^\star(s', a'). \end{eqnarray} [[/math]]

This allows us to retrieve the optimal (stationary) policy [math]\pi^*(s, a)[/math] (if it exists) from [math]Q(s, a)[/math], in that [math]\pi^*(s, a) \in \arg\max_{a \in \Ac} Q(s, a)[/math]. An alternative setting over an infinite time horizon is the case of average reward, which is also referred to as an ergodic reward. This ergodic setting is not as relevant to financial applications and hence will not be covered in this review paper. We refer the reader to [11][12] for a more detailed discussion of RL with infinite-time horizon and average reward.


Finite Time Horizon. When there is a finite time horizon [math]T \lt \infty[/math] we no longer discount future values and have a terminal reward. The MDP problem with finite time horizon can be expressed as

[[math]] \begin{eqnarray} \label{equ: singlev_fh} V^\star_t(s) = \sup_{\Pi} V_t^{\Pi}(s):= \sup_{\Pi}\E^{\Pi}\biggl[\sum_{u=t}^{T-1} r_u(s_u, a_u) +r_T(s_T)\biggl| s_t = s\biggl], \;\; \forall s\in \Sc, \end{eqnarray} [[/math]]

subject to

[[math]] \begin{eqnarray} \label{equ: singledynamics_fh} s_{u + 1} \sim P_u(s_u, a_u), \;\;\; a_u \sim \pi_u(s_u),\;\; t \leq u \leq T-1. \end{eqnarray} [[/math]]

As in the infinite horizon case, we denote by [math]s_u \in \Sc[/math] and [math]a_u \in \Ac[/math] the state and the action of the agent at time [math]u[/math]. However, where this differs from the infinite horizon case, is that we allow time-dependent transition and reward functions. Now [math]P_u: \Sc \times \Ac \to \Pc(\Sc)[/math] denotes the transition function and [math]r_u(s, a)[/math] a real-valued random variable for each pair [math](s, a)\in \Sc \times \Ac[/math] for [math]t \leq u \leq T-1[/math]. Similarly [math]r_T(s)[/math], the terminal reward, is a real-valued random variable for all [math]s\in \Sc[/math]. Finally, the Markovian policy [math]\Pi =\{\pi_u\}_{t=0}^T[/math] can be either deterministic or randomized. The corresponding Bellman equation for the value function \eqref{equ: singlev_fh} is defined as

[[math]] \begin{eqnarray} \label{equ:classicalV_fh} V^\star_t(s) = \sup_{a \in \Ac} \Big\lbrace \E[r_t(s, a)] + \E_{s' \sim P_{t}(s, a)}[V^\star_{t+1}(s')] \Big\rbrace, \end{eqnarray} [[/math]]

with the terminal condition [math]V^\star_T(s) = \mathbb{E}[r_{T}(s)][/math]. We write the value function as

[[math]] V^\star_t(s) = \sup_{a \in \Ac} Q^\star_t(s, a), [[/math]]

where the [math]Q^\star_t[/math] function is defined to be

[[math]] \begin{eqnarray} \label{defsingleQ_fh} Q^\star_t(s, a) = \E[r_t(s, a)] + \E_{s' \sim P_t(s, a)}[V^\star_t(s')]. \end{eqnarray} [[/math]]

There is also a Bellman equation for the [math]Q[/math]-function given by

[[math]] \begin{eqnarray} \label{equ: singleQ_fh} Q^\star_t(s, a) = \E[r_t(s, a)] + \E_{s' \sim P_t(s, a)}\left[\sup_{a' \in \Ac} Q^\star_{t+1}(s', a')\right], \end{eqnarray} [[/math]]

with the terminal condition [math]Q^\star_T(s,a) = \mathbb{E}[r_T(s)][/math] for all [math]a\in \mathcal{A}[/math]. We note that since financial time series data are typically non-stationary, time-varying transition kernels and reward functions in \eqref{equ: singlev_fh}-\eqref{equ: singledynamics_fh} are particularly important for financial applications.

For an infinite horizon MDP with finite state and action space, and finite reward [math]r[/math], a further useful observation is that such an MDP always has a stationary optimal policy, whenever an optimal policy exists.

Theorem (Theorem 6.2.7 in [10])

Assume [math]|\mathcal{A}| \lt \infty[/math], [math]|\mathcal{S}| \lt \infty[/math] and [math]|r| \lt \infty[/math] with probability one. For any infinite horizon discounted MDP, there always exists a deterministic stationary policy that is optimal.

Theorem implies that there always exists a fixed policy so that taking actions specified by that policy at each time step maximizes the discounted reward. The agent does not need to change policies with time. There is a similar result for the average reward case, see Theorem 8.1.2 in [10]. This insight reduces the question of finding the best sequential decision-making strategy to the question of finding the best stationary policy. To this end, we write [math]\pi:\mathcal{S}\rightarrow \mathcal{P}(\mathcal{A})[/math] (without a time index) for a stationary policy throughout the paper.


Linear MDPs and Linear Functional Approximation. Linear MDPs have been extensively studied in the recent literature on theoretical RL in order to establish tractable and efficient algorithms. In a linear MDP, both the transition kernels and the reward function are assumed to be linear with respect to some feature mappings [13][14]. In the infinite horizon setting, an MDP is said to be linear with a feature map [math]\phi: \mathcal{S}\times \mathcal{A} \rightarrow \mathbb{R}^d[/math], if there exist [math]d[/math] unknown (signed) measures [math]\mu = (\mu^{(1)},\cdots,\mu^{(d)})[/math] over [math]\mathcal{S}[/math] and an unknown vector [math]\theta \in \mathbb{R}^d[/math] such that for any [math](s,a)\in \mathcal{S}\times \mathcal{A}[/math], we have

[[math]] \begin{eqnarray}\label{eq:linearMDP_infinite} P(\cdot|s,a) = \langle\, \phi(s,a),\mu(\cdot)\,\rangle,\quad r(s,a) = \langle\, \phi(s,a),\theta\,\rangle. \end{eqnarray} [[/math]]


Similarly for the finite horizon setting, an MDP is said to be linear with a feature map [math]\phi: \mathcal{S}\times \mathcal{A} \rightarrow \mathbb{R}^d[/math], if for any [math]0\leq t \leq T[/math], there exist [math]d[/math] unknown (signed) measures [math]\mu_t = (\mu_t^{(1)},\cdots,\mu_t^{(d)})[/math] over [math]\mathcal{S}[/math] and an unknown vector [math]\theta_t \in \mathbb{R}^d[/math] such that for any [math](s,a)\in \mathcal{S}\times \mathcal{A}[/math], we have

[[math]] \begin{eqnarray}\label{eq:linearMDP_finite} P_t(\cdot|s,a) = \langle\, \phi(s,a),\mu_t(\cdot)\,\rangle,\quad r_t(s,a) = \langle\, \phi(s,a),\theta_t\,\rangle, \end{eqnarray} [[/math]]

Typically features are assumed to be known to the agent and bounded, namely, [math]\|\phi(s,a)\|\leq 1[/math] for all [math](s,a)\in \mathcal{S}\times \mathcal{A}[/math]. The linear MDP framework is closely related to the literature on RL with linear functional approximation, where the value function is assumed to be of the form

[[math]] \begin{eqnarray}\label{eq:lfa_infinite} Q(s,a) = \langle\, \psi(s,a), \omega \,\rangle, \quad V(s) = \langle\, \xi(s), \eta \,\rangle \end{eqnarray} [[/math]]

for the infinite horizon case, and

[[math]] \begin{eqnarray}\label{eq:lfa_finite} Q_t(s,a) = \langle\, \psi(s,a), \omega_t \,\rangle, \quad V_t(s) = \langle\, \xi(s), \eta_t \,\rangle, \forall\, 0\leq t \leq T \end{eqnarray} [[/math]]

for the finite horizon case. Here [math]\psi: \mathcal{S}\times \mathcal{A}\rightarrow \mathbb{R}^d[/math] and [math]\xi: \mathcal{S}\rightarrow \mathbb{R}^d[/math] are known feature mappings and [math]\omega[/math], [math]\omega_t[/math] [math]\eta[/math], and [math]\eta_t[/math] are unknown vectors. It has been shown that the linear MDP (i.e., \eqref{eq:linearMDP_infinite} or \eqref{eq:linearMDP_finite}) and the linear functional approximation formulation (i.e., \eqref{eq:lfa_infinite} or \eqref{eq:lfa_finite}) are equivalent under mild conditions [15][16]. In addition to linear functional approximations, we refer the reader to [17] for a general discussion of RL with nonlinear functional approximations and to Section \ref{sec:deep_value_based} for neural network approximation, one of the most popular nonlinear functional approximations used in practice.

Nonlinear Functional Approximation. Compared to linear functional approximations, nonlinear functional approximations do not require knowledge of the kernel functions a priori. Nonlinear functional approximation could potentially address the mis-specification issue when the agent has an incorrect understanding of the functional space that the MDP lies in. The most popular nonlinear functional approximation approach is to use neural networks, leading to deep RL. Thanks to the universal approximation theorem, this is a theoretically sound approach and neural networks show promising performance for a wide range of applications. Meanwhile gradient-based algorithms for certain neural network architectures enjoy provable convergence guarantees. We defer the discussion of deep RL to Deep Value-based RL Algorithms and its financial applications to Applications in Finance.

From MDP to Learning

When the transition dynamics [math]P[/math] and the reward function [math]r[/math] for the infinite horizon MDP problem are unknown, this MDP becomes a reinforcement learning problem, which is to find an optimal stationary policy [math]\pi[/math] (if it exists) while simultaneously learning the unknown [math]P[/math] and [math]r[/math]. The learning of [math]P[/math] and [math]r[/math] can be either explicit or implicit, which leads to model-based and model-free RL, respectively. The analogous ideas hold for the finite horizon case. We introduce some standard RL terminology. A more detailed introduction to RL can be found in textbooks such as [18][19].

Agent-environment Interface. In the RL setting, the learner or the decision-maker is called the agent. The physical world that the agent operates in and interacts with, comprising everything outside the agent, is called the environment. The agent and environment interact at each of a sequence of discrete time steps, [math]t =0,1,2,3,\cdots[/math], in the following way. At the beginning of each time step [math]t[/math], the agent receives some representation of the environment's state, [math]s_t \in \mathcal{S}[/math] and selects an action [math]a_t\in \mathcal{A}[/math]. At the end of this time step, in part as a consequence of its action, the agent receives a numerical reward [math]r_t[/math] (possibly stochastic) and a new state [math]s_{t+1}[/math] from the environment. The tuple [math](s_t,a_t,r_t,s_{t+1})[/math] is called a sample at time [math]t[/math] and [math]h_t :=\{(s_u,a_u,r_u,s_{u+1})\}_{u=0}^t[/math] is referred to as the history or experience up to time [math]t[/math]. An RL algorithm is a finite sequence of well-defined policies for the agent to interact with the environment.


Exploration vs Exploitation. In order to maximize the accumulated reward over time, the agent learns to select her actions based on her past experiences (exploitation) and/or by trying new choices (exploration). Exploration provides opportunities to improve performance from the current sub-optimal solution to the ultimate globally optimal one, yet it is time consuming and computationally expensive as over-exploration may impair the convergence to the optimal solution. Meanwhile, pure exploitation, i.e., myopically picking the current solution based solely on past experience, though easy to implement, tends to yield sub-optimal global solutions. Therefore, an appropriate trade-off between exploration and exploitation is crucial in the design of RL algorithms in order to improve the learning and the optimization performance. See Section for a more detailed discussion.


Simulator. In the procedure described above on how the agent interacts with the environment, the agent does so in an online fashion. Namely, the initial state at time step [math]t+1[/math], [math]s_{t+1}[/math], is the state that the agent moves to after taking action [math]a_t[/math] from state [math]s_t[/math]. This is a challenging setting since an efficient exploration scheme is needed. For instance, [math]Q[/math]-learning with the [math]\varepsilon[/math]-greedy policy (to be introduced in Section) may take exponentially many episodes to learn the optimal policy [20]. Some results assume access to a simulator [21] (a.k.a., a generative model [22]), which allows the algorithm to query arbitrary state-action pairs and return the reward and the next state. This implies the agent can “restart” the system at any time. That is, the initial state at time step [math]t+1[/math] does not need to be the state that agent moves to after taking action [math]a_t[/math] from state [math]s_t[/math]. The “simulator” significantly alleviates the difficulty of exploration, since a naive exploration strategy which queries all state-action pairs uniformly at random already leads to the most efficient algorithm for finding an optimal policy [22].

Randomized Policy vs Deterministic Policy. A randomized policy [math]\pi_t:\Sc \rightarrow \mathcal{P}(\Ac)[/math] is also known in the control literature as a relaxed control and in game theory as a mixed strategy. Despite the existence of a deterministic optimal policy as stated in Theorem for the infinite time horizon case, most of the RL algorithms adopt randomized policies to encourage exploration when RL agents are not certain about the environment.


An Example: Multi-armed bandits. We illustrate these points by discussing multi-armed bandit problems, a special case of RL problems. The multi-armed bandit is a model for a set of slot machines. A simple version is that there are a number of arms, each with a stochastic reward coming from a fixed probability distribution which is initially unknown. We try these arms in some order, which may depend on the sequence of rewards that have been observed so far and attempt to maximize our return. A common objective in this context is to find a policy for choosing the next arm to be tried, under which the sum of the expected rewards comes as close as possible to the ideal reward, i.e., the expected reward that would be obtained if we were to try the “best” arm at all times. Thus the agent interacts with the environment by selecting an action - pulling an arm - and receiving a reward. The policy of the order of pulling on the arms has to balance exploitation, the continued use of an arm that is returning a decent reward, with exploration, sampling arms about which there is less information to find one with a possibly better return. In this setting for a multi-armed bandit problem we have not involved state dynamics. In this case, an admissible policy is simply a distribution over the action space for a randomized policy and a single arm for a deterministic policy. A simple example of a bandit with a state space is the stochastic contextual bandit which is used extensively in the personalized advertisement recommendation and clinical treatment recommendation literature. In this case the state dynamics (also referred to as the context) are sampled i.i.d. from an unknown distribution. For the personalized advertisement recommendation example, the state at time [math]t[/math] can be interpreted as the personal information and purchasing behavior of a targeted client. The problem was first considered in the seminal work of Robbins [23], which derives policies that asymptotically attain an average reward that converges in the limit to the reward of the best arm. The multi-armed bandit problem was later studied in discounted, Bayesian, Markovian, expected reward, and adversarial setups. See Berry and Fristedt [24] for a review of the classical results on the multi-armed bandit problem.

Performance Evaluation

It is not possible to solve MDPs analytically in general and therefore we will discuss numerical algorithms to determine the optimal policies. In order to assess different algorithms we will need a measure of their performance. Here we will consider several types of performance measure: sample complexity, rate of convergence, regret analysis and asymptotic convergence. For RL with a finite time horizon, one episode contains a sequence of states, actions and rewards, which starts at time [math]0[/math] and ends at the terminal time [math]T[/math], and the performance is evaluated in terms of the total number of samples in all episodes.[a] Sometimes we also refer to this setting as episodic RL. For RL with infinite time horizon, the performance is evaluated in terms of the number of steps. In this setting one step contains one (state, action, reward, next state) tuple. It is worth noting that the episodic criterion can also be used to analyze infinite horizon problems. One popular technique is to truncate the trajectory (with infinite steps) at a large time and hence translate the problem into an approximating finite time horizon problem. Examples include REINFORCE [27] and the policy gradient method [28] (see policy based methods). Throughout the analysis, we use the notation [math]\widetilde{O}(\cdot)[/math] to hide logarithmic factors when describing the order of the scaling in [math]\varepsilon[/math], [math]\delta[/math], [math]|\mathcal{A}|[/math] and [math]|\mathcal{S}|[/math] (when these are finite), and other possible model parameters such as the discount rate [math]\gamma[/math] in the infinite horizon case or the dimension [math]d[/math] of the features when functional approximation is used. We write [math]\widetilde{\mathcal{O}}^{\prime}[/math] rather than [math]\widetilde{\mathcal{O}}[/math] to indicate that, although there may be other parameters, we only include the dependence on [math]\varepsilon[/math]. We denote by [math]\mbox{poly}(\cdot)[/math] a polynomial function of its arguments.

Sample Complexity. In RL, a sample [math](s,a,r,s^{\prime})[/math] is defined as a tuple consisting of a state [math]s[/math], an action [math]a[/math], an instantaneous reward received by taking action [math]a[/math] at state [math]s[/math], and the next state [math]s^{\prime}[/math] observed afterwards. Sample complexity is defined as the total number of samples required to find an approximately optimal policy. This evaluation criterion can be used for any kind of RL problem. For episodic RL, an algorithm returns, after the end of the [math](k-1)[/math]-th episode with [math]M_{k-1}[/math] samples used in this episode, a policy [math]\pi_k[/math] to be applied in the [math]k[/math]-th episode. The sample complexity of this algorithm is the minimum number [math]\sum_{k=1}^K M_{k-1}(\varepsilon)[/math] of samples such that for all [math]k \ge K[/math], [math]\pi_k[/math] is [math]\varepsilon[/math]-optimal with probability at least [math]1-\delta[/math], i.e., for [math]k \ge K[/math],

[[math]] \begin{eqnarray}\label{eq:sample_complexity_epi} \mathbb{P}\left(V_0^{\pi_k}\ge V_0^\star-\varepsilon\right) \ge 1-\delta. \end{eqnarray} [[/math]]


For discounted RL, an algorithm returns, after the end of the [math](n-1)[/math]-th step with [math]M_{n-1}[/math] samples used in this iteration, a policy [math]\pi_n[/math] to be applied in the [math]n[/math]-th step. There are several notions of sample complexity in this setting. The most commonly used ones are defined through the value function and the [math]Q[/math]-function with all input variables (i.e., [math]s\in \mathcal{S}[/math] for [math]V[/math] and [math](s,a)\in \mathcal{S}\times \mathcal{A}[/math] for [math]Q[/math]) and they are referred to as the [math]V[/math]-sample complexity and [math]Q[/math]-sample complexity in the literature. The [math]Q[/math]-sample complexity of a given algorithm is defined as the minimum number [math]\sum_{n=1}^N M_{n-1}(\varepsilon)[/math] of samples such that for all [math]n \ge N[/math], [math]Q^{(n)}[/math] is [math]\varepsilon[/math]-close to [math]Q^*[/math], that is

[[math]] \begin{eqnarray}\label{eq:sample_complexity_infi_Q} \|Q^{(n)} - Q^*\|_{\infty}:=\sup_{s\in\mathcal{S},a\in \mathcal{A}}\Big|Q^*(s,a)-Q^{(n)}(s,a)\Big| \leq \varepsilon, \end{eqnarray} [[/math]]

holds either with high probability (that is [math]\mathbb{P}\left(\|Q^{(n)}-Q^*\|_{\infty} \lt \varepsilon\right) \ge 1-\delta[/math]) or in the expectation sense [math]\mathbb{E}[\|Q^{(n)}-Q^*\|_{\infty}] \lt \varepsilon[/math]. Similarly, the [math]V[/math]-sample complexity is defined as the minimum number [math]\sum_{n=1}^N M_{n-1}(\varepsilon)[/math] of samples such that for all [math]n \ge N[/math], [math]V^{(n)}[/math] is [math]\varepsilon[/math]-close to [math]V^*[/math], that is

[[math]] \begin{eqnarray}\label{eq:sample_complexity_infi_V_2} \|V^{(n)} - V^*\|_{\infty}:=\sup_{s\in\mathcal{S}}\left|V^{(n)}(s)-V^*(s)\right|\leq \varepsilon, \end{eqnarray} [[/math]]

holds either with high probability (in that [math]\mathbb{P}\left(\|V^{(n)}-V^*\|_{\infty} \lt \varepsilon\right) \ge 1-\delta[/math]) or in the expectation sense [math]\mathbb{E}[\|V^{(n)}-V^*\|_{\infty}] \lt \varepsilon[/math]. Note that \eqref{eq:sample_complexity_infi_Q} implies \eqref{eq:sample_complexity_infi_V_2} since [math]V^{(n)}(s) = \sup_{a\in \mathcal{A}}Q^{(n)}(s,a)[/math] and [math]V^{*}(s) = \sup_{a\in \mathcal{A}}Q^*(s,a)[/math]. In our analysis we do not distinguish between whether the sample complexity bounds for \eqref{eq:sample_complexity_infi_Q} and \eqref{eq:sample_complexity_infi_V_2} are in the high probability sense or in the expectation sense. The second type of sample complexity, the sample complexity of exploration, is defined as the number of samples such that the non-stationary policy [math]\pi_n[/math] at time [math]n[/math] is not [math]\varepsilon[/math]-optimal for the current state [math]s_n[/math]. This measure counts the number of mistakes along the whole trajectory. Mathematically speaking, the sample complexity of exploration for a given algorithm is the minimum number [math]\sum_{n=1}^N M_{n-1}(\varepsilon)[/math] of samples such that for all [math]n \ge N[/math], [math]\pi_n[/math] is [math]\varepsilon[/math]-optimal when starting from the current state [math]s_n[/math] with probability at least [math]1-\delta[/math], i.e., for [math]n \ge N[/math],

[[math]] \begin{eqnarray}\label{eq:sample_complexity_infi_V} \mathbb{P}\left(\left|V^{\pi_n}(s_n)- V^\star(s_n)\right|\leq \varepsilon\right) \ge 1-\delta. \end{eqnarray} [[/math]]

Note that the condition [math]\mathbb{P}\left(\|V^{(n)}-V^*\|_{\infty} \lt \varepsilon\right) \ge 1-\delta[/math] associated with \eqref{eq:sample_complexity_infi_V_2} is stronger and implies the condition in \eqref{eq:sample_complexity_infi_V}. Another weaker criterion, the mean-square sample complexity, measures the sample complexity in the average sense with respect to the steady state distribution or the initial distribution [math]\mu[/math]. In this case, the mean-square sample complexity is defined as the minimum number [math]\sum_{n=1}^N M_{n-1}(\varepsilon)[/math] of samples such that for all [math]n \ge N[/math],

[[math]] \begin{eqnarray}\label{eq:sample_complexity_infi_V_avg} \|V^{\pi_n} - V^*\|_{\mu}:=\sqrt{\int_{\mathcal{S}}(V^{\pi_n}(s)-V^*(s))^2\mu(ds)} \leq \varepsilon. \end{eqnarray} [[/math]]

Since \eqref{eq:sample_complexity_infi_V_avg} is an aggregated guarantee via the [math]V[/math] function over a state distribution [math]\mu[/math], it is implied by \eqref{eq:sample_complexity_infi_V_2}.

Rate of Convergence. To emphasize the convergence with respect to the number of iterations required, we introduce the notion of rate of convergence, which uses the relationship between the number of iterations/steps [math]N[/math] and the error term [math]\varepsilon[/math], to quantify how fast the learned policy converges to the optimal solution. This is also called the iteration complexity in the RL and optimization literature. Compared to the notion of sample complexity introduced above, the rate of convergence calculates the number of iterations needed to reach a given accuracy while ignoring the number of samples used within each iteration. When only one sample is used in each iteration, the rate of convergence coincides with the sample complexity. In addition when a constant number of samples (i.e., independent from [math]\varepsilon[/math]) are used within each iteration, the rate of convergence and the sample complexity are of the same order with respect to [math]\varepsilon[/math]. In particular, an algorithm is said to converge at a linear rate if [math]N\sim \widetilde{\mathcal{O}}^{\prime}(\log(1/\varepsilon))[/math]. Similarly, an algorithm is said to converge at a sublinear rate (slower than linear) if [math]N\sim \widetilde{\mathcal{O}}^{\prime}(1/\varepsilon^p)[/math] with [math]p\geq 1[/math], and is said to converge at a superlinear (faster than linear) if [math]N\sim \widetilde{\mathcal{O}}^{\prime}(\log(\log(1/\varepsilon))[/math].

Regret Analysis. The regret of a given policy [math]\pi[/math] is defined as the difference between the cumulative reward of the optimal policy and that gathered by [math]\pi[/math]. It quantifies the exploration-exploitation trade-off. For episodic RL with horizon [math]T[/math] in formulation \eqref{equ: singlev_fh}, the regret of an algorithm after [math]K [/math] episodes with [math]M=T\,K[/math] steps is

[[math]] \begin{eqnarray}\label{eq:regret_epi} R(M) = \left\| K \times V_0^{*} - \sum_{k=1}^K\mathbb{E}[V_0^{\pi_k}]\right\|_{\infty} = \sup_{s\in \mathcal{S}} \Big(K \times V_0^{*} - \sum_{k=1}^K\mathbb{E}[V_0^{\pi_k}]\Big). \end{eqnarray} [[/math]]

There is currently no regret analysis for RL problems with infinite time horizon and discounted reward. The natural definition of regret as given above is less meaningful in this case as the cumulative discounted reward is bounded and hence does not scale with [math]T[/math].

Asymptotic Convergence. In addition to sample complexity and regret analysis discussed above, asymptotic convergence is another weaker convergence guarantee, which requires the error term to decay to zero as [math]N[/math] goes to infinity (without specifying the order of [math]N[/math]). This is often a first step in the theoretical analysis of RL algorithms.

Classification of RL Algorithms

An RL algorithm includes one or more of the following components:

  • a representation of a value function that provides a prediction of how good each state or each state/action pair is,
  • a direct representation of the policy [math]\pi(s)[/math] or [math]\pi(s,a)[/math],
  • a model of the environment (the estimated transition function and the estimated reward function) in conjunction with a planning algorithm (any computational process that uses a model to create or improve a policy).

The first two components are related to what is called model-free RL. When the latter component is used, the algorithm is referred to as model-based RL. In the MDP setting, model-based algorithms maintain an approximate MDP model by estimating the transition probabilities and the reward function, and derive a value function from the approximate MDP. The policy is then derived from the value function. Examples include [29][20][30] and [31]. Another line of model-based algorithms make structural assumptions on the model, using some prior knowledge, and utilize the structural information for algorithm design. For example, see [32] for learning linear-quadratic regulators over an infinite time horizon and [33][34] for the finite time horizon case. Unlike the model-based method, model-free algorithms directly learn a value (or state-value) function or the optimal policy without inferring the model. Model-free algorithms can be further divided into two categories, value-based methods and policy-based methods. Policy-based methods explicitly build a representation of a policy and keep it in memory during learning. Examples include policy gradient methods and trust-region policy optimization methods. As an alternative, value-based methods store only a value function without an explicit policy during the learning process. In this case, the policy is implicit and can be derived directly from the value function (by picking the action with the best value). In our discussion of methodology, we focus on model-free RL algorithms for MDP with infinite horizon and discounted reward. In particular, we introduce some classical value-based and policy-based methods in Sections and respectively. For the episodic setting and model-based algorithms, see the discussion in Section.

Value-based Methods

In this section, we introduce the methodologies of several classical value-based methods in the setting of an infinite time horizon with discounting, finite state and action spaces ([math]|\mathcal{A}| \lt \infty[/math] and [math]|\mathcal{S}| \lt \infty[/math]), and stationary policies. The setting with finite state and action spaces is also referred to as the tabular case in the RL literature. For more general setups with an infinite number of actions and states, we refer the reader to Section for a discussion of linear functional approximations and to Section for neural network approximations. Recall that for reinforcement learning, the distribution of the reward function [math]r[/math] and the transition function [math]P[/math] are unknown to the agent. Therefore the expectations in the Bellman equations \eqref{equ:classicalV} and \eqref{equ: singleQ} cannot be calculated directly. On the other hand, the agents can observe samples [math](s,a,r,s^{\prime})[/math] by interacting with the system without a model of the system's dynamics [math]P[/math] or any prior knowledge of [math]r[/math]. The agents can then learn the value function or [math]Q[/math]-function using the samples. Following this idea, we next introduce the classical temporal-difference learning algorithm, with [math]Q[/math]-learning and State-Action-Reward-State-Action (SARSA) as two special cases.

Temporal-Difference Learning

Given some samples [math](s,a,r,s')[/math] obtained by following a policy [math]\pi[/math], the agent can update her estimate of the value function [math]V^{\pi}[/math] (defined in \eqref{equ: singlev}) at the [math](n+1)[/math]-th iteration by

[[math]] \begin{eqnarray}\label{eq:TD_update} V^{\pi,(n+1)}(s) \leftarrow (1-\beta_n(s,a))\underbrace{V^{\pi,(n)}(s)}_{\text{current estimate}} + \beta_n(s,a) \underbrace{[r+\gamma V^{\pi,(n)}(s')]}_{\text{new estimate}}, \end{eqnarray} [[/math]]

with some initialization of the value function [math]V^{\pi,(0)}[/math]. Here [math]\beta_n(s,a)[/math] is the learning rate at iteration [math]n+1[/math] which balances the weighting between the current estimate and the new estimate. The quantity [math]\beta_n[/math] can be a constant, can depend on the current state [math]s[/math], or even depend on the observed samples up to iteration [math]n+1[/math]. Algorithm provides some pseudo-code for an implementation of the TD method. Equation \eqref{eq:TD_update} is sometimes referred to as a TD(0) method. This is a special case of a TD[math](\lambda)[/math] method (for some [math]\lambda \in [0,1][/math]) which is a combination of TD learning (with weight [math]\lambda[/math]) and a Monte Carlo method (with weight [math]1-\lambda[/math]). Here a Monte Carlo method indicates a simulation-based method to calculate the value function with a longer period of data (in the episode-by-episode sense) rather than a single sample in each update. See Section for a detailed discussion of the REINFORCE method which is an example of such a Monte Carlo method. {\color{blue} I still dont understand. A Monte Carlo method is a simulation method for calculating expectations of functions of a stochastic process. Here we have a method, TD, which uses a single sample to update the value function. We could use more samples to update with some average of the samples from a given [math](s,a)[/math] - is this what you mean? Or do we run sample the policy for a longer time period and average these to get an approximation using the definition of [math]V^{\pi}[/math] as an expectation over an infinite discounted sum?} Equation \eqref{eq:TD_update} is also referred to as a one-step TD method as it only uses information from one update step of the system. There are multi-step TD methods; see Chapter 7 and Chapter 12 in [18] for more details. The TD update in \eqref{eq:TD_update} can also be written as

[[math]] \begin{eqnarray}\label{eq:TD_update_v2} V^{\pi,(n+1)}(s) \leftarrow V^{\pi,(n)}(s) + \beta_n(s,a) \underbrace{\left[\overbrace{r+\gamma V^{\pi,(n)}(s')}^{\text{TD target}}-V^{\pi,(n)}(s)\right]}_{\text{TD error $\delta_n$}}. \end{eqnarray} [[/math]]

The term [math]\delta_n[/math], commonly called the TD error, measures the difference between the estimated value at [math]s[/math] and the better estimate [math]r+\gamma V^{\pi,(n)}(s')[/math], which is often called the TD target in the literature. The TD error and TD target are two important components in analyzing the convergence of the approximate value function and arise in various forms in the reinforcement learning literature.

TD(0) Method for estimating $V^{\pi}$
input: total number of iterations [math]N[/math]; the policy [math]\pi[/math] used to sample observations; rule to set learning rate [math]\beta_n \in (0,1][/math] [math](0\leq n \leq N-1)[/math]
Initialize [math]V^{\pi,(0)}(s)[/math] for all [math]s\in \mathcal{S}[/math]
Initialize [math]s[/math]
if [math]n=0,\cdots,N-1[/math] then
Sample action [math]a[/math] according to [math]\pi(s)[/math]
Observe [math]r[/math] and [math]s'[/math] after taking action [math]a[/math]
Update [math]V^{\pi,(n+1)}[/math] according to \eqref{eq:TD_update} with [math](s,a,r,s')[/math]
[math]s\leftarrow s'[/math]
end for


[math]Q[/math]-learning Algorithm

A version of TD learning is [math]Q[/math]-learning. This is a stochastic approximation to the Bellman equation \eqref{equ: singleQ} for the [math]Q[/math]-function with samples observed by the agent. At iteration [math]n[/math] ([math]1 \leq n \leq N-1[/math]), the [math]Q[/math]-function is updated using a sample [math](s,a,r,s^{\prime})[/math] where [math]s' \sim P(s,a)[/math],

[[math]] \begin{eqnarray}\label{eq:Q_learning_update} Q^{(n+1)}(s,a)\leftarrow (1-\beta_n(s,a))\underbrace{Q^{(n)}(s,a)}_{\text{current estimate}}+\beta_n(s,a)\underbrace{\left[r(s,a)+\gamma\max_{a'}Q^{(n)}(s',a')\right]}_{\text{new estimate}}. \end{eqnarray} [[/math]]

Here [math]\beta_n(s,a)[/math] is the learning rate which balances the weight between the current estimate [math]Q^{(n)}[/math] from iteration [math]n[/math] and the new estimate [math]r(s,a)+\gamma\max_{a'}Q^{(n)}(s',a')[/math] calculated using the sample [math](s,a,r,s^{\prime})[/math]. Algorithm provides pseudo-code for an implementation of Q-learning.

$Q$-learning with samples from a policy $\pi$
input: total number of iterations [math]N[/math]; the policy [math]\pi[/math] to be evaluated; rule to set learning rate [math]\beta_n \in (0,1][/math] [math](0 \leq n\leq N-1)[/math]
Initialize [math]Q^{(0)}(s,a)[/math] for all [math]s\in \mathcal{S}[/math] and [math]a\in \mathcal{A}[/math]
Initialize [math]s[/math]
if [math]n=0,\cdots,N-1[/math] then
Sample action [math]a[/math] according to [math]\pi(s)[/math]
Observe [math]r[/math] and [math]s'[/math] after taking action [math]a[/math]
Update [math]Q^{(n+1)}[/math] according to \eqref{eq:Q_learning_update} with sample [math](s,a,r,s')[/math]
[math]s\leftarrow s'[/math]
end for


The following theorem guarantees the asymptotic convergence of the [math]Q[/math]-learning update \eqref{eq:Q_learning_update} to the [math]Q[/math]-function defined in \eqref{defsingleQ}.

Theorem (Watkins and Dayan (1992) [35])

Assume [math]|\mathcal{A}| \lt \infty[/math] and [math]|\mathcal{S}| \lt \infty[/math]. Let [math]n^i(s, a)[/math] be the index of the [math]i[/math]-th time that the action [math]a[/math] is used in state [math]s[/math]. Let [math]R \lt \infty[/math] be a constant. Given bounded rewards [math]|r|\leq R[/math], learning rate [math]0\leq\beta_n \lt 1[/math] and

[[math]] \begin{eqnarray}\label{exploration} \sum_{i=1}^{\infty}\beta_{n^i(s,a)} = \infty,\,\,\sum_{i=1}^{\infty}(\beta_{n^i(s,a)})^2 \lt \infty, \,\,\forall s,a, \end{eqnarray} [[/math]]
then [math]Q^{(n)}(s,a)\rightarrow Q^\star(s,a)[/math] as [math]n\rightarrow \infty[/math], [math]\forall s,a[/math] with probability [math]1[/math].

This is a classical result (one of the first few theoretical results in RL) and the proof of the convergence is based on the action-replay process (ARP), which is an artificial controlled Markov process constructed from the episode sequence and the learning rate sequence [math]\beta[/math]. Theorem implies that eventually [math]Q^{(n)}[/math] will convergence to the true value function [math]Q^\star[/math] when condition \eqref{exploration} is satisfied. However, this asymptotic result does not provide any insights on how many samples are needed to reach a given accuracy. More results on the sample complexity for [math]Q[/math]-learning related algorithms are discussed in Discussion.

SARSA

In contrast to the [math]Q[/math]-learning algorithm, which takes samples from any policy [math]\pi[/math] as the input where these samples could be collected in advance before performing the [math]Q[/math]-learning algorithm, SARSA adopts a policy which is based on the agent's current estimate of the [math]Q[/math]-function. The different source of samples is indeed the key difference between on-policy and off-policy learning, which will be discussed after the SARSA algorithm.

SARSA: On-policy TD Learning
input: total number of iterations [math]N[/math], the learning rate [math]\beta\in(0,1)[/math]\footnotemark[1], and small parameter [math]\varepsilon \gt 0[/math].
Initialize [math]Q^{(0)}(s, a)[/math] for all [math]s\in \mathcal{S}[/math] and [math]a\in \mathcal{A}[/math]
Initialize [math]s\in \mathcal{S}[/math]
Initialize [math]a[/math]: Choose [math]a[/math] when in [math]s[/math] using a policy derived from [math]Q^{(0)}[/math] (e.g, [math]\varepsilon[/math]-greedy)
if [math]n=0,1,\cdots,N-1[/math] then
Take action [math]a[/math], observe reward [math]r[/math] and the next step [math]s'[/math]
Choose [math]a'[/math] when in [math]s'[/math] using a policy derived from [math]Q^{(n)}[/math] (e.g., [math]\varepsilon[/math]-greedy)

[[math]] \begin{equation}\label{eq:SARSA_update} Q^{(n+1)}(s,a) \leftarrow (1-\beta)\underbrace{ Q^{(n)}(s,a)}_{\text{current estimate}} +\beta\underbrace{(r+\gamma Q^{(n)}(s',a'))}_{\text{new estimate}} \end{equation} [[/math]]

[math]s\leftarrow s'[/math], [math]a\leftarrow a'[/math]
end for


[b] The policy derived from [math]Q^{(n)}[/math] using [math]\varepsilon[/math]-greedy (line 4 and line 8 in Algorithm) suggests the following choice of action [math]a[/math] when in state [math]s[/math]:

[[math]] \begin{eqnarray}\label{epsilon_policy} \begin{cases} \text{ select from }\arg\max_{a'} Q^{(n)}(s,a') & \text{with probability } 1-\varepsilon,\\ \text{ uniformly sample from } \mathcal{A} & \text{with probability } \varepsilon. \end{cases} \end{eqnarray} [[/math]]


In Algorithm, SARSA uses the [math]Q^{(n)}[/math] function with an [math]\epsilon[/math]-greedy policy to choose [math]a'[/math] and to perform exploration. In contrast, Q-learning uses the maximum value of [math]Q^{(n)}[/math] over all possible actions for the next step, which is equivalent to setting [math]\varepsilon=0[/math] when choosing [math]a'[/math].

In addition to the [math]\varepsilon[/math]-greedy policy, other exploration methods such as the softmax operator (resulting in a Boltzmann policy) [36][37][38] can also be applied. See Section for a more detailed discussion.


On-policy Learning vs. Off-policy Learning. An off-policy agent learns the value of the optimal policy independently of the agent's actions. An on-policy agent learns the value of the policy being carried out by the agent including the exploration steps. [math]Q[/math]-learning is an off-policy agent as the samples [math](s,a,r,s')[/math] used in updating \eqref{eq:Q_learning_update} may be collected from any policy and may be independent of the agent's current estimate of the [math]Q[/math]-function. The convergence of the [math]Q[/math]-function relies on the exploration scheme of the policy [math]\pi[/math] (see Theorem). In contrast to [math]Q[/math]-learning, SARSA is an on-policy learning algorithm and uses only its own estimation of the [math]Q[/math]-function to select an action and receive a real-time sample in each iteration. The difference is seen in how the new estimate in the update is calculated. In the update step of [math]Q[/math]-learning \eqref{eq:Q_learning_update}, we have [math]\max_{a'} Q^{(n)}(s',a')[/math] which is the maximum value of the [math]Q[/math]-function at a given state [math]s'[/math]. On the other hand, the new estimate in the update of SARSA \eqref{eq:SARSA_update} takes [math]Q^{(n)}(s',a')[/math] where [math]a'[/math] is selected based on a policy derived from [math]Q^{(n)}[/math] (via the [math]\varepsilon[/math]-greedy policy \eqref{epsilon_policy}).

Discussion

In this section, we provide a brief overview of sample complexity results for model-free and value-based RL with infinite-time horizon and discounted reward. Sample complexity results for the model-based counterpart are deferred to Section. There has been very little non-asymptotic analysis of TD(0) learning. Existing results have focused on linear function approximations. For example, [39] showed that the mean-squared error converges at a rate of [math]\widetilde{\mathcal{O}}^{\prime}(\frac{1}{\varepsilon^{1/\sigma}})[/math] with decaying learning rate [math]\beta_n= \frac{1}{(1+n)^{\sigma}}[/math] for some [math]\sigma\in (0,1)[/math] where i.i.d. samples are drawn from a stationary distribution; [40] provided an improved [math]\widetilde{\mathcal{O}}^{\prime}(\frac{1}{\varepsilon})[/math] bound for iterate-averaged TD(0) with constant step-size; and [41] reached the same mean-square sample complexity [math]\widetilde{\mathcal{O}}^{\prime}(\frac{1}{\varepsilon})[/math] with a simpler proof and extends the framework to the case where non-i.i.d. but Markov samples are drawn from the online trajectory. A similar analysis was applied by [42] to SARSA and [math]Q[/math]-learning algorithms for a continuous state space using linear function approximation (see the end of Section).

The [math]Q[/math]-sample complexity for the [math]Q[/math]-learning algorithm of [math]\widetilde{\mathcal{O}}\left(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^5\varepsilon^{5/2}}\mbox{poly}(\log \delta^{-1})\right)[/math] was first established in [43] with decaying learning rate [math]\beta_n = \frac{1}{(1+n)^{4/5}}[/math] and access to a simulator. This was followed by [44] which showed that the same order of [math]Q[/math]-sample complexity [math]\widetilde{\mathcal{O}}\left(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^5\varepsilon^{5/2}}\right)[/math] could be achieved with a constant learning rate [math]\beta_n \equiv \frac{(1-\gamma)^4\varepsilon^2}{|\mathcal{A}||\mathcal{S}|}[/math] [math](n\leq N)[/math]. Without access to a simulator, delayed [math]Q[/math]-learning [45] was shown to achieve a sample complexity of exploration of order [math]\widetilde{\mathcal{O}}\left(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^8\varepsilon^{4}}\mbox{poly}(\log \delta^{-1})\right)[/math] assuming a known reward. An improvement to this result to [math]\widetilde{\mathcal{O}}\left(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^7\varepsilon^{4}}\mbox{poly}(\log \delta^{-1})\right)[/math] has recently been obtained using an Upper-Confidence-Bound (UCB) exploration policy [46] and a single trajectory.


In addition, sample complexities of [math]Q[/math]-learning variants with (linear) function approximations have been established in several recent papers. [16] assumed a linear MDP framework (see the end of Section) and the authors provided a V-sample complexity [math]\widetilde{\mathcal{O}}\left(\frac{d}{(1-\gamma)^3\varepsilon^{2}}\mbox{poly}(\log \delta^{-1})\right)[/math] with [math]d[/math] denoting the dimension of the features. This result matches the information-theoretic lower bound up to [math]\log(\cdot)[/math] factors. Later work, [47] considered the setting where the transition kernel can be approximated by linear functions with small approximation errors to make the model more robust to model mis-specification. In this case, the authors provided a V-sample complexity of order [math]\widetilde{\mathcal{O}}\left(\frac{K}{(1-\gamma)^4\varepsilon^{2}}\mbox{poly}(\log \delta^{-1})\right)[/math]. Although the dependence on [math]\gamma[/math] is not as good as in [16], this framework can be applied to a more general class of models which are “near” linear. Although [math]Q[/math]-learning is one of the most successful algorithms for finding the best action-value function [math]Q[/math], its implementation often suffers from large high biased estimates of the [math]Q[/math]-function values incurred by random sampling. The double [math]Q[/math]-learning algorithm proposed in [48] tackled this high estimation issue by randomly switching the update between two [math]Q[/math]-estimators, and has thus gained significant popularity in practice. The first sample complexity result for the double [math]Q[/math]-learning algorithm was established in [49] with a Q-sample complexity on the order of [math]\widetilde{\mathcal{O}}\left(\left(\frac{1}{(1-\gamma)^6\varepsilon^2}\ln\,\frac{|\mathcal{A}||\mathcal{S}|}{(1-\gamma)^7\varepsilon^2\delta}\right)^{1/\omega}+\left(\frac{1}{1-\gamma}\ln\,\frac{1}{(1-\gamma)^2\varepsilon}\right)^{1/(1-\omega)}\right)[/math] and learning rate [math]\beta_n = \frac{1}{n^{w}}[/math] for some constant [math]\omega \in (0,1)[/math].


Policy-based Methods

So far we have focused on value-based methods where we learn the value function to generate the policy using, for instance, the [math]\varepsilon[/math]-greedy approach. However, these methods may suffer from high computational complexity as the dimensionality of the state or action space increases (for example in continuous or unbounded spaces). In this case, it may be more effective to directly parametrize the policy rather than the value function. Furthermore, it is empirically observed that policy-based methods have better convergence properties than value-based methods [50][51][52]. This stems from the fact that value-based methods can have big oscillations during the training process since the choice of actions may need to change dramatically in order to have an arbitrarily small change in the estimated value functions. In this section, we focus on model-free policy-based methods, where we learn a parametrized policy without inferring the value function in the setting of infinite time horizon with discounting, finite state and action spaces, and stationary policies. Examples of policy-based methods include REINFORCE [53], Actor-Critic methods [54], Trust Region Policy Optimization (TRPO) [55], Proximal Policy Optimization (PPO) [56] among others. We first parametrize the policy [math]\pi[/math] by a vector [math]\theta[/math], thus we define [math]\pi(s,\cdot;\theta)\in \mathcal{P}(\mathcal{A})[/math], the probability distribution parameterized by [math]\theta[/math] over the action space given the state [math]s[/math] at time [math]t[/math]. Here we will use [math]\pi(s,a;\theta)[/math] and [math]\pi_{\theta}[/math] interchangeably to represent the policy parameterized by [math]\theta[/math]. We write [math]\mu^{\pi_{\theta}}[/math] for the stationary distribution of the state under policy [math]\pi(s,a;\theta)[/math]. Then the policy objective function is defined to be

[[math]] J(\theta):=\int_{\mathcal{S}}V^{\pi_{\theta}}(s)\mu^{\pi_{\theta}}(ds) [[/math]]

for RL with infinite time horizon, or

[[math]] J(\theta):=\int_{\mathcal{S}}V_0^{\pi_{\theta}}(s)\mu^{\pi_{\theta}}(ds) [[/math]]

for RL with finite time horizon. In order to maximize this function, the policy parameter [math]\theta[/math] is updated using the gradient ascent rule:

[[math]] \begin{equation}\label{eqn:grad_ascent} \theta^\prime = \theta + \beta\,\widehat{\nabla_{\theta}J(\theta)}, \end{equation} [[/math]]

where [math]\beta[/math] is the learning rate, and [math]\widehat{\nabla_{\theta}J(\theta)}[/math] is an estimate of the gradient of [math]J(\theta)[/math] with respect to [math]\theta[/math].

Policy Gradient Theorem

Our task now is to estimate the gradient [math]\nabla_{\theta}J(\theta)[/math]. The objective function [math]J(\theta)[/math] depends on both the policy and the corresponding stationary distribution [math]\mu^{\pi_{\theta}}[/math], and the parameter [math]\theta[/math] affects both of them. Calculating the gradient of the action with respect to [math]\theta[/math] is straightforward given the parametrization [math]\pi(s,a;\theta)[/math], whereas it might be challenging to analyze the effect of [math]\theta[/math] on the state when the system is unknown. Fortunately the well-known policy gradient theorem provides an expression for [math]\nabla_{\theta}J(\theta)[/math] which does not involve the derivatives of the state distribution with respect to [math]\theta[/math]. We write the [math]Q[/math]-function under policy [math]\pi(s,a;\theta)[/math] as [math]Q^{\pi_{\theta}}(s,a)[/math], then we have the following theorem (for more details see [18]).

Theorem (Policy Gradient Theorem [57])

Assume that [math]\pi(s,a;\theta)[/math] is differentiable with respect to [math]\theta[/math] and that there exists [math]\mu^{\pi_{\theta}}[/math], the stationary distribution of the dynamics under policy [math]\pi_{\theta}[/math], which is independent of the initial state [math]s_0[/math]. Then the policy gradient is

[[math]] \begin{equation}\label{eqn:policy_gra_thm} \nabla_{\theta}J(\theta) = \mathbb{E}_{s\sim \mu^{\pi_{\theta}},a\sim \pi_{\theta}}[\nabla_{\theta}\ln\pi(s,a;\theta)Q^{\pi_{\theta}}(s,a)]. \end{equation} [[/math]]

For MDPs with finite state and action space, the stationary distribution exists under mild assumptions. For MDPs with infinite state and action space, more assumptions are needed, for example uniform geometric ergodicity [58].

REINFORCE: Monte Carlo Policy Gradient

Based on the policy gradient expression in \eqref{eqn:policy_gra_thm}, it is natural to estimate the expectation by averaging over samples of actual rewards, which is essentially the spirit of a standard Monte Carlo method that repeatedly simulates a trajectory of length [math]M[/math] within each iteration. Now we introduce our first policy-based algorithm, called REINFORCE [53] or Monte Carlo policy gradient (see Algorithm). We use a simple empirical return function [math]G_t[/math] to approximate the [math]Q[/math]-function in \eqref{eqn:policy_gra_thm}. The return [math]G_t[/math] is defined as the sum of the discounted rewards and given by [math]G_t:=\sum_{s=t+1}^{M} \gamma^{s-t-1}r_{s}[/math]. Note that REINFORCE is a Monte Carlo algorithm since it employs the complete sample trajectories, that is, when estimating the return from time [math]t[/math], it uses all future rewards up until the end of the trajectory (see line 7 in Algorithm). Within the [math]n[/math]-iteration the policy parameter [math]\theta^{(n+1)}[/math] is updated [math]M[/math] times with [math]G_t[/math] for [math]t=0,1,\cdots,M-1[/math]. As a standard Monte Carlo method, REINFORCE provides an unbiased estimate of the policy gradient, however it suffers from high variance which therefore may lead to slow convergence. A popular variant of REINFORCE is to subtract a baseline, which is a function of the state [math]s[/math], from the return [math]G_t[/math]. This reduces the variance of the policy gradient estimate while keeping the mean of the estimate unchanged. A commonly used baseline is an estimate of the value function, [math]\widehat{V}(s)[/math], which can be learnt using one of the methods introduced in Section. Replacing the [math]G_t[/math] in \eqref{eqn:alg_REINFORCE_update} by [math]G_t-\widehat{V}(s)[/math] gives a REINFORCE algorithm with baseline.

REINFORCE: Monte Carlo Policy Gradient
input: total number of iterations [math]N[/math], learning rate [math]\beta[/math], a differentiable policy parametrization [math]\pi(s,a;\theta)[/math], finite length of the trajectory [math]M[/math].
Initialize policy parameter [math]\theta^{(0)}[/math].
if [math]n=0,1,\ldots,N-1[/math] then
Sample a trajectory [math]s_0,a_0,r_1,\cdots,s_{T-1},a_{T-1},r_T\sim\pi(s,a;\theta^{(n)})[/math]
Initialize [math]\theta^{(n+1)} = \theta^{(n)}[/math]
if [math]t=0,\cdots,M-1[/math] then
Calculate the return: [math]G_t=\sum_{s=t+1}^{M} \gamma^{s-t-1}r_{s}[/math]
Update the policy parameter

[[math]] \begin{equation}\label{eqn:alg_REINFORCE_update} \theta^{(n+1)} \leftarrow \theta^{(n+1)} + \beta\gamma^t G_t\nabla_{\theta}\ln\pi(s_t,a_t;\theta^{(n+1)}) \end{equation} [[/math]]

end for
end for


Actor-Critic Methods

The fact that REINFORCE provides unbiased policy estimates but may suffer from high variance is an example of the bias-variance dilemma (see, e.g., [59] and [60]) which occurs in many machine learning problems. Also, as a Monte Carlo algorithm, REINFORCE requires complete trajectories so we need to wait until the end of each episode to collect the data. Actor-Critic methods can resolve the above issues by learning both the value function and the policy. The learned value function can improve the policy update through, for example, introducing bias into the estimation. Actor-Critic methods can also learn from incomplete experience in an online manner. In Actor-Critic methods the value function is parametrized by [math]w[/math] and the policy is parametrized by [math]\theta[/math]. These methods consist of two models: a critic which updates the value function or [math]Q[/math]-function parameter [math]w[/math], and an actor which updates the policy parameter [math]\theta[/math] based on the information given by the critic. A natural idea is to use policy-based methods for the actor and use one of the methods introduced in Section for the critic; for example SARSA or [math]Q[/math]-learning. We give the pseudocode for a simple Actor-Critic method in Algorithm. Here for the critic, we approximate the [math]Q[/math]-function by a linear combination of features, that is, [math]Q(s,a;w)=\phi(s,a)^\top w[/math] for some known feature [math]\phi:\mathcal{S}\times \mathcal{A}\rightarrow\mathbb{R}^d[/math], and use TD(0) to update the parameter [math]w[/math] to minimize the least-square error of the TD error with the gradient of the least-square error taking the form [math]\delta \phi(s,a)[/math] (see lines 9-10 in Algorithm). For the actor, we use the (vanilla) policy-based method to update the policy parameter [math]\theta[/math] (see line 11 in Algorithm). There are three main ways to execute the algorithm. In the nested-loop setting (see, e.g. [61][62]), the actor updates the policy in the outer loop after the critic's repeated updates in the inner loop. The second way is the two time-scale setting (see, e.g. [63]), where the actor and the critic update their parameters simultaneously with different learning rates. Note that the learning rate for the actor is typically much smaller than that of the critic in this setting. In the third way, the single-scale setting, the actor and the critic update their parameters simultaneously but with a much larger learning rate for the actor than for the critic (see, e.g. [64][65]).

Actor-Critic Algorithm
input: A differentiable policy parametrization [math]\pi(s,a;\theta)[/math], a differentiable [math]Q[/math]-function parameterization [math]Q(s,a;w)[/math], learning rates [math]\beta^{\theta} \gt 0[/math] and [math]\beta^{w} \gt 0[/math], number of iterations [math]N[/math]
Initialize policy parameter [math]\theta^{(0)}[/math] and [math]Q[/math]-function parameter [math]w^{(0)}[/math]
Initialize [math]s[/math]
if [math]n=0,1,\ldots,N-1[/math] then
Sample [math]a\sim\pi(s,a;\theta^{(n)})[/math]
Take action [math]a[/math], observe state [math]s^\prime[/math] and reward [math]r[/math]
Sample action [math]a^\prime\sim\pi(s^\prime,\cdot;\theta^{(n)})[/math]
Compute the TD error: [math]\delta\leftarrow r+\gamma Q(s^\prime,a^\prime;w^{(n)})-Q(s,a;w^{(n)})[/math]
Update [math]w[/math]: [math]w^{(n+1)}\leftarrow w^{(n)}+\beta^{w}\delta\phi(s,a)[/math]
[math]\theta^{(n+1)}\leftarrow \theta^{(n)} + \beta^{\theta}Q(s,a;w^{(n)})\nabla_{\theta}\ln\pi(s,a;\theta^{(n)})[/math]
[math]s\leftarrow s^\prime[/math], [math]a\leftarrow a^\prime[/math]
end for


Discussion

Variants of Policy-based Methods. There have been many variations of policy-based methods with improvements in various directions. For example, the natural policy-based algorithms [66][67] modify the search direction of the vanilla version by involving a Fisher information matrix in the gradient ascent updating rule, which speeds the convergence significantly. To enhance the stability of learning, we may request the updated policy estimate to be not too far away from the previous estimate. Along this line, TRPO [55] uses the Kullback-Leibler (KL)-divergence to measure the change between the consecutive updates as a constraint, while PPO [56] incorporate an adaptive KL-penalty or a clipped objective in the objective function to eliminate the constraint. When using the KL-penalty, PPO can be viewed as a Lagrangian relaxation of the TRPO algorithm. To reduce the sample complexity, Deterministic Policy Gradient (DPG) [68] uses a deterministic function (rather than the stochastic policy [math]\pi[/math]) to represent the policy to avoid sampling over actions (although enough exploration needs to be guaranteed in other ways); Actor-Critic with Experience Replay (ACER) [69] reuses some experience (tuples of data collected in previous iterations/episodes, see the discussion of Deep [math]Q[/math]-Networks in Section) which improves the sample efficiency and decreases the data correlation. In addition we mention Asynchronous Advantage Actor-Critic (A3C) and Advantage Actor-Critic (A2C), two popular Actor-Critic methods with a special focus on parallel training [70]. The latter is a synchronous, deterministic version of the former. For continuous-time framework, see developments in [71][72].

Convergence Guarantees. Now we discuss the convergence guarantees of policy-based algorithms. [57] provided the first asymptotic convergence result for policy gradient methods with arbitrary differentiable function approximation for MDPs with bounded reward. They showed that such algorithms, including REINFORCE and Actor-Critic methods, converge to a locally stationary point. For more examples of asymptotic convergence, see for example [54] and [73]. Based on recent progress in non-convex optimization, non-asymptotic analysis of policy-based methods were first established for convergence to a stationary point. For example, [61] provided a convergence rate analysis for a nested-loop Actor-Critic algorithm to the stationary point through quantifying the smallest number of actor updates [math]k[/math] required to attain [math]\inf_{0\leq m\leq k}\|\nabla J(\theta^{(k)})\|^2 \lt \varepsilon[/math]. We denote this smallest number as [math]K[/math]. When the actor uses a policy gradient method, the critic achieves [math]K\leq \widetilde{\mathcal{O}}^{\prime}(1/\varepsilon^4)[/math] by employing TD(0), [math]K\leq \widetilde{\mathcal{O}}^{\prime}(1/\varepsilon^{3})[/math] by employing the Gradient Temporal Difference, and [math]K\leq \widetilde{\mathcal{O}}^{\prime}(1/\varepsilon^{5/2})[/math] by employing the Accelerated Gradient Temporal Difference, with continuous state and action spaces. [74] investigated a policy gradient method with Monte Carlo estimation of the policy gradient, called the RPG algorithm, where the rollout length of trajectories is drawn from a Geometric distribution. This generates unbiased estimates of the policy gradient with bounded variance, which was shown to converge to the first order stationary point with diminishing or constant learning rate at a sublinear convergence rate. They also proposed a periodically enlarged learning rate scheme and proved that the modified RPG algorithm with this scheme converges to the second order stationary point in a polynomial number of steps. For more examples of sample complexity analysis for convergence to a stationary point, see for example [75][76][77][78][79]. The global optimality of stationary points was studied in [80] where they identified certain situations under which the policy gradient objective function has no sub-optimal stationary points despite being non-convex. Recently there have been results on the sample complexity analysis of the convergence to the global optimum. [81] proved that the entropy regularized natural policy gradient method achieves a [math]Q[/math]-sample complexity (and [math]\varepsilon[/math]-optimal policies) with linear convergence rate. [82] showed that with high probability at least [math]1-\delta[/math], the TRPO algorithm has sublinear rate [math] \widetilde{\mathcal{O}}^{\prime}(1/\varepsilon^2)[/math] with mean-squared sample complexity [math]\widetilde{\mathcal{O}}\big(\frac{|\mathcal{A}|^2(|\mathcal{S}|+\log(1/\delta))}{(1-\gamma)^3\varepsilon^4}\big)[/math] in the unregularized (tabular) MDPs, and has an improved rate [math]\widetilde{\mathcal{O}}^{\prime}(1/\varepsilon)[/math] with mean-squared sample complexity [math]\widetilde{\mathcal{O}}\big(\frac{|\mathcal{A}|^2(|\mathcal{S}|+\log(1/\delta))}{(1-\gamma)^4\varepsilon^3}\big)[/math] in MDPs with entropy regularization. [63] gave the first non-asymptotic convergence guarantee for the two time-scale natural Actor-Critic algorithms with mean-squared sample complexity of order [math]\widetilde{\mathcal{O}}(\frac{1}{(1-\gamma)^9\varepsilon^{4}})[/math]. For single-scale Actor-Critic methods, the global convergence with sublinear convergence rate was established in both [64] and [65]. The non-asymptotic convergence of policy-based algorithms is shown in other settings, see [27] for the regret analysis of the REINFORCE algorithm for discounted MDPs and [83][84] for policy gradient methods in the setting of known model parameters.

General Discussion

So far we have focused on model-free RL with discounting over an infinite time horizon. In this section, we discuss two other cases: RL over a finite time horizon with both model-based and model-free methods, and model-based RL with an infinite time horizon.

RL with Finite Time Horizon

As discussed earlier, episodic RL with finite time horizon has also been widely used in many financial applications. In this section, we discuss some episodic RL algorithms (with both model-based and model-free methods) and their performance through both sample complexity and regret analysis. [85] proposed a model-based algorithm, known as Posterior Sampling for Reinforcement Learning (PSRL) which is a model-based algorithm, and establishes an [math]\widetilde{\mathcal{O}}(T|\mathcal{S}|\sqrt{|\mathcal{A}|M})[/math] expected regret bound. [25] proposed a so-called Upper-Confidence Fixed-Horizon (UCFH) RL algorithm that is model-based and has [math]V[/math]-sample complexity[c] of order [math]\widetilde{\mathcal{O}}(\frac{|\mathcal{S}|^2|\mathcal{A}|\,T^3}{\varepsilon^2})[/math] assuming known rewards. [86] considered two model-based algorithms in the setting of known reward, called the UCBVI-CH and UCBVI-BF algorithms, which were proved to achieve a regret bound of [math]\widetilde{\mathcal{O}}(T\sqrt{|\mathcal{S}|\,|\mathcal{A}|M})[/math] (when [math]M\geq T|\mathcal{S}|^3|\mathcal{A}|[/math] and [math]|\mathcal{S}||\mathcal{A}|\geq T[/math]) and [math]\widetilde{\mathcal{O}}(\sqrt{T|\mathcal{S}|\,|\mathcal{A}|M})[/math] (when [math]M\geq T^3|\mathcal{S}|^3|\mathcal{A}|[/math] and [math]|\mathcal{S}||\mathcal{A}|\geq T[/math]), respectively. [26] proposed an Upper Bounding the Expected Next State Value (UBEV) algorithm which achieves a sample complexity\footnotemark[2] of [math]\widetilde{\mathcal{O}}\big(\frac{|\mathcal{S}|^2|\mathcal{A}|\,T^5}{\varepsilon^2}{\rm polylog}(\frac{1}{\delta})\big)[/math]. They also proved that the algorithm has a regret bound of [math]\widetilde{\mathcal{O}}(T^2\sqrt{|\mathcal{S}|\,|\mathcal{A}|M})[/math] with [math]M\ge |\mathcal{S}|^5|\mathcal{A}|^3[/math]. [87] proved that [math]Q[/math]-learning with UCB exploration achieves regret of [math]\widetilde{\mathcal{O}}(\sqrt{T^3|\mathcal{S}|\,|\mathcal{A}|M})[/math] without requiring access to a simulator. The above regret bounds depend on the size of the state and action space and thus may suffer from the curse of dimensionality as [math]|\mathcal{S}|[/math] and [math]|\mathcal{A}|[/math] increase. To tackle this issue, [88] learned a low-dimensional representation of the probability transition model and proved that their MatrixRL algorithm achieves a regret bound of [math]\widetilde{\mathcal{O}}(T^2\,d\sqrt{M})[/math] which depends on the number of features [math]d[/math] rather than [math]|\mathcal{S}|[/math] and [math]|\mathcal{A}|[/math]. [15] provided a [math]\widetilde{\mathcal{O}}(\sqrt{d^3T^3M})[/math] regret bound with high probability for a modified version of the Least-Squares Value Iteration (LSVI) algorithm with UCB exploration.

Model-based RL with Infinite Time Horizon

To derive policies by utilizing estimated transition probabilities and reward functions under the infinite time horizon setting, [29] proposed an R-MAX algorithm which can also be applied in zero-sum stochastic games. The R-MAX algorithm was proved to achieve a sample complexity of exploration of order [math]\widetilde{\mathcal{O}}\big(\frac{|\mathcal{S}|^2|\mathcal{A}|}{(1-\lambda)^6\varepsilon^3}\big)[/math] by [89]. The Model-based Interval Estimation (MBIE) in [90] was proved to achieve a sample complexity of exploration of the same order as R-MAX. [31] further modified the R-MAX algorithm to an MoRmax algorithm with an improved sample complexity of exploration of order [math]\widetilde{\mathcal{O}}\big(\frac{|\mathcal{S}|\,|\mathcal{A}|}{(1-\lambda)^6\varepsilon^2}\big)[/math]. [91] proved that the model-based [math]Q[/math]-Value Iteration (QVI) achieves the [math]Q[/math]-sample complexity of [math]\widetilde{\mathcal{O}}(\frac{|\mathcal{S}|\,|\mathcal{A}|}{(1-\gamma)^3\varepsilon^2)})[/math] for some small [math]\varepsilon[/math] with high probability at least [math]1-\delta[/math]. [92] studied the approximate MDP model with any accurate black-box planning algorithm and showed that this has the same sample complexity as [91], but with a larger range of [math]\varepsilon[/math]. See also [45][20][30] for model-based RL along this line. Linear-quadratic (LQ) problems are another type of model-based RL problem that assumes linear state dynamics and quadratic objective functions with continuous state and action space. These problems are one of the few optimal control problems for which there exists a closed-form analytical representation of the optimal feedback control. Thus LQ frameworks, including the (single-agent) linear-quadratic regulator (LQR) and (multi-agent) LQ games, can be used in a wide variety of applications, such as the optimal execution problems which we will discuss later. For a theoretical analysis of RL algorithms within the LQ framework, see for example [32] and [33] for the global linear convergence of policy gradient methods in LQR problems with infinite and finite horizon, respectively. See also [34] and [93] for the regret analysis of linear-quadratic reinforcement learning algorithms in continuous time.

Exploration Exploitation Trade-offs

As mentioned earlier, exploitation versus exploration is a critical topic in RL. RL agents want to find the optimal solution as fast as possible, meanwhile committing to solutions too fast without enough exploration could lead to local minima or total failure. Modern RL algorithms that optimize for the best returns can achieve good exploitation quite efficiently, while efficient exploration remains a challenging and less understood topic. For multi-armed bandit problems or MDP with finite state and action spaces, classic exploration algorithms such as [math]\varepsilon[/math]-greedy, UCB, Boltzmann exploration and Thompson sampling show promising performance in different settings. For the [math]\varepsilon[/math]-greedy algorithm [94][95], the agent randomly explores occasionally with probability [math]\varepsilon[/math] and takes the optimal action most of the time with probability [math]1-\varepsilon[/math]. For example, see Equation \eqref{epsilon_policy} for the [math]\varepsilon[/math]-greedy method combined with a value-based algorithm. The UCB type of algorithms are mainly designed for value-based algorithms [46][86][87][15]. The agent selects the greediest action to maximize the upper confidence bound [math]Q(s,a) + U(s,a)[/math], where [math]Q(s,a)[/math] is the Q value function and [math]U(s,a)[/math] is a function inversely proportional to how many times the state-action pair [math](s,a)[/math] has been taken (hence proportional to the agent's confidence in her estimation of the Q function). For Boltzmann exploration [36][37][38], the agent draws actions from a Boltzmann distribution (softmax) over the learned Q value function, regulated by a temperature parameter. For the Thompson sampling method [96][97][98], the agent keeps track of a belief of the optimal action via a distribution over the set of admissible actions (for each given state).

For deep RL training when neural networks are used for function approximation, entropy regularization (adding an entropy term into the loss function) and noise-based exploration (adding noise into the observation, action or even parameter space) are often used for better exploration of the parameter space.

Regularization in Reinforcement Learning

Many recent developments in RL benefit from regularization, which has been shown to improve not only the exploration but also the robustness. Examples include both policy-based methods and value-based methods in various settings. Here we provide a general overview of different regularization techniques and their advantages along with several examples. For example, TRPO and PPO use the KL-divergence between two consecutive policies as a penalty in policy improvement [55][56]. Soft-[math]Q[/math]-learning uses Shannon entropy as a penalty in value iteration [37]. The authors confirmed in simulated experiments that entropy regularization helps to improve exploration and allows transferring skills between tasks. [99] proposed a unified framework to analyze the above algorithms via a regularized Bellman operator. [81] showed that entropy regularization can also help to speed up the convergence rate from a theoretical perspective. It is worth noting that [38] explained the exploitation-exploration trade-off with entropy regularization from a continuous-time stochastic control perspective and provided theoretical support for Gaussian exploration from the special case of linear-quadratic regulator. Recently [100] and [101] extended this idea to a general control framework beyond the linear-quadratic case. [102] extended the idea of Shannon's entropy regularization to mutual-information regularization and showed its superior performance when actions have significantly different importance. When functional approximations are applied in RL training, the regularization technique is shown to be a powerful, flexible, and principled way to balance approximation and estimation errors [103][104]. The idea of regularization is that one starts from a large function space and controls the solution's complexity by a regularization (or penalization) term. Examples of regularization include the Hilbert norm (for [math]Q[/math]-functions) and the [math]l_1[/math] norm (for parameters).

Reinforcement Learning in Non-stationary Environment

Most existing work on reinforcement learning considers a stationary environment and aims to find the optimal policy or a policy with low (static) regret. In many financial applications, however, the environment is far from being stationary. We introduce the mathematical formulations of non-stationary RL (in both episodic and infinite horizon settings) below.

Non-stationary Episodic RL. For the episodic (or finite-time horizon) setting, an agent interacts with a non-stationary MDP for [math]K[/math] episodes, with each episode containing [math]T[/math] timestamps. Let [math](t, k)[/math] denote a (time,episode) index for the [math]t[/math]-th timestamp in the [math]k[/math]-th episode. The non-stationary environment can be denoted by a tuple [math](\mathcal{S},\mathcal{A}, T, \pmb{P}, \pmb{r})[/math], where [math]\mathcal{S}[/math] is the state space, [math]\mathcal{A}[/math] is the action space, [math]T[/math] is the number of timestamps in one episode, [math]\pmb{P} = \{P_{k,t} \}_{1 \leq k\leq K, 0\leq t\leq T}[/math] is the set of transition kernels with [math]P_{k,t}: \mathcal{S}\times \mathcal{A}\rightarrow\mathcal{P}(\mathcal{S})[/math] for all [math]1\leq k\leq K[/math] and [math]0\leq t\leq T[/math], and [math]\pmb{r} = \{r_{k,t} \}_{1 \leq k\leq K, 0\leq t\leq T}[/math] is the set of reward functions with [math]r_{k,t}(s,a)[/math] a random variable in [math]\mathbb{R}[/math] (depending on both the time-step and the episode number) for each pair [math](s,a)\in \mathcal{S}\times \mathcal{A}[/math] and for [math]0\leq t\leq T-1[/math]. Similarly, the terminal reward [math]r_{k,T}(s)[/math] is a real-valued random variable for each [math]s\in\mathcal{S}[/math]. Denote by [math]\Pi=\{\pi_{k,t} \}_{1 \leq k\leq K, 0\leq t\leq T-1}[/math] the Markovian policy in which [math]\pi_{k,t}[/math] can be either deterministic or randomized. The value function for this policy at the [math]k[/math]-th episode can be written as

[[math]] \begin{eqnarray} V^{\Pi}_{k,t}(s) = \left.\mathbb{E}^{\Pi}\left[\sum_{u=t}^{T-1} r_{k,u}(s_u,a_u)+r_{k,T}(s_T)\right|s_t = s\right], \end{eqnarray} [[/math]]

where [math]s_{u+1} \sim P_{k,u}(s_u,a_u)[/math] and [math]a_u\sim \pi_{k,u}(s_u)[/math].

Non-stationary RL with Infinite Time Horizon. For the infinite horizon setting, we drop the index indicating episodes and let the finite horizon [math]T[/math] become infinite. Then the non-stationary environment can be denoted by a tuple [math](\mathcal{S},\mathcal{A}, \pmb{P}, \pmb{r})[/math], with the obvious changes from the episodic case. Again we write [math]\Pi=\{\pi_{t} \}_{t \ge 0}[/math] for the Markovian policy in which [math]\pi_{t}[/math] can be either deterministic or randomized and consider the associated value function with a discount factor [math]\gamma[/math]:

[[math]] \begin{eqnarray} V^{\Pi}_{t}(s) = \left.\mathbb{E}^{\Pi}\left[\sum_{u=t}^\infty \gamma^{u-t} r_{u}(s_u,a_u)\right|s_t = s\right], \end{eqnarray} [[/math]]

where [math]s_{u+1} \sim P_{u}(\cdot|s_u,a_u)[/math] and [math]a_u\sim \pi_{u}(s_u)[/math].

Indeed, there has been a recent surge of theoretical studies on this topic for various settings such as infinite-horizon MDPs with finite state and action spaces [105][106], episodic MDPs with finite state and action spaces [107], and episodic linear MDPs [108]. However, all these papers assume the agent has prior knowledge of the degree of nonstationarity such as the number of changes in the environment, which is not very practical for many applications. To relax this assumption, [109] proposes a Bandit-over-Reinforcement-Learning (BoRL) algorithm to extend their earlier result [106]. However, it introduces extra overheads and leads to suboptimal regret. More recently, [110] proposes a general approach that is applicable to various reinforcement learning settings (including both episodic MDPs and infinite-horizon MDPs) and achieves optimal dynamic regret without any prior knowledge of the degree of non-stationarity.

General references

Hambly, Ben; Xu, Renyuan; Yang, Huining (2023). "Recent Advances in Reinforcement Learning in Finance". arXiv:2112.04553 [q-fin.MF].

Notes

  1. Sample complexity with respect to the number of episodes has also been adopted in the literature [25][26]. The translation between these two criteria is straightforward.
  2. The learning rate can depend on the number of iterations, the state-action pair and other parameters in the algorithm. For notational simplicity, we demonstrate the SARSA Algorithm (and the rest of the algorithms mentioned) with a constant learning rate [math]\beta[/math]. In the convergence analysis, we use [math]\beta_n[/math] to denote a learning rate which depends on the number of iterations.
  3. In the original papers the sample complexity is defined in terms of the number of episodes. Here we multiply the original order in these papers by [math]T[/math] to match the definition of sample complexity in this paper.

References

  1. D.SILVER, T.HUBERT, J.SCHRITTWIESER, I.ANTONOGLOU, M.LAI, A.GUEZ, M.LANCTOT, L.SIFRE, D.KUMARAN, T.GRAEPEL, etAL., Mastering chess and shogi by self-play with a general reinforcement learning algorithm, arXiv preprint arXiv:1712.01815, (2017).
  2. O.VINYALS, I.BABUSCHKIN, W.M. Czarnecki, M.MATHIEU, A.DUDZIK, J.CHUNG, D.H. Choi, R.POWELL, T.EWALDS, P.GEORGIEV, etAL., Grandmaster level in starcraft ii using multi-agent reinforcement learning, Nature, 575 (2019), pp.350--354.
  3. A.RADFORD, J.WU, R.CHILD, D.LUAN, D.AMODEI, I.SUTSKEVER, etAL., Language models are unsupervised multitask learners, OpenAI blog, 1 (2019), p.9.
  4. T.P. Lillicrap, J.J. Hunt, A.PRITZEL, N.HEESS, T.EREZ, Y.TASSA, D.SILVER, and D.WIERSTRA, Continuous control with deep reinforcement learning, in 4th International Conference on Learning Representations (ICLR), 2016.
  5. I.GOODFELLOW, Y.BENGIO, and A.COURVILLE, Deep learning, MIT press, 2016.
  6. V.MNIH, K.KAVUKCUOGLU, D.SILVER, A.GRAVES, I.ANTONOGLOU, D.WIERSTRA, and M.RIEDMILLER, Playing atari with deep reinforcement learning, arXiv preprint arXiv:1312.5602, (2013).
  7. H.VANHASSELT, A.GUEZ, and D.SILVER, Deep reinforcement learning with double Q-learning, in Proceedings of the AAAI Conference on Artificial Intelligence, vol.30, 2016.
  8. S.GU, T.LILLICRAP, I.SUTSKEVER, and S.LEVINE, Continuous deep Q-learning with model-based acceleration, in International Conference on Machine Learning, PMLR, 2016, pp.2829--2838.
  9. J.FAN, Z.WANG, Y.XIE, and Z.YANG, A theoretical analysis of deep Q-learning, in Learning for Dynamics and Control, PMLR, 2020, pp.486--489.
  10. 10.0 10.1 10.2 M.L. Puterman, Markov decision processes: discrete stochastic dynamic programming, John Wiley & Sons, 2014.
  11. Y.ABBASI-Yadkori, P.BARTLETT, K.BHATIA, N.LAZIC, C.SZEPESVARI, and G.WEISZ, Politex: Regret bounds for policy iteration using expert prediction, in International Conference on Machine Learning, PMLR, 2019, pp.3692--3702.
  12. C.-Y. Wei, M.J. Jahromi, H.LUO, H.SHARMA, and R.JAIN, Model-free reinforcement learning in infinite-horizon average-reward Markov decision processes, in International Conference on Machine Learning, PMLR, 2020, pp.10170--10180.
  13. S.J. Bradtke and A.G. Barto, Linear least-squares algorithms for temporal difference learning, Machine learning, 22 (1996), pp.33--57.
  14. F.S. Melo and M.I. Ribeiro, Q-learning with linear function approximation, in International Conference on Computational Learning Theory, Springer, 2007, pp.308--322.
  15. 15.0 15.1 15.2 C.JIN, Z.YANG, Z.WANG, and M.I. Jordan, Provably efficient reinforcement learning with linear function approximation, in Conference on Learning Theory, PMLR, 2020, pp.2137--2143.
  16. 16.0 16.1 16.2 L.YANG and M.WANG, Sample-optimal parametric Q-learning using linearly additive features, in International Conference on Machine Learning, PMLR, 2019, pp.6995--7004.
  17. B.DAI, A.SHAW, L.LI, L.XIAO, N.HE, Z.LIU, J.CHEN, and L.SONG, SBEED: Convergent reinforcement learning with nonlinear function approximation, in International Conference on Machine Learning, PMLR, 2018, pp.1125--1134.
  18. 18.0 18.1 18.2 R.S. Sutton and A.G. Barto, Reinforcement Learning: An Introduction, MIT press, 2018.
  19. W.B. Powell, Reinforcement learning and stochastic optimization, 2021.
  20. 20.0 20.1 20.2 M.KEARNS and S.SINGH, Near-optimal reinforcement learning in polynomial time, Machine Learning, 49 (2002), pp.209--232.
  21. S.KOENIG and R.G. Simmons, Complexity analysis of real-time reinforcement learning, in AAAI, 1993, pp.99--107.
  22. 22.0 22.1 M.G. Azar, R.MUNOS, and B.KAPPEN, On the sample complexity of reinforcement learning with a generative model, arXiv preprint arXiv:1206.6461, (2012).
  23. H.ROBBINS, Some aspects of the sequential design of experiments, Bulletin of the American Mathematical Society, 58 (1952), pp.527--535.
  24. D.A. Berry and B.FRISTEDT, Bandit problems: Sequential allocation of experiments (Monographs on Statistics and Applied Probability), London: Chapman and Hall, 5 (1985), pp.7--7.
  25. 25.0 25.1 C.DANN and E.BRUNSKILL, Sample complexity of episodic fixed-horizon reinforcement learning, in NIPS'15, Cambridge, MA, USA, 2015, MIT Press, pp.2818--2826.
  26. 26.0 26.1 C.DANN, T.LATTIMORE, and E.BRUNSKILL, Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning, in Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, 2017, pp.5717--5727.
  27. 27.0 27.1 J.ZHANG, J.KIM, B.O'Donoghue, and S.BOYD, Sample efficient reinforcement learning with REINFORCE, in Proceedings of the AAAI Conference on Artificial Intelligence, vol.35, 2021, pp.10887--10895.
  28. Y.LIU, K.ZHANG, T.BASAR, and W.YIN, An improved analysis of (variance-reduced) policy gradient and natural policy gradient methods, in NeurIPS, 2020.
  29. 29.0 29.1 R.I. Brafman and M.TENNENHOLTZ, R-max-a general polynomial time algorithm for near-optimal reinforcement learning, Journal of Machine Learning Research, 3 (2002), pp.213--231.
  30. 30.0 30.1 T.LATTIMORE and M.HUTTER, PAC bounds for discounted MDPs, in International Conference on Algorithmic Learning Theory, Springer, 2012, pp.320--334.
  31. 31.0 31.1 I.SZITA and C.SZEPESV\'ari, Model-based reinforcement learning with nearly tight exploration complexity bounds, in ICML, 2010, pp.1031--1038.
  32. 32.0 32.1 M.FAZEL, R.GE, S.M. Kakade, and M.MESBAHI, Global convergence of policy gradient methods for the linear quadratic regulator, in International Conference on Machine Learning, PMLR, 2018, pp.1467--1476.
  33. 33.0 33.1 B.HAMBLY, R.XU, and H.YANG, Policy gradient methods for the noisy linear quadratic regulator over a finite horizon, SIAM Journal on Control and Optimization, 59 (2021), pp.3359--3391.
  34. 34.0 34.1 M.BASEI, X.GUO, A.HU, and Y.ZHANG, Logarithmic regret for episodic continuous-time linear-quadratic reinforcement learning over a finite-time horizon, Available at SSRN 3848428, (2021).
  35. C.J. Watkins and P.DAYAN, Q-learning, Machine learning, 8 (1992), pp.279--292.
  36. 36.0 36.1 K.ASADI and M.L. Littman, An alternative softmax operator for reinforcement learning, in International Conference on Machine Learning, PMLR, 2017, pp.243--252.
  37. 37.0 37.1 37.2 T.HAARNOJA, H.TANG, P.ABBEEL, and S.LEVINE, Reinforcement learning with deep energy-based policies, in International Conference on Machine Learning, PMLR, 2017, pp.1352--1361.
  38. 38.0 38.1 38.2 H.WANG, T.ZARIPHOPOULOU, and X.ZHOU, Exploration versus exploitation in reinforcement learning: A stochastic control approach, J. Mach. Learn. Res., 21 (2020), pp.1--34.
  39. G.DALAL, B.SZ{\"o}r{\'e}nyi, G.THOPPE, and S.MANNOR, Finite sample analyses for TD(0) with function approximation, in 32th AAAI Conference on Artificial Intelligence, 2018.
  40. C.LAKSHMINARAYANAN and C.SZEPESVARI, Linear stochastic approximation: How far does constant step-size and iterate averaging go?, in International Conference on Artificial Intelligence and Statistics, PMLR, 2018, pp.1347--1355.
  41. J.BHANDARI, D.RUSSO, and R.SINGAL, A finite time analysis of temporal difference learning with linear function approximation, in Conference on Learning Theory, PMLR, 2018, pp.1691--1692.
  42. S.ZOU, T.XU, and Y.LIANG, Finite-sample analysis for SARSA with linear function approximation, Advances in Neural Information Processing Systems, 32 (2019), pp.8668--8678.
  43. E.EVEN-Dar, Y.MANSOUR, and P.BARTLETT, Learning rates for Q-learning, Journal of Machine Learning Research, 5 (2003).
  44. C.L. Beck and R.SRIKANT, Error bounds for constant step-size Q-learning, Systems & Control Letters, 61 (2012), pp.1203--1208.
  45. 45.0 45.1 A.L. Strehl, L.LI, and M.L. Littman, Reinforcement learning in finite MDPs: PAC analysis, Journal of Machine Learning Research, 10 (2009).
  46. 46.0 46.1 Y.WANG, K.DONG, X.CHEN, and L.WANG, Q-learning with UCB exploration is sample efficient for infinite-horizon MDP, in International Conference on Learning Representations, 2020.
  47. T.LATTIMORE, C.SZEPESVARI, and G.WEISZ, Learning with good feature representations in bandits and in RL with a generative model, in International Conference on Machine Learning, PMLR, 2020, pp.5662--5670.
  48. H.HASSELT, Double Q-learning, Advances in Neural Information Processing Systems, 23 (2010), pp.2613--2621.
  49. H.XIONG, L.ZHAO, Y.LIANG, and W.ZHANG, Finite-time analysis for double Q-learning, Advances in Neural Information Processing Systems, 33 (2020).
  50. M.SEWAK, Policy-based reinforcement learning approaches, in Deep Reinforcement Learning, Springer, 2019, pp.127--140.
  51. M.YU and S.SUN, Policy-based reinforcement learning for time series anomaly detection, Engineering Applications of Artificial Intelligence, 95 (2020), p.103919.
  52. K.DAB{\'e}rius, E.GRANAT, and P.KARLSSON, Deep execution-value and policy based reinforcement learning for trading and beating market benchmarks, Available at SSRN 3374766, (2019).
  53. 53.0 53.1 R.J. Williams, Simple statistical gradient-following algorithms for connectionist reinforcement learning, Machine Learning, 8 (1992), pp.229--256.
  54. 54.0 54.1 V.R. Konda and J.N. Tsitsiklis, Actor-critic algorithms, in Advances in Neural Information Processing Systems, 2000, pp.1008--1014.
  55. 55.0 55.1 55.2 J.SCHULMAN, S.LEVINE, P.ABBEEL, M.JORDAN, and P.MORITZ, Trust region policy optimization, in International Conference on Machine Learning, PMLR, 2015, pp.1889--1897.
  56. 56.0 56.1 56.2 J.SCHULMAN, F.WOLSKI, P.DHARIWAL, A.RADFORD, and O.KLIMOV, Proximal policy optimization algorithms, arXiv preprint arXiv:1707.06347, (2017).
  57. 57.0 57.1 R.S. Sutton, D.A. McAllester, S.P. Singh, and Y.MANSOUR, Policy gradient methods for reinforcement learning with function approximation, in Advances in Neural Information Processing Systems, 2000, pp.1057--1063.
  58. V.KONDA, Actor-critic algorithms, PhD thesis, MIT, 2002.
  59. V.FRAN{\cc}ois-Lavet, G.RABUSSEAU, J.PINEAU, D.ERNST, and R.FONTENEAU, On overfitting and asymptotic bias in batch reinforcement learning with partial observability, Journal of Artificial Intelligence Research, 65 (2019), pp.1--30.
  60. U.VONLUXBURG and B.SCH{\"o}lkopf, Statistical learning theory: Models, concepts, and results, in Handbook of the History of Logic, vol.10, Elsevier, 2011, pp.651--706.
  61. 61.0 61.1 H.KUMAR, A.KOPPEL, and A.RIBEIRO, On the sample complexity of actor-critic method for reinforcement learning with function approximation, arXiv preprint arXiv:1910.08412, (2019).
  62. T.XU, Z.WANG, and Y.LIANG, Improving sample complexity bounds for (natural) actor-critic algorithms, in Advances in Neural Information Processing Systems, vol.33, 2020, pp.4358--4369.
  63. 63.0 63.1 \leavevmode\vrule height 2pt depth -1.6pt width 23pt, Non-asymptotic convergence analysis of two time-scale (natural) actor-critic algorithms, arXiv preprint arXiv:2005.03557, (2020).
  64. 64.0 64.1 Z.FU, Z.YANG, and Z.WANG, Single-timescale actor-critic provably finds globally optimal policy, in International Conference on Learning Representations, 2021.
  65. 65.0 65.1 T.XU, Z.YANG, Z.WANG, and Y.LIANG, Doubly robust off-policy actor-critic: Convergence and optimality, arXiv preprint arXiv:2102.11866, (2021).
  66. S.M. Kakade, A natural policy gradient, Advances in Neural Information Processing Systems, 14 (2001).
  67. J.PETERS and S.SCHAAL, Natural actor-critic, Neurocomputing, 71 (2008), pp.1180--1190.
  68. D.SILVER, G.LEVER, N.HEESS, T.DEGRIS, D.WIERSTRA, and M.RIEDMILLER, Deterministic policy gradient algorithms, in International Conference on Machine Learning, PMLR, 2014, pp.387--395.
  69. Z.WANG, V.BAPST, N.HEESS, V.MNIH, R.MUNOS, K.KAVUKCUOGLU, and N.FREITAS, Sample efficient actor-critic with experience replay, in International Conference on Learning Representations (ICLR), 2017, pp.1--13.
  70. V.MNIH, A.P. Badia, M.MIRZA, A.GRAVES, T.LILLICRAP, T.HARLEY, D.SILVER, and K.KAVUKCUOGLU, Asynchronous methods for deep reinforcement learning, in International Conference on Machine Learning, PMLR, 2016, pp.1928--1937.
  71. Y.JIA and X.Y. Zhou, Policy gradient and actor-critic learning in continuous time and space: Theory and algorithms, arXiv preprint arXiv:2111.11232, (2021).
  72. \leavevmode\vrule height 2pt depth -1.6pt width 23pt, Policy evaluation and temporal-difference learning in continuous time and space: A martingale approach, Journal of Machine Learning Research, 23 (2022), pp.1--55.
  73. S.BHATNAGAR, An actor--critic algorithm with function approximation for discounted cost constrained Markov decision processes, Systems & Control Letters, 59 (2010), pp.760--766.
  74. K.ZHANG, A.KOPPEL, H.ZHU, and T.BASAR, Global convergence of policy gradient methods to (almost) locally optimal policies, SIAM Journal on Control and Optimization, 58 (2020), pp.3586--3612.
  75. M.PAPINI, D.BINAGHI, G.CANONACO, M.PIROTTA, and M.RESTELLI, Stochastic variance-reduced policy gradient, in International Conference on Machine Learning, PMLR, 2018, pp.4026--4035.
  76. P.XU, F.GAO, and Q.GU, An improved convergence analysis of stochastic variance-reduced policy gradient, in Uncertainty in Artificial Intelligence, PMLR, 2020, pp.541--551.
  77. P.XU, F.GAO, and Q.GU, Sample efficient policy gradient methods with recursive variance reduction, in International Conference on Learning Representations, 2020.
  78. Z.SHEN, A.RIBEIRO, H.HASSANI, H.QIAN, and C.MI, Hessian aided policy gradient, in International Conference on Machine Learning, PMLR, 2019, pp.5729--5738.
  79. H.XIONG, T.XU, Y.LIANG, and W.ZHANG, Non-asymptotic convergence of adam-type reinforcement learning algorithms under Markovian sampling, Proceedings of the AAAI Conference on Artificial Intelligence, 35 (2021), pp.10460--10468.
  80. J.BHANDARI and D.RUSSO, Global optimality guarantees for policy gradient methods, arXiv preprint arXiv:1906.01786, (2019).
  81. 81.0 81.1 S.CEN, C.CHENG, Y.CHEN, Y.WEI, and Y.CHI, Fast global convergence of natural policy gradient methods with entropy regularization, arXiv preprint arXiv:2007.06558, (2020).
  82. L.SHANI, Y.EFRONI, and S.MANNOR, Adaptive trust region policy optimization: Global convergence and faster rates for regularized MDPs, in Proceedings of the AAAI Conference on Artificial Intelligence, vol.34, 2020, pp.5668--5675.
  83. A.AGARWAL, S.M. Kakade, J.D. Lee, and G.MAHAJAN, On the theory of policy gradient methods: Optimality, approximation, and distribution shift, Journal of Machine Learning Research, 22 (2021), pp.1--76.
  84. J.MEI, C.XIAO, C.SZEPESVARI, and D.SCHUURMANS, On the global convergence rates of softmax policy gradient methods, in International Conference on Machine Learning, PMLR, 2020, pp.6820--6829.
  85. O.IAN, V.R. Benjamin, and R.DANIEL, (More) Efficient reinforcement learning via posterior sampling, in Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS'13, 2013, pp.3003--3011.
  86. 86.0 86.1 M.G. Azar, I.OSBAND, and R.MUNOS, Minimax regret bounds for reinforcement learning, in International Conference on Machine Learning, PMLR, 2017, pp.263--272.
  87. 87.0 87.1 C.JIN, Z.ALLEN-Zhu, S.BUBECK, and M.I. Jordan, Is Q-learning provably efficient?, in Advances in Neural Information Processing Systems, vol.31, 2018.
  88. L.YANG and M.WANG, Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound, in International Conference on Machine Learning, PMLR, 2020, pp.10746--10756.
  89. L.LI, A unifying framework for computational reinforcement learning theory, Rutgers The State University of New Jersey-New Brunswick, 2009.
  90. A.L. Strehl and M.L. Littman, A theoretical analysis of model-based interval estimation, in Proceedings of the 22nd International Conference on Machine Learning, ICML'05, Association for Computing Machinery, 2005, pp.856--863.
  91. 91.0 91.1 M.G. Azar, R.MUNOS, and H.J. Kappen, Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model, Machine Learning, 91 (2013), pp.325--349.
  92. A.AGARWAL, S.KAKADE, and L.F. Yang, Model-based reinforcement learning with a generative model is minimax optimal, in Conference on Learning Theory, PMLR, 2020, pp.67--83.
  93. X.GUO, A.HU, and Y.ZHANG, Reinforcement learning for linear-convex models with jumps via stability analysis of feedback controls, arXiv preprint arXiv:2104.09311, (2021).
  94. C.DANN, Y.MANSOUR, M.MOHRI, A.SEKHARI, and K.SRIDHARAN, Guarantees for epsilon-greedy reinforcement learning with function approximation, in International Conference on Machine Learning, PMLR, 2022, pp.4666--4689.
  95. W.DABNEY, G.OSTROVSKI, and A.BARRETO, Temporally-extended $\epsilon$-greedy exploration, arXiv preprint arXiv:2006.01782, (2020).
  96. W.R. Thompson, On the likelihood that one unknown probability exceeds another in view of the evidence of two samples, Biometrika, 25 (1933), pp.285--294.
  97. Y.OUYANG, M.GAGRANI, A.NAYYAR, and R.JAIN, Learning unknown markov decision processes: A thompson sampling approach, Advances in neural information processing systems, 30 (2017).
  98. A.GOPALAN and S.MANNOR, Thompson sampling for learning parameterized markov decision processes, in Conference on Learning Theory, PMLR, 2015, pp.861--898.
  99. M.GEIST, B.SCHERRER, and O.PIETQUIN, A theory of regularized Markov decision processes, in International Conference on Machine Learning, PMLR, 2019, pp.2160--2169.
  100. X.GAO, Z.Q. Xu, and X.Y. Zhou, State-dependent temperature control for langevin diffusions, arXiv preprint arXiv:2011.07456, (2020).
  101. W.TANG, P.Y. Zhang, and X.Y. Zhou, Exploratory HJB equations and their convergence, arXiv preprint arXiv:2109.10269, (2021).
  102. J.GRAU-Moya, F.LEIBFRIED, and P.VRANCX, Soft Q-learning with mutual-information regularization, in International Conference on Learning Representations, 2018.
  103. A.MASSOUD Farahmand, M.GHAVAMZADEH, C.SZEPESV\`ari, and S.MANNOR, Regularized fitted Q-iteration for planning in continuous-space Markovian decision problems, in 2009 American Control Conference, IEEE, 2009, pp.725--730.
  104. A.M. Farahmand, M.GHAVAMZADEH, C.SZEPESV\`ari, and S.MANNOR, Regularized policy iteration, in Advances in Neural Information Processing Systems 21 - Proceedings of the 2008 Conference, 01 2008, pp.441--448.
  105. P.GAJANE, R.ORTNER, and P.AUER, A sliding-window algorithm for markov decision processes with arbitrarily changing rewards and transitions, arXiv preprint arXiv:1805.10066, (2018).
  106. 106.0 106.1 W.C. Cheung, D.SIMCHI-Levi, and R.ZHU, Learning to optimize under non-stationarity, in The 22nd International Conference on Artificial Intelligence and Statistics, PMLR, 2019, pp.1079--1087.
  107. W.MAO, K.ZHANG, R.ZHU, D.SIMCHI-Levi, and T.BA{\cs}ar, Model-free non-stationary rl: Near-optimal regret and applications in multi-agent rl and inventory control, arXiv preprint arXiv:2010.03161, (2020).
  108. A.TOUATI and P.VINCENT, Efficient learning in non-stationary linear markov decision processes, arXiv preprint arXiv:2010.12870, (2020).
  109. \leavevmode\vrule height 2pt depth -1.6pt width 23pt, Reinforcement learning for non-stationary markov decision processes: The blessing of (more) optimism, in International Conference on Machine Learning, PMLR, 2020, pp.1843--1854.
  110. C.-Y. Wei and H.LUO, Non-stationary reinforcement learning without prior knowledge: An optimal black-box approach, in Conference on Learning Theory, PMLR, 2021, pp.4300--4354.