Deep Reinforcement Learning

[math] \newcommand*{\rom}[1]{\expandafter\@slowromancap\romannumeral #1@} \DeclareMathOperator*{\dprime}{\prime \prime} \DeclareMathOperator{\Tr}{Tr} \DeclareMathOperator{\E}{\mathbb{E}} \DeclareMathOperator{\N}{\mathbb{N}} \DeclareMathOperator{\R}{\mathbb{R}} \DeclareMathOperator{\Sc}{\mathcal{S}} \DeclareMathOperator{\Ac}{\mathcal{A}} \DeclareMathOperator{\Pc}{\mathcal{P}} \DeclareMathOperator*{\argmin}{arg\,min} \DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator{\sx}{\underline{\sigma}_{\pmb{X}}} \DeclareMathOperator{\sqmin}{\underline{\sigma}_{\pmb{Q}}} \DeclareMathOperator{\sqmax}{\overline{\sigma}_{\pmb{Q}}} \DeclareMathOperator{\sqi}{\underline{\sigma}_{Q,\textit{i}}} \DeclareMathOperator{\sqnoti}{\underline{\sigma}_{\pmb{Q},-\textit{i}}} \DeclareMathOperator{\sqfir}{\underline{\sigma}_{\pmb{Q},1}} \DeclareMathOperator{\sqsec}{\underline{\sigma}_{\pmb{Q},2}} \DeclareMathOperator{\sru}{\underline{\sigma}_{\pmb{R}}^{u}} \DeclareMathOperator{\srv}{\underline{\sigma}_{\pmb{R}}^v} \DeclareMathOperator{\sri}{\underline{\sigma}_{R,\textit{i}}} \DeclareMathOperator{\srnoti}{\underline{\sigma}_{\pmb{R},\textit{-i}}} \DeclareMathOperator{\srfir}{\underline{\sigma}_{\pmb{R},1}} \DeclareMathOperator{\srsec}{\underline{\sigma}_{\pmb{R},2}} \DeclareMathOperator{\srmin}{\underline{\sigma}_{\pmb{R}}} \DeclareMathOperator{\srmax}{\overline{\sigma}_{\pmb{R}}} \DeclareMathOperator{\HH}{\mathcal{H}} \DeclareMathOperator{\HE}{\mathcal{H}(1/\varepsilon)} \DeclareMathOperator{\HD}{\mathcal{H}(1/\varepsilon)} \DeclareMathOperator{\HCKI}{\mathcal{H}(C(\pmb{K}^0))} \DeclareMathOperator{\HECK}{\mathcal{H}(1/\varepsilon,C(\pmb{K}))} \DeclareMathOperator{\HECKI}{\mathcal{H}(1/\varepsilon,C(\pmb{K}^0))} \DeclareMathOperator{\HC}{\mathcal{H}(1/\varepsilon,C(\pmb{K}))} \DeclareMathOperator{\HCK}{\mathcal{H}(C(\pmb{K}))} \DeclareMathOperator{\HCKR}{\mathcal{H}(1/\varepsilon,1/{\it{r}},C(\pmb{K}))} \DeclareMathOperator{\HCKR}{\mathcal{H}(1/\varepsilon,C(\pmb{K}))} \DeclareMathOperator{\HCKIR}{\mathcal{H}(1/\varepsilon,1/{\it{r}},C(\pmb{K}^0))} \DeclareMathOperator{\HCKIR}{\mathcal{H}(1/\varepsilon,C(\pmb{K}^0))} \newcommand{\mathds}{\mathbb}[/math]

In the setting where the state and action spaces are very large, or high dimensional or they are continuous it is often challenging to apply classical value-based and policy-based methods. In this context, parametrized value functions ([math]Q(s,a;\theta)[/math] and [math]V(s,a;\theta)[/math]) or policies ([math]\pi(s,a;\theta)[/math]) are more practical. Here we focus on neural-network-based parametrizations and functional approximations, which have the following advantages:

  • Neural networks are well suited for dealing with high-dimensional inputs such as time series data, and, in practice, they do not require an exponential increase in data when adding extra dimensions to the state or action space.
  • In addition, they can be trained incrementally and make use of additional samples obtained as learning happens.

In this section, we will introduce deep RL methods in the setting of an infinite time horizon with discounting, finite state and action spaces ([math]|\mathcal{A}| \lt \infty[/math] and [math]|\mathcal{S}| \lt \infty[/math]), and stationary policies.

Neural Networks

A typical neural network is a nonlinear function, represented by a collection of neurons. These are typically arranged as a number of layers connected by operators such as filters, poolings and gates, that map an input variable in [math]\mathbb{R}^{n} [/math] to an output variable in [math]\mathbb{R}^{m}[/math] for some [math]n,m \in \mathbb{Z}^+[/math]. In this subsection, we introduce three popular neural network architectures including fully-connected neural networks, convolutional neural networks [1] and recurrent neural networks [2].


Fully-connected Neural Networks. A fully-connected Neural Network (FNN) is the simplest neural network architecture where any given neuron is connected to all neurons in the previous layer. To describe the setup we fix the number of layers [math]L \in \mathbb{Z}^{+}[/math] in the neural network and the width of the [math]l[/math]-th layer [math]n_l \in \mathbb{Z}^+[/math] for [math]l=1,2,\ldots, L[/math]. Then for an input variable [math]\mathbf{z} \in \mathbb{R}^{n}[/math], the functional form of the FNN is

[[math]] \begin{eqnarray}\label{eq:generator} F(\mathbf{z};\pmb{W},\pmb{b}) = \mathbf{W}_L \cdot \sigma \left( \mathbf{W}_{L-1} \cdots \sigma (\mathbf{W}_1 \mathbf{z}+\mathbf{b}_1)\cdots +\mathbf{b}_{L-1}\right) +\mathbf{b}_L, \end{eqnarray} [[/math]]

in which [math](\mathbf{W},\mathbf{b})[/math] represents all the parameters in the neural network, with [math]\mathbf{W} = (\mathbf{W}_1,\mathbf{W}_2,\ldots,\mathbf{W}_L)[/math] and [math]\mathbf{b} = (\mathbf{b}_1,\mathbf{b}_2,\ldots,\mathbf{b}_L)[/math]. Here [math]\mathbf{W}_l \in \mathbb{R}^{n_l \times n_{l-1}}[/math] and [math]\mathbf{b}_l \in \mathbb{R}^{n_l \times 1}[/math] for [math]l=1,2,\ldots,L[/math], where [math]n_0 = n[/math] is set as the dimension of the input variable. The operator [math]\sigma(\cdot)[/math] takes a vector of any dimension as input, and applies a function component-wise. Specifically, for any [math]q\in \mathbb{Z}^+[/math] and any vector [math]\mathbf{u}=(u_1,u_2,\cdots,u_q)^{\top} \in \mathbb{R}^{q}[/math], we have that

[[math]] \begin{eqnarray*} \sigma(\mathbf{u}) = (\sigma(u_1),\sigma(u_2),\cdots,\sigma(u_q))^{\top}. \end{eqnarray*} [[/math]]

In the neural networks literature, the [math]\mathbf{W}_l[/math]'s are often called the weight matrices, the [math]\mathbf{b}_l[/math]'s are called bias vectors, and [math]\sigma(\cdot)[/math] is referred to as the activation function. Several popular choices for the activation function include ReLU with [math]\sigma(u) = \max(u,0)[/math], Leaky ReLU with [math]\sigma(u) = a_1\,\max(u,0) - a_2\, \max(-u,0)[/math] where [math]a_1,a_2 \gt 0[/math], and smooth functions such as [math]\sigma(\cdot) = \tanh(\cdot)[/math]. In \eqref{eq:generator}, information propagates from the input layer to the output layer in a feed-forward manner in the sense that connections between the nodes do not form a cycle. Hence \eqref{eq:generator} is also referred to as fully-connected feed-forward neural network in the deep learning literature.

Convolutional Neural Networks. In addition to the FNNs above, convolutional neural networks (CNNs) are another type of feed-foward neural network that are especially popular for image processing. In the finance setting CNNs have been successfully applied to price prediction based on inputs which are images containing visualizations of price dynamics and trading volumes [3]. The CNNs have two main building blocks -- convolutional layers and pooling layers. Convolutional layers are used to capture local patterns in the images and pooling layers are used to reduce the dimension of the problem and improve the computational efficiency. Each convolutional layer uses a number of trainable filters to extract features from the data. We start with the simple case of a single filter [math]\pmb{H}[/math], which applies the following convolution operation [math]\pmb{z}\ast \pmb{H}:\mathbb{R}^{n_x\times n_y}\times \mathbb{R}^{k_x\times k_y}\rightarrow \mathbb{R}^{(n_x-k_x+1)\times(n_y-k_y+1)}[/math] to the input [math]\pmb{z}[/math] through

[[math]] [\pmb{z}\ast \pmb{H}]_{i,j} = \sum_{i^\prime=1}^{k_x}\sum_{j^\prime=1}^{k_y}[\pmb{z}]_{i+i^\prime-1,j+j^\prime-1}[\pmb{H}]_{i^\prime,j^\prime}. [[/math]]

The output of the convolutional layer [math]\widehat{\pmb{z}}[/math] is followed by the activation function [math]\sigma(\cdot)[/math], that is

[[math]] [\widehat{\pmb{z}}]_{i,j}=\sigma([\pmb{z}\ast \pmb{H}]_{i,j}). [[/math]]

The weights and bias introduced for FNNs can also be incorporated in this setting thus, in summary, the above simple CNN can be represented by

[[math]] F(\pmb{z};\pmb{H},\pmb{W},\pmb{b}) = \pmb{W}\cdot\sigma(\pmb{z}\ast \pmb{H})+\pmb{b}. [[/math]]

This simple convolutional layer can be generalized to the case of multiple channels with multiple filters, where the input is a 3D tensor [math]\pmb{z}\in\mathbb{R}^{n_x\times n_y\times n_z}[/math] where [math]n_z[/math] denotes the number of channels. Each channel represents a feature of the input, for example, an image as an input typically has three channels, red, green, and blue. Then each filter is also represented by a 3D tensor [math]\pmb{H}_k\in\mathbb{R}^{k_x\times k_y\times n_z}[/math] ([math]k=1,\ldots,K[/math]) with the same third dimension [math]n_z[/math] as the input and [math]K[/math] is the total number of filters. In this case, the convolution operation becomes

[[math]] [\pmb{z}\ast \pmb{H}_k]_{i,j} = \sum_{i^\prime=1}^{k_x}\sum_{j^\prime=1}^{k_y}\sum_{l=1}^{n_z}[\pmb{z}]_{i+i^\prime-1,j+j^\prime-1,l}[\pmb{H}]_{i^\prime,j^\prime,l}, [[/math]]

and the output is given by

[[math]] [\widehat{\pmb{z}}]_{i,j,k}=\sigma([\pmb{z}\ast \pmb{H}_k]_{i,j}). [[/math]]

Pooling layers are used to aggregate the information and reduce the computational cost, typically after the convolutional layers. A commonly used pooling layer is the max pooling layer, which computes the maximum value of a small neighbourhood in the spatial coordinates. Note that in addition, fully-connected layers, as introduced above, are often used in the last few layers of a CNN.

Recurrent Neural Networks. Recurrent Neural Networks (RNNs) are a family of neural networks that are widely used in processing sequential data, including speech, text and financial time series data. Unlike feed-forward neural networks, RNNs are a class of artificial neural networks where connections between units form a directed cycle. RNNs can use their internal memory to process arbitrary sequences of inputs and hence are applicable to tasks such as sequential data processing. For RNNs, our input is a sequence of data [math]\pmb{z}_1,\pmb{z}_2,\ldots,\pmb{z}_T[/math]. An RNN models the internal state [math]\pmb{h}_t[/math] by a recursive relation

[[math]] \pmb{h}_t = F(\pmb{h}_{t-1},\pmb{z}_t;\theta), [[/math]]

where [math]F[/math] is a neural network with parameter [math]\theta[/math] (for example [math]\theta=(\pmb{W},\pmb{b})[/math]). Then the output is given by

[[math]] \widehat{\pmb{z}}_t = G(\pmb{h}_{t-1},\pmb{z}_t;\phi), [[/math]]

where [math]G[/math] is another neural network with parameter [math]\phi[/math] ([math]\phi[/math] can be the same as [math]\theta[/math]). There are two important variants of the vanilla RNN introduced above -- the long short-term memory (LSTM) and the gated recurrent units (GRUs). Compared to vanilla RNNs, LSTM and GRUs are shown to have better performance in handling sequential data with long-term dependence due to their flexibility in propagating information flows. Here we introduce the LSTM. Let [math]\odot[/math] denote element-wise multiplication. The LSTM network architecture is given by

[[math]] \begin{eqnarray*} \pmb{f}_t &=& \sigma(\pmb{U}^{f}\pmb{z}_t+\pmb{W}^f \pmb{h}_{t-1}+\pmb{b}^f) \\ \pmb{g}_t &=& \sigma(\pmb{U}^{g}\pmb{z}_t+\pmb{W}^g \pmb{h}_{t-1}+\pmb{b}^g) \\ \pmb{q}_t &=& \sigma(\pmb{U}^{q}\pmb{z}_t+\pmb{W}^q \pmb{h}_{t-1}+\pmb{b}^q) \\ \pmb{c}_t &=& \pmb{f}_t\odot \pmb{c}_{t-1}+\pmb{g}_t\odot\sigma(\pmb{U}\pmb{z}_t+\pmb{W}\pmb{h}_{t-1}+\pmb{b}) \\ \pmb{h}_t &=& \tanh(\pmb{c}_t)\odot \pmb{q}_t \end{eqnarray*} [[/math]]

where [math]\pmb{U}, \pmb{U}^f, \pmb{U}^g, \pmb{U}^q, \pmb{W}, \pmb{W}^f, \pmb{W}^g, \pmb{W}^q[/math] are trainable weights matrices, and [math]\pmb{b}, \pmb{b}^f, \pmb{b}^g, \pmb{b}^q[/math] are trainable bias vectors. In addition to the internal state [math]\pmb{h}_t[/math], the LSTM also uses the gates and a cell state [math]\pmb{c}_t[/math]. The forget gate [math]\pmb{f}_t[/math] and the input gate [math]\pmb{g}_t[/math] determine how much information can be transmitted from the previous cell state [math]\pmb{c}_{t-1}[/math] and from the update [math]\sigma(\pmb{U}\pmb{z}_t+\pmb{W}\pmb{h}_{t-1}+\pmb{b})[/math], to the current cell state [math]\pmb{c}_t[/math]. The output gate [math]\pmb{q}_t[/math] controls how much [math]\pmb{c}_t[/math] reveals to [math]\pmb{h}_t[/math]. For more details see [2] and [4].

Training of Neural Networks. Mini-batch stochastic gradient descent is a popular choice for training neural networks due to its sample and computational efficiency. In this approach the parameters [math]\theta=(\pmb{W},\pmb{b})[/math] are updated in the descent direction of an objective function [math]\mathcal{L}(\theta)[/math] by selecting a mini-batch of samples at random to estimate the gradient of the objective function [math]\nabla_{\theta}\mathcal{L}(\theta)[/math] with respect to the parameter [math]\theta[/math]. For example, in supervised learning, we aim to learn the relationship between an input [math]X[/math] and an output [math]Y[/math], and the objective function [math]\mathcal{L}(\theta)[/math] measures the distance between the model prediction (output) and the actual observation [math]Y[/math]. Assume that the dataset contains [math]M[/math] samples of [math](X,Y)[/math]. Then the mini-batch stochastic gradient descent method is given as

[[math]] \begin{equation}\label{eqn:mini_batch_gd} \theta^{(n+1)} = \theta^{(n)} - \beta \widehat{\nabla_{\theta}\mathcal{L}(\theta^{(n)})}, \qquad n=0,\ldots,N-1, \end{equation} [[/math]]

where [math]\beta[/math] is a constant learning rate and [math]\widehat{\nabla_{\theta}\mathcal{L}(\theta^{(n)})}[/math] is an estimate of the true gradient [math]\nabla_{\theta}\mathcal{L}(\theta^{(n)})[/math], which is computed by averaging over [math]m[/math] ([math]m\ll M[/math]) samples of [math](X,Y)[/math]. It is called (vanilla) stochastic gradient descent when [math]m=1[/math],and it is the (traditional) gradient descent when [math]m=M[/math]. Compared with gradient descent, mini-batch stochastic gradient descent is noisy but computationally efficient since it only uses a subset of the data, which is advantageous when dealing with large datasets. It is also worth noting that, in addition to the standard mini-batch stochastic gradient descent methods \eqref{eqn:mini_batch_gd}, momentum methods are popular extensions which take into account the past gradient updates in order to accelerate the learning process. We give the updating rules for two examples of gradient descent methods with momentum -- the standard momentum and the Nesterov momentum [5],

[[math]] \begin{equation*} \begin{cases} \underbrace{z_{n+1}=\beta z_{n}+\nabla_{\theta}\mathcal{L}(\theta^{(n)})}_{\text{the standard momentum}}\quad \text{or} \quad \underbrace{z_{n+1}=\beta z_{n}+\nabla_{\theta} \mathcal{L}(\theta^{(n)}-\alpha z_{n})}_{\text{the Nesterov momentum}}\\ \theta^{(n+1)}=\theta^{(n)}-\alpha z_{n+1}, \end{cases} \end{equation*} [[/math]]

where [math]\alpha[/math] and [math]\beta[/math] are constant learning rates. Such methods are particularly useful when the algorithms enter into a region where the gradient changes dramatically and thus the learned parameters can bounce around the region which slows down the progress of the search. Additionally, there are many other variants such as RMSprop [6] and ADAM [7], both of which employ adaptive learning rates.

Deep Value-based RL Algorithms

In this section, we introduce several [math]Q[/math]-learning algorithms with neural network approximations. We refer the reader to [8] for other deep value-based RL algorithms such as Neural TD learning and Dueling [math]Q[/math]-Networks.


Neural Fitted [math]Q[/math]-learning. Fitted [math]Q[/math]-learning [9] is a generalization of the classical [math]Q[/math]-learning algorithm with functional approximations and it is applied in an off-line setting with a pre-collected dataset in the form of tuples [math](s, a, r, s')[/math] with [math]s'\sim P(s,a)[/math] and [math]r = r(s,a)[/math] which is random. When the class of approximation functionals is constrained to neural networks, the algorithm is referred to as Neural Fitted [math]Q[/math]-learning and the [math]Q[/math]-function is parameterized by [math]Q(s,a;\theta) = F((s,a);\theta)[/math] with [math]F[/math] a neural network function parameterized by [math]\theta[/math] [10]. For example, [math]F[/math] can be set as \eqref{eq:generator} with [math]\theta = (\pmb{W},\pmb{b})[/math]. In Neural Fitted [math]Q[/math]-learning, the algorithm starts with some random initialization of the [math]Q[/math]-values [math]Q(s, a; \theta^{(0)})[/math] where [math]\theta^{(0)}[/math] refers to the initial parameters. Then, an approximation of the [math]Q[/math]-values at the [math]n[/math]-th iteration [math]Q(s,a;\theta^{(n)})[/math] is updated towards the target value

[[math]] \begin{eqnarray}\label{eq:Y_update} Y^Q_n = r + \gamma \max_{a'\in \mathcal{A}}Q(s',a';\theta^{(n)}), \end{eqnarray} [[/math]]

where [math]\theta^{(n)}[/math] are the neural network parameters in the [math]n[/math]-th iteration, updated by stochastic gradient descent (or a variant) by minimizing the square loss:

[[math]] \begin{eqnarray} \mathcal{L}_{NFQ}(\theta^{(n)}) = \|Q(s,a;\theta^{(n)})-Y^Q_n\|_2^2. \end{eqnarray} [[/math]]


Thus, the [math]Q[/math]-learning update amounts to updating the parameters:

[[math]] \begin{eqnarray}\label{eq:deep_q_update} \theta^{(n+1)} = \theta^{(n)} +\beta \left(Y_n^Q-Q(s,a;\theta^{(n)})\right) \nabla_{\theta^{(n)}} Q(s,a;\theta^{(n)}) \end{eqnarray} [[/math]]

where [math]\beta[/math] is a learning rate. This update resembles stochastic gradient descent, updating the current value [math]Q(s, a; \theta^{(n)})[/math] towards the target value [math]Y_n^Q[/math]. When neural networks are applied to approximate the [math]Q[/math]-function, it has been empirically observed that Neural Fitted [math]Q[/math]-learning may suffer from slow convergence or even divergence [10]. In addition, the approximated [math]Q[/math]-values tend to be overestimated due to the max operator [11].

Deep [math]Q[/math]-Network (DQN). To overcome the instability issue and the risk of overestimation mentioned above, [12] proposed a Deep [math]Q[/math]-Network (DQN) algorithm in an online setting with two novel ideas. One idea is the slow-update of the target network and the second is the use of ‘experience replay’. Both these ideas dramatically improve the empirical performance of the algorithm and DQN has been shown to have a strong performance for a variety of ATARI games [13]. We first discuss the use of experience replay [14]. That is we introduce a replay memory [math]\mathcal{B}[/math]. At each time [math]t[/math], the tuple [math](s_t, a_t, r_t,s_{t+1})[/math] is stored in [math]\mathcal{B}[/math] and a mini-batch of {[math]B[/math]} independent samples is randomly selected from [math]\mathcal{B}[/math] to train the neural network via stochastic gradient descent. Since the trajectory of an MDP has strong temporal correlation, the goal of experience replay is to obtain uncorrelated samples that are more similar to the i.i.d data (often assumed in many optimization algorithms), which can give more accurate gradient estimation for the stochastic optimization problem and enjoy better convergence performance. For experience replay, the replay memory size is usually very large in practice. For example, the replay memory size is [math]10^6[/math] in [12]. Moreover, DQN uses the [math]\varepsilon[/math]-greedy policy, which enables exploration over the state-action space [math]\mathcal{S}\times \mathcal{A}[/math]. Thus, when the replay memory is large, experience replay is close to sampling independent transitions from an explorative policy. This reduces the variance of the gradient which is used to update [math]\theta[/math]. Thus, experience replay stabilizes the training of DQN, which benefits the algorithm in terms of computational efficiency.

We now discuss using a target network [math]Q(\cdot,\cdot;{\theta}^{-})[/math] with parameter [math]{\theta}^{-}[/math] (the current estimate of the parameter). With independent samples [math]\{(s_{(i)}, a_{(i)}, r_{(i)}, s_{(i)}^{\prime})\}_{i=0}^{B}[/math] from the replay memory (we use [math]s_{(i)}^{\prime}[/math] instead of [math]s_{(i+1)}[/math] for the state after [math](s_{(i)},a_{(i)})[/math] to avoid notational confusion with the next independent sample [math]s_{(i+1)}[/math] in the state space), to update the parameter [math]\theta[/math] of the [math]Q[/math]-network, we compute the target

[[math]] \begin{eqnarray}\label{eq:Y_update2} \widetilde{Y}_i^Q = r_{(i)} + \gamma \max_{a'\in\mathcal{A}}Q(s_{(i)}',a';\theta^{(n)-}), \end{eqnarray} [[/math]]

and update [math]\theta[/math] using the gradient of

[[math]] \begin{eqnarray} \mathcal{L}_{DQN}(\theta^{(n)},\theta^{(n)-}) =\frac{1}{B}\sum_{i=1}^B \left\|Q(s_{(i)},a_{(i)};\theta^{(n)})-\widetilde{Y}_i^Q\right\|_2^2. \end{eqnarray} [[/math]]

Whereas the parameter [math]\theta^{(n)-}[/math] is updated once every [math]T_{\rm target}[/math] steps by letting [math]\theta^{(n)-} = \theta^{(n)}[/math], if [math]n = m T_{\rm target}[/math] for some [math]m\in \mathbb{Z}^{+}[/math], and [math]\theta^{(n)-} = \theta^{(n-1)-}[/math] otherwise. That is, the target network is held fixed for [math]T_{\rm target}[/math] steps and then updated using the current weights of the [math]Q[/math]-network. The introduction of a target network prevents the rapid propagation of instabilities and it reduces the risk of divergence The idea of target networks can be seen as an instantiation of Fitted [math]Q[/math]-learning, where each period between target network updates corresponds to a single Fitted [math]Q[/math]-iteration.

Double Deep [math]Q[/math]-network (DQN). The max operator in Neural Fitted [math]Q[/math]-learning and DQN, in \eqref{eq:Y_update} and \eqref{eq:Y_update2}, uses the same values both to select and to evaluate an action. This makes it more likely to select overestimated values, resulting in over optimistic value estimates. To prevent this, Double [math]Q[/math]-learning [15] decouples the selection from the evaluation and this is further extended to the neural network setting [11]. In double [math]Q[/math]-learning [15] and double deep [math]Q[/math]-network [11], two value functions are learned by assigning experiences randomly to update one of the two value functions, resulting in two sets of weights, [math]\theta[/math] and [math]\eta[/math]. For each update, one set of weights is used to determine the greedy policy and the other to determine its value. For a clear comparison, we can untangle the selection and evaluation in Neural Fitted [math]Q[/math]-learning and rewrite its target \eqref{eq:Y_update} as

[[math]] \begin{eqnarray}\label{eq:Y_update_prime} Y^Q_n = r + \gamma\, Q(s',\arg\max_{a\in \mathcal{A}}Q(s',a;\theta^{(n)});\theta^{(n)}). \end{eqnarray} [[/math]]

The target of Double (deep) [math]Q[/math]-learning can then be written as

[[math]] \begin{eqnarray}\label{eq:Y_update3} \overline{Y}^Q_n = r + \gamma\, Q(s',\arg\max_{a\in \mathcal{A}}Q(s',a;\theta^{(n)});\eta^{(n)}). \end{eqnarray} [[/math]]

Notice that the selection of the action, in the argmax, is still due to the online weights [math]\theta^{(n)}[/math]. This means that, as in [math]Q[/math]-learning, we are still estimating the value of the greedy policy according to the current values, as defined by [math]\theta^{(n)}[/math]. However, we use the second set of weights [math]\eta^{(n)}[/math] to fairly compute the value of this policy. This second set of weights can be updated symmetrically by switching the roles of [math]\theta[/math] and [math]\eta[/math].


Convergence Guarantee. For DQN, [16] characterized the approximation error of the [math]Q[/math]-function by the sum of a statistical error and an algorithmic error, and the latter decays to zero at a geometric rate as the algorithm proceeds. The statistical error characterizes the bias and variance arising from the [math]Q[/math]-function approximation using the neural network. [17] parametrized the [math]Q[/math]-function by a two-layer neural network and provided a mean-squared sample complexity with sublinear convergence rate for neural TD learning. The two-layer network in [17] with width [math]m[/math] is given by

[[math]] \begin{equation}\label{eqn:two_layer_NN} F\big((s,a);\pmb{W}\big)=\frac{1}{\sqrt{m}}\sum_{l=1}^{m} c_l\cdot \sigma\big(\pmb{W}_l^\top (s,a) \big), \end{equation} [[/math]]

where the activation function [math]\sigma(\cdot)[/math] is set to be the ReLU function, and the parameter [math]\pmb{c}=(c_1,\ldots,c_m)[/math] is fixed at the initial parameter during the training process and only the weights [math]\pmb{W}[/math] are updated. [18] considered a more challenging setting than [17], where the input data are non-i.i.d and the neural network has multiple (more than two) layers and obtained the same sublinear convergence rate. Furthermore, [19] also employed the two-layer neural network in \eqref{eqn:two_layer_NN} and proved that two algorithms used in practice, the projection-free and max-norm regularized [20] neural TD, achieve mean-squared sample complexity of [math]\widetilde{\mathcal{O}}^\prime(1/\varepsilon^6)[/math] and [math]\widetilde{\mathcal{O}}^\prime(1/\varepsilon^4)[/math], respectively.

Deep Policy-based RL Algorithms

In this section we focus on deep policy-based methods, which are extensions of policy-based methods using neural network approximations. We parameterize the policy [math]\pi[/math] by a neural network [math]F[/math] with parameter [math]\theta=(\pmb{W}, \pmb{b})[/math], that is, [math]a\sim\pi(s,a;\theta)=f(F((s,a);\theta))[/math] for some function [math]f[/math]. A popular choice of [math]f[/math] is given by

[[math]] f(F((s,a);\theta))=\frac{\exp(\tau F((s,a);\theta))}{\sum_{a^\prime\in\mathcal{A}}\exp(\tau F((s,a^\prime);\theta))}, [[/math]]

for some parameter [math]\tau[/math], which gives an energy-based policy (see, e.g. [21][22]). The policy parameter [math]\theta[/math] is updated using the gradient ascent rule given by

[[math]] \theta^{(n+1)} = \theta^{(n)} + \beta \widehat{\nabla_{\theta}J(\theta^{(n)})}, \qquad n=0,\ldots,N-1, [[/math]]

where [math]\widehat{\nabla_{\theta}J(\theta^{(n)})}[/math] is an estimate of the policy gradient.

Using neural networks to parametrize the policy and/or value functions in the vanilla version of policy-based methods discussed in Section leads to neural Actor-Critic algorithms [22], neural PPO/TRPO [23], and deep DPG (DDPG) [24]. In addition, since introducing an entropy term in the objective function encourages policy exploration [21] and speeds the learning process [25][26] (as discussed in Section), there have been some recent developments in (off-policy) soft Actor-Critic algorithms [25][27] using neural networks, which solve the RL problem with entropy regularization. Below we introduce the DDPG algorithm, which is one of the most popular deep policy-based methods, and which has been applied in many financial problems.

DDPG. DDPG is a model-free off-policy Actor-Critic algorithm, first introduced in [24], which combines the DQN and DPG algorithms. Since its structure is more complex than DQN and DPG, we provide the pseudocode for DDPG in Algorithm. DDPG extends the DQN to continuous action spaces by incorporating DPG to learn a deterministic strategy. To encourage exploration, DDPG uses the following action

[[math]] a_t \sim \pi^D(s_t;\theta_t) + \epsilon, [[/math]]

where [math]\pi^D[/math] is a deterministic policy and [math]\epsilon[/math] is a random variable sampled from some distribution [math]\mathcal{N}[/math], which can be chosen according to the environment. Note that the algorithm requires a small learning rate [math]\bar{\beta}\ll 1[/math] (see line 14 in Algorithm) to improve the stability of learning the target networks. In the same way as DQN (see Section), the DDPG algorithm also uses a replay buffer to improve the performance of neural networks.

The DDPG Algorithm
input: an actor [math]\pi^D(s;\theta)[/math], a critic network [math]Q(s,a;\phi)[/math], learning rates [math]\beta[/math] and [math]\bar{\beta}[/math], initial parameters [math]\theta^{(0)}[/math] and [math]\phi^{(0)}[/math]
Initialize target networks parameters [math]\bar{\phi}^{(0)}\leftarrow \phi^{(0)}[/math] and [math]\bar{\theta}^{(0)}\leftarrow \theta^{(0)}[/math],
Initialize replay buffer [math]\mathcal{B}[/math]
if [math]n=0,\ldots,N-1[/math] then
Initialize state [math]s_1[/math]
if [math]t=0,\ldots,M-1[/math] then
Select action [math]a_t\sim\pi^D(s_t;\theta^{(n)})+\epsilon_t[/math] with [math]\epsilon_t\sim\mathcal{N}[/math]
Execute action [math]a_t[/math] and observe reward [math]r_t[/math] and observe new state [math]s_{t+1}[/math]
Store transition [math](s_t,a_t,r_t,s_{t+1})[/math] in [math]\mathcal{B}[/math]
if [math]|\mathcal{B}| \gt B[/math] then
Sample a random mini-batch of [math]B[/math] transitions [math]\{(s_{(i)},a_{(i)},r_{(i)},s_{(i+1)})\}_{i=1}^B[/math] from [math]\mathcal{B}[/math]
Set the target [math]Y_i = r_i+\gamma Q(s_{i+1},\pi^D(s_{i+1};\bar{\theta}^{(n)});\bar{\phi}^{(n)})[/math].
Update the critic by minimizing the loss: [math]\phi^{(n+1)}=\phi^{(n)}-\beta\nabla_{\phi}\mathcal{L}_{\rm DDPG}(\phi^{(n)})[/math] with [math]\mathcal{L}_{\rm DDPG}(\phi^{(n)})=\frac{1}{B}\sum_{i=1}^B (Y_i-Q(s_i,a_i;\phi^{(n)}))^2[/math].
Update the actor by using the sampled policy gradient: [math] \theta^{(n+1)}=\theta^{(n)}+\beta\nabla_{\theta}J(\theta^{(n)})[/math] with

[[math]] \nabla_{\theta}J(\theta^{(n)})\approx \frac{1}{B}\sum_{i=1}^B\nabla_a Q(s,a;\phi^{(n)})|_{s=s_i,a=a_i}\nabla_{\theta}\pi^D(s;\theta^{(n)})|_{s=s_i}. [[/math]]

Update the target networks:

[[math]] \begin{eqnarray*} \bar{\phi}^{(n+1)}&\leftarrow& \bar{\beta} \phi^{(n+1)} + (1-\bar{\beta}) \bar{\phi}^{(n)}\\ \bar{\theta}^{(n+1)}&\leftarrow& \bar{\beta} \theta^{(n+1)} + (1-\bar{\beta}) \bar{\theta}^{(n)} \end{eqnarray*} [[/math]]

end if
end for
end for


Convergence Guarantee. By parameterizing the policy and/or value functions using a two-layer neural network given in \eqref{eqn:two_layer_NN}, [23] provided a mean-squared sample complexity for neural PPO and TRPO algorithms with sublinear convergence rate; [22] studied neural Actor-Critic methods where the actor updates using (1) vanilla policy gradient or (2) natural policy gradient, and in both cases the critic updates using TD(0). They proved that in case (1) the algorithm converges to a stationary point at a sublinear rate and they also established the global optimality of all stationary points under mild regularity conditions. In case (2) the algorithm was proved to achieve a mean-squared sample complexity with sublinear convergence rate. To the best of our knowledge, no convergence guarantee has been established for the DDPG (and DPG) algorithms.

General references

Hambly, Ben; Xu, Renyuan; Yang, Huining (2023). "Recent Advances in Reinforcement Learning in Finance". arXiv:2112.04553 [q-fin.MF].

References

  1. Y.LECUN and Y.BENGIO, Convolutional networks for images, speech, and time series, The Handbook of Brain Theory and Neural Networks, 3361 (1995), p.1995.
  2. 2.0 2.1 I.SUTSKEVER, J.MARTENS, and G.E. Hinton, Generating text with recurrent neural networks, in ICML, 2011.
  3. J.JIANG, B.T. Kelly, and D.XIU, (Re-) Imag (in) ing price trends, Chicago Booth Research Paper, (2020).
  4. J.FAN, C.MA, and Y.ZHONG, A selective overview of deep learning, Statistical Science, 36 (2021).
  5. Y.E. Nesterov, A method for solving the convex programming problem with convergence rate o (1/k\^{} 2), in Dokl. akad. nauk Sssr, vol.269, 1983, pp.543--547.
  6. T.TIELEMAN and G.HINTON, Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude, COURSERA: Neural Networks for Machine Learning, 4 (2012), pp.26--31.
  7. D.P. Kingma and J.BA, Adam: A method for stochastic optimization, in Proceedings of the 3rd International Conference on Learning Representations (ICLR), 2015.
  8. V.FRAN{\cc}ois-Lavet, P.HENDERSON, R.ISLAM, M.G. Bellemare, and J.PINEAU, An introduction to deep reinforcement learning, Foundations and Trends in Machine Learning, 11 (2018), pp.219--354.
  9. G.J. Gordon, Stable fitted reinforcement learning, in Advances in Neural Information Processing Systems, 1996, pp.1052--1058.
  10. 10.0 10.1 M.RIEDMILLER, Neural fitted Q iteration--first experiences with a data efficient neural reinforcement learning method, in European Conference on Machine Learning, Springer, 2005, pp.317--328.
  11. 11.0 11.1 11.2 H.VANHASSELT, A.GUEZ, and D.SILVER, Deep reinforcement learning with double Q-learning, in Proceedings of the AAAI Conference on Artificial Intelligence, vol.30, 2016.
  12. 12.0 12.1 V.MNIH, K.KAVUKCUOGLU, D.SILVER, A.A. Rusu, J.VENESS, M.G. Bellemare, A.GRAVES, M.RIEDMILLER, A.K. Fidjeland, G.OSTROVSKI, etAL., Human-level control through deep reinforcement learning, Nature, 518 (2015), pp.529--533.
  13. M.G. Bellemare, Y.NADDAF, J.VENESS, and M.BOWLING, The Arcade Learning Environment: An evaluation platform for general agents, Journal of Artificial Intelligence Research, 47 (2013), pp.253--279.
  14. L.-J. Lin, Self-improving reactive agents based on reinforcement learning, planning and teaching, Machine Learning, 8 (1992), pp.293--321.
  15. 15.0 15.1 H.HASSELT, Double Q-learning, Advances in Neural Information Processing Systems, 23 (2010), pp.2613--2621.
  16. J.FAN, Z.WANG, Y.XIE, and Z.YANG, A theoretical analysis of deep Q-learning, in Learning for Dynamics and Control, PMLR, 2020, pp.486--489.
  17. 17.0 17.1 17.2 Q.CAI, Z.YANG, J.LEE, and Z.WANG, Neural temporal-difference learning converges to global optima, in Advances in Neural Information Processing Systems, 2019.
  18. P.XU and Q.GU, A finite-time analysis of Q-learning with neural network function approximation, in International Conference on Machine Learning, PMLR, 2020, pp.10555--10565.
  19. S.CAYCI, S.SATPATHI, N.HE, and R.SRIKANT, Sample complexity and overparameterization bounds for projection-free neural TD learning, arXiv preprint arXiv:2103.01391, (2021).
  20. I.GOODFELLOW, D.WARDE-Farley, M.MIRZA, A.COURVILLE, and Y.BENGIO, Maxout networks, in International Conference on Machine Learning, PMLR, 2013, pp.1319--1327.
  21. 21.0 21.1 T.HAARNOJA, H.TANG, P.ABBEEL, and S.LEVINE, Reinforcement learning with deep energy-based policies, in International Conference on Machine Learning, PMLR, 2017, pp.1352--1361.
  22. 22.0 22.1 22.2 L.WANG, Q.CAI, Z.YANG, and Z.WANG, Neural policy gradient methods: Global optimality and rates of convergence, in International Conference on Learning Representations, 2020.
  23. 23.0 23.1 B.LIU, Q.CAI, Z.YANG, and Z.WANG, Neural trust region/proximal policy optimization attains globally optimal policy, in Advances in Neural Information Processing Systems, vol.32, 2019.
  24. 24.0 24.1 T.P. Lillicrap, J.J. Hunt, A.PRITZEL, N.HEESS, T.EREZ, Y.TASSA, D.SILVER, and D.WIERSTRA, Continuous control with deep reinforcement learning, in 4th International Conference on Learning Representations (ICLR), 2016.
  25. 25.0 25.1 T.HAARNOJA, A.ZHOU, P.ABBEEL, and S.LEVINE, Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor, in International Conference on Machine Learning, PMLR, 2018, pp.1861--1870.
  26. J.MEI, C.XIAO, C.SZEPESVARI, and D.SCHUURMANS, On the global convergence rates of softmax policy gradient methods, in International Conference on Machine Learning, PMLR, 2020, pp.6820--6829.
  27. T.HAARNOJA, A.ZHOU, K.HARTIKAINEN, G.TUCKER, S.HA, J.TAN, V.KUMAR, H.ZHU, A.GUPTA, P.ABBEEL, etAL., Soft actor-critic algorithms and applications, arXiv preprint arXiv:1812.05905, (2018).