Further Developments for Mathematical Finance and Reinforcement Learning

The general RL algorithms developed in the machine learning literature are good starting points for financial applications. A possible drawback is that such general algorithms tend to overfit, using more information than is actually required for a particular application. The stochastic control approach to many financial decision-making problems, on the other hand, may suffer from the risk of model mis-specification, but it can capture the essential features of a given financial application from a modeling perspective, in terms of the dynamics and the reward function. One promising direction for RL in finance is therefore a closer integration of the modeling techniques (the domain knowledge) from the stochastic control literature and the key components of a given financial application (for example, the adverse selection risk for market-making problems and the execution risk for optimal liquidation problems) with the learning power of RL algorithms. Such an integrated framework is interesting from both theoretical and applied perspectives. From the applied point of view, a modified RL algorithm, with designs tailored to one particular financial application, could lead to better empirical performance; this could be verified by comparison with existing algorithms on the available datasets. In addition, financial applications motivate potential new frameworks and testbeds for RL algorithms. Carrying out convergence and sample complexity analysis for these modified algorithms would also be a meaningful direction in which to proceed. Many of the papers referenced in this review provide good initial steps in this direction. We list the following future directions that the reader may find interesting.


Risk-aware or Risk-sensitive RL. Risk arises from the uncertainties associated with future events, and is inevitable since the consequences of actions are uncertain at the time a decision is made. Many decision-making problems in finance lead to trading strategies, and it is important to account for the risk of the proposed strategies, which could be measured, for instance, by the maximum draw-down, the variance, or the 5% value-at-risk of the revenue. Hence it would be interesting to include risk measures in the design of RL algorithms for financial applications. The challenge of risk-sensitive RL lies both in the non-linearity of the objective function with respect to the reward and in designing a risk-aware exploration mechanism. RL with risk-sensitive utility functions has been studied in several papers without regard to specific financial applications. The work of [1] proposes TD(0) and Q-learning-style algorithms that transform temporal differences instead of cumulative rewards, and proves their convergence. Risk-sensitive RL with a general family of utility functions is studied in [2], which also proposes a Q-learning algorithm with convergence guarantees. The work of [3] studies a risk-sensitive policy gradient algorithm, though with no theoretical guarantees. [4] considers the problem of risk-sensitive RL with exponential utility and proposes two efficient model-free algorithms, Risk-sensitive Value Iteration (RSVI) and Risk-sensitive Q-learning (RSQ), with a near-optimal sample complexity guarantee. [5] develops a martingale approach to learn policies that are sensitive to the uncertainty of the rewards and are meaningful under some market scenarios. Another line of work focuses on constrained RL problems with different risk criteria [6][7][8][9][10][11]. Very recently, [12] proposed a robust risk-aware reinforcement learning framework via robust optimization with a rank-dependent expected utility function; financial applications such as statistical arbitrage and portfolio optimization are discussed there with detailed numerical examples. [13] develops a framework combining policy-gradient-based RL methods and dynamic convex risk measures for solving time-consistent risk-sensitive stochastic optimization problems. However, neither sample complexity nor asymptotic convergence has been studied for the algorithms proposed in [12][13].
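To make the idea of [1] concrete, here is a minimal, purely illustrative tabular sketch (not taken from any of the cited papers) of a Q-learning step in which the temporal difference, rather than the cumulative reward, is passed through an asymmetric risk transform; all parameter values are arbitrary:

```python
import numpy as np

def risk_transform(td, kappa):
    """Asymmetric transform: for kappa in (0, 1), positive surprises are
    down-weighted and negative ones up-weighted, making the agent risk-averse."""
    return (1.0 - kappa) * td if td > 0 else (1.0 + kappa) * td

def risk_sensitive_q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95, kappa=0.5):
    """One tabular Q-learning step where the TD error (not the reward itself)
    is transformed before being applied."""
    td = r + gamma * np.max(Q[s_next]) - Q[s, a]
    Q[s, a] += alpha * risk_transform(td, kappa)
    return Q

# toy example: two states, two actions, one negative-reward transition
Q = np.zeros((2, 2))
Q = risk_sensitive_q_update(Q, s=0, a=1, r=-1.0, s_next=1, kappa=0.5)
```

With `kappa = 0.5`, the negative TD error of -1 is amplified to -1.5 before the learning-rate scaling, so the agent reacts more strongly to losses than to equal-sized gains.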

Offline Learning and Online Exploration. Online learning requires updating algorithm parameters in real time, which is impractical for many financial decision-making problems, especially in the high-frequency regime. The most plausible setting is to collect data with a pre-specified exploration scheme during trading hours and to update the algorithm with the newly collected data after the close of trading. This is closely related to the translation of online learning to offline regression [14] and to RL with batch data [15][16][17][18]. However, these developments focus on general methodologies and are not specifically tailored to financial applications.
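The daily collect-then-update cycle described above can be sketched, in a deliberately simplified tabular form with a made-up log format, as a batch replay of the day's transitions after the close:

```python
import numpy as np

def batch_q_update(Q, transitions, alpha=0.1, gamma=0.95):
    """Offline pass: replay the day's logged (s, a, r, s') transitions
    after the close instead of updating in real time."""
    for s, a, r, s_next in transitions:
        td = r + gamma * np.max(Q[s_next]) - Q[s, a]
        Q[s, a] += alpha * td
    return Q

# hypothetical day of logged (state, action, reward, next_state) tuples
day_log = [(0, 1, 1.0, 1), (1, 0, -0.5, 0), (0, 1, 1.0, 1)]
Q = batch_q_update(np.zeros((2, 2)), day_log)
```

During trading hours the exploration scheme only appends to the log; all parameter changes happen in the single overnight pass.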


Learning with a Limited Exploration Budget. Exploration can help agents find new policies that improve their future cumulative rewards. However, too much exploration is both time-consuming and computationally expensive, and in particular it may be very costly for some financial applications. Moreover, exploring black-box trading strategies may require substantial justification within a financial institution, so investors tend to limit the effort put into exploration and try to improve performance as much as possible within a given exploration budget. This idea is similar in spirit to conservative RL, where agents explore new strategies to maximize revenue while maintaining their revenue above a fixed baseline, uniformly over time [19]. It is also related to the problem of costly information acquisition, which has been studied for economic commodities [20] and in operations management [21]. It may also be interesting to investigate such costs for decision-making problems in financial markets.
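One hypothetical way to encode such a baseline-protected exploration rule, in the spirit of (but much simpler than) the conservative bandits of [19], is to explore only while cumulative revenue stays above a fraction of what the baseline strategy would have earned; all names and parameters here are illustrative:

```python
import numpy as np

def conservative_choice(means_est, counts, cum_reward, baseline_per_step, t,
                        frac=0.9):
    """Pick the least-tried arm (exploration) only while cumulative reward
    remains above frac * (baseline revenue so far); otherwise fall back to
    the greedy arm as a conservative safeguard."""
    if cum_reward >= frac * baseline_per_step * t:
        return int(np.argmin(counts))   # budgeted exploration step
    return int(np.argmax(means_est))    # conservative fallback
```

The constraint is checked uniformly over time: every round the rule re-verifies that exploration so far has not pushed revenue below the protected level before spending more of the budget.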

Learning with Multiple Objectives. A common problem in finance is to choose a portfolio in the face of two conflicting objectives: the desire for the expected value of portfolio returns to be as high as possible, and the desire for the risk, often measured by the standard deviation of portfolio returns, to be as low as possible. This problem is often represented by a graph in which the efficient frontier shows the best combinations of risk and expected return available, and indifference curves show the investor's preferences for various risk-return combinations. Decision makers sometimes combine both criteria into a single objective function consisting of the expected reward minus a scalar multiple of the risk. However, combining the relevant criteria linearly may well not be in the best interest of the decision maker for certain applications. For example, market makers in OTC markets tend to view criteria such as turnaround time, balance sheet constraints, inventory cost, and profit and loss as separate objective functions. The study of multi-objective RL is still at a preliminary stage; relevant references include [22] and [23].
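The linear scalarization discussed above can be made concrete with a two-asset toy example; all numbers are made up for illustration, and the grid search is a deliberately crude stand-in for a proper optimizer:

```python
import numpy as np

def scalarized_objective(weights, mu, Sigma, lam):
    """Linear scalarization: expected return minus lam times variance."""
    return weights @ mu - lam * (weights @ Sigma @ weights)

# hypothetical two-asset market: asset 0 is riskier but higher-return
mu = np.array([0.10, 0.05])
Sigma = np.array([[0.04, 0.00],
                  [0.00, 0.01]])

# sweep the risk-aversion weight lam; each lam picks out one frontier point
ws = [np.array([w, 1 - w]) for w in np.linspace(0, 1, 101)]
best = {lam: max(ws, key=lambda w: scalarized_objective(w, mu, Sigma, lam))
        for lam in (0.5, 2.0, 10.0)}
```

As `lam` grows, the optimal weight on the riskier asset shrinks, tracing out the efficient frontier one point per `lam`. The limitation noted in the text is that a single linear combination cannot represent a decision maker who treats the criteria as genuinely separate objectives.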


Learning to Allocate Across Lit Pools and Dark Pools. Online optimization methods explored in [24] and [25] for dark pool allocations can be viewed as a single-period RL algorithm and the Bayesian framework developed in [26] for allocations across lit pools may be classified as a model-based RL approach. However, there is currently no existing work on applying multi-period and model-free RL methods to learn how to route orders across both dark pools and lit pools. We think this might be an interesting direction to explore as agents sometimes have access to both lit pools and dark pools and these two contrasting pools have quite different information structures and matching mechanisms.
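As a toy illustration of the routing problem, here is a generic multiplicative-weights allocation update across venues; this is a sketch under the assumption that per-venue fill rates are observable each period, and it is not the algorithm of any of the cited papers:

```python
import numpy as np

def update_allocation(weights, fills, eta=0.1):
    """Multiplicative-weights style update: venues that filled a larger
    share of their allocated volume receive a larger allocation next period."""
    w = weights * np.exp(eta * fills)
    return w / w.sum()

# hypothetical fill rates at three venues (say, two lit pools and one dark pool)
w = np.ones(3) / 3
w = update_allocation(w, fills=np.array([0.9, 0.4, 0.1]))
```

A genuinely multi-period, model-free treatment would also have to handle the censored feedback of dark pools (only the filled quantity is observed, not the available liquidity), which this one-line update ignores.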


Robo-advising in a Model-free Setting. As introduced earlier, [27] considers learning within a set of m pre-specified investment portfolios, while [28] and [29] develop, respectively, learning algorithms and procedures to infer risk preferences under the framework of Markowitz mean-variance portfolio optimization. It would be interesting to consider a model-free RL approach in which the robo-advisor has the freedom to learn and improve decisions beyond a pre-specified set of strategies or the Markowitz framework.


Sample Efficiency in Learning Trading Strategies. In recent years, sample complexity has been studied extensively as a way to understand modern reinforcement learning algorithms (see the sections from The Basics of Reinforcement Learning to Deep Value-based RL Algorithms). However, most RL algorithms still require a large number of samples to train a decent trading strategy, possibly exceeding the amount of relevant historical data available. Financial time series are known to be non-stationary [30], so historical data from further back in time may not be helpful for training efficient learning algorithms in the current market environment. This leads to the important questions of designing more sample-efficient RL algorithms for financial applications and of developing good market simulators that can generate (unlimited) realistic market scenarios [31].
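One simple, purely illustrative response to this non-stationarity is to weight training samples by recency, so that batches favor the current regime; the half-life below is an arbitrary choice, not a recommendation:

```python
import numpy as np

def recency_weighted_sample(n_history, n_samples, half_life=250, seed=0):
    """Sample indices of historical observations with probability halving
    every `half_life` steps of age, so training batches favor recent data."""
    rng = np.random.default_rng(seed)
    age = np.arange(n_history)[::-1].astype(float)  # age 0 = most recent point
    p = 0.5 ** (age / half_life)
    p /= p.sum()
    return rng.choice(n_history, size=n_samples, p=p)

# most sampled indices come from the recent end of a 1000-step history
idx = recency_weighted_sample(1000, 5000)
```

This trades sample efficiency against staleness: a shorter half-life tracks the current environment more closely but effectively shrinks the training set, which is exactly the tension the text describes.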


Transfer Learning and Cold Start for Learning New Assets. Financial institutions and individuals may change the baskets of assets they trade over time. Possible reasons are that new assets (for example, corporate bonds) are issued from time to time, or that investors switch their interest from one sector to another. There are two interesting research directions related to this situation. First, when an investor has a good trading strategy, trained by an RL algorithm for one asset, how should they transfer that experience to train a trading algorithm for a "similar" asset with fewer samples? This is closely related to transfer learning [32][33]; to the best of our knowledge, no study along this direction has been carried out for financial applications. The second question is the cold-start problem for newly issued assets: when we have very limited data for a new asset, how should we initialize an RL algorithm and learn a decent strategy using the limited available data together with our experience (i.e., the trained RL algorithms or data) with other long-standing assets?
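A toy sketch of the warm-start idea for a tabular agent: initialize the new asset's value estimates from a "similar" asset's trained table, shrunk toward zero to reflect lower confidence. The shrinkage factor is an illustrative assumption, not a method from the literature:

```python
import numpy as np

def warm_start(Q_source, shrink=0.5):
    """Initialize a new asset's Q-table from a similar asset's trained table,
    shrunk toward zero so that early data on the new asset can dominate."""
    return shrink * Q_source.copy()

# hypothetical trained table for a long-standing asset
Q_old = np.array([[1.0, -0.2],
                  [0.3,  0.8]])
Q_new = warm_start(Q_old)
```

For deep RL agents the analogous move is to copy (and possibly freeze parts of) the trained network weights, then fine-tune on the limited new-asset data.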

Acknowledgement We thank Xuefeng Gao, Anran Hu, Xiao-Yang Liu, Wenpin Tang, Ziyi Xia, Zhuoran Yang, Junzi Zhang and Zeyu Zheng for helpful discussions and comments on this survey.

Potential Danger: Algorithmic Collusion

  • Artificial intelligence, algorithmic pricing, and collusion, Calvano, E., Calzolari, G., Denicolo, V. and Pastorello, S., 2020. American Economic Review, 110(10), pp. 3267--3297.
  • Algorithmic collusion with imperfect monitoring, Calvano, E., Calzolari, G., Denicolo, V. and Pastorello, S., 2021. International Journal of Industrial Organization, p. 102712.

General references

Hambly, Ben; Xu, Renyuan; Yang, Huining (2023). "Recent Advances in Reinforcement Learning in Finance". arXiv:2112.04553 [q-fin.MF].

References

  1. O. Mihatsch and R. Neuneier, Risk-sensitive reinforcement learning, Machine Learning, 49 (2002), pp. 267--290.
  2. Y. Shen, M. J. Tobia, T. Sommer, and K. Obermayer, Risk-sensitive reinforcement learning, Neural Computation, 26 (2014), pp. 1298--1328.
  3. H. Eriksson and C. Dimitrakakis, Epistemic risk-sensitive reinforcement learning, arXiv preprint arXiv:1906.06273, (2019).
  4. Y. Fei, Z. Yang, Y. Chen, Z. Wang, and Q. Xie, Risk-sensitive reinforcement learning: Near-optimal risk-sample tradeoff in regret, in NeurIPS, 2020.
  5. N. Vadori, S. Ganesh, P. Reddy, and M. Veloso, Risk-sensitive reinforcement learning: A martingale approach to reward uncertainty, arXiv preprint arXiv:2006.12686, (2020).
  6. J. Achiam, D. Held, A. Tamar, and P. Abbeel, Constrained policy optimization, in International Conference on Machine Learning, PMLR, 2017, pp. 22--31.
  7. Y. Chow, M. Ghavamzadeh, L. Janson, and M. Pavone, Risk-constrained reinforcement learning with percentile risk criteria, The Journal of Machine Learning Research, 18 (2017), pp. 6070--6120.
  8. Y. Chow, A. Tamar, S. Mannor, and M. Pavone, Risk-sensitive and robust decision-making: A CVaR optimization approach, in NIPS'15, MIT Press, 2015, pp. 1522--1530.
  9. D. Ding, X. Wei, Z. Yang, Z. Wang, and M. Jovanovic, Provably efficient safe exploration via primal-dual policy optimization, in International Conference on Artificial Intelligence and Statistics, PMLR, 2021, pp. 3304--3312.
  10. A. Tamar, Y. Chow, M. Ghavamzadeh, and S. Mannor, Policy gradient for coherent risk measures, in Advances in Neural Information Processing Systems, vol. 28, 2015.
  11. L. Zheng and L. Ratliff, Constrained upper confidence reinforcement learning, in Learning for Dynamics and Control, PMLR, 2020, pp. 620--629.
  12. S. Jaimungal, S. M. Pesenti, Y. S. Wang, and H. Tatsat, Robust risk-aware reinforcement learning, Available at SSRN 3910498, (2021).
  13. A. Coache and S. Jaimungal, Reinforcement learning with dynamic convex risk measures, arXiv preprint arXiv:2112.13414, (2021).
  14. D. Simchi-Levi and Y. Xu, Bypassing the monster: A faster and simpler optimal algorithm for contextual bandits under realizability, Available at SSRN 3562765, (2020).
  15. J. Chen and N. Jiang, Information-theoretic considerations in batch reinforcement learning, in International Conference on Machine Learning, PMLR, 2019, pp. 1042--1051.
  16. Z. Gao, Y. Han, Z. Ren, and Z. Zhou, Batched multi-armed bandits problem, in Advances in Neural Information Processing Systems, vol. 32, 2019.
  17. E. Garcelon, M. Ghavamzadeh, A. Lazaric, and M. Pirotta, Conservative exploration in reinforcement learning, in International Conference on Artificial Intelligence and Statistics, PMLR, 2020, pp. 1431--1441.
  18. Z. Ren and Z. Zhou, Dynamic batch learning in high-dimensional sparse linear contextual bandits, arXiv preprint arXiv:2008.11918, (2020).
  19. Y. Wu, R. Shariff, T. Lattimore, and C. Szepesvári, Conservative bandits, in International Conference on Machine Learning, PMLR, 2016, pp. 1254--1262.
  20. L. Pomatto, P. Strack, and O. Tamuz, The cost of information, arXiv preprint arXiv:1812.04211, (2018).
  21. T. T. Ke, Z.-J. M. Shen, and J. M. Villas-Boas, Search for information on multiple products, Management Science, 62 (2016), pp. 3576--3603.
  22. D. Zhou, J. Chen, and Q. Gu, Provable multi-objective reinforcement learning with generative models, arXiv preprint arXiv:2011.10134, (2020).
  23. R. Yang, X. Sun, and K. Narasimhan, A generalized algorithm for multi-objective reinforcement learning and policy adaptation, in Advances in Neural Information Processing Systems, vol. 32, 2019.
  24. A. Agarwal, P. Bartlett, and M. Dama, Optimal allocation strategies for the dark pool problem, in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, JMLR Workshop and Conference Proceedings, 2010, pp. 9--16.
  25. K. Ganchev, Y. Nevmyvaka, M. Kearns, and J. W. Vaughan, Censored exploration and the dark pool problem, Communications of the ACM, 53 (2010), pp. 99--107.
  26. B. Baldacci and I. Manziuk, Adaptive trading strategies across liquidity pools, arXiv preprint arXiv:2008.07807, (2020).
  27. H. Alsabah, A. Capponi, O. Ruiz Lacedelli, and M. Stern, Robo-advising: Learning investors' risk preferences via portfolio choices, Journal of Financial Econometrics, 19 (2021), pp. 369--392.
  28. H. Wang and S. Yu, Robo-advising: Enhancing investment with inverse optimization and deep reinforcement learning, arXiv preprint arXiv:2105.09264, (2021).
  29. S. Yu, H. Wang, and C. Dong, Learning risk preferences from investment portfolios using inverse optimization, arXiv preprint arXiv:2010.01687, (2020).
  30. N. E. Huang, M.-L. Wu, W. Qu, S. R. Long, and S. S. Shen, Applications of Hilbert--Huang transform to non-stationary financial time series analysis, Applied Stochastic Models in Business and Industry, 19 (2003), pp. 245--268.
  31. M. Wiese, R. Knobloch, R. Korn, and P. Kretschmer, Quant GANs: Deep generation of financial time series, Quantitative Finance, 20 (2020), pp. 1419--1440.
  32. L. Torrey and J. Shavlik, Transfer learning, in Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques, IGI Global, 2010, pp. 242--264.
  33. S. J. Pan and Q. Yang, A survey on transfer learning, IEEE Transactions on Knowledge and Data Engineering, 22 (2009), pp. 1345--1359.