The trading system takes the form F_t = G(\theta_t; F_{t-1}, I_t), in which \theta_t is the learned weights between risky and riskless assets, and I_t is the information filtration including the security price series z and other market information y over the time up to t: I_t = (z_t, z_{t-1}, \ldots; y_t, y_{t-1}, \ldots).

The F_t mentioned above denotes the asset allocation between risky and riskless assets in the portfolio at time t. Moreover, from time t-1 to time t, the risky asset, which is the market portfolio, also changes the relative weights among all its securities because of their price/volume fluctuations. We also define the market return of asset i during time period t as r_t^i = z_t^i / z_{t-1}^i - 1. At any time step, the weights of the risky assets sum to one:

\sum_{i=1}^{m} F_t^i = 1    (1)

[Figure 1. Optimal Portfolio by CAPM (vertical axis: return).]

2.2. Profit and Wealth for the Portfolios

A natural selection of performance criterion is the cumulative profit of the investment account. In discrete time series, Additive Profits are applied to accounts whose risky asset is in the form of standard contracts, such as futures contracts on USD/EUR. In our case, since we use the market portfolio as the risky asset, with every security fluctuating all the time, we use Multiplicative Profits to calculate the wealth:

W_T = W_0 \prod_{t=1}^{T} (1 + R_t) = W_0 \prod_{t=1}^{T} \left(1 + (1 - F_{t-1}) r_t^f + F_{t-1} r_t\right)\left(1 - \delta \left| F_t - F_{t-1} \right|\right)    (2)

in which W_T is the accumulated wealth in our account up to time T, W_0 is the initial wealth, r_t^f is the risk-free rate, such as the interest rate for cash or T-bills, and \delta is the transaction cost rate. When the interest rate is ignored (r_t^f = 0), a simplified expression for the period return R_t follows, written in terms of \tilde{F}_t^i, which is the effective portfolio weight of asset i before adjusting.

2.3. Performance Criterion

Traditionally, we use a utility function to express people's welfare at a given level of wealth [13]:

U_\gamma(W_t) = W_t^\gamma / \gamma \quad \text{for } \gamma \neq 0, \qquad U_0(W_t) = \log W_t \quad \text{for } \gamma = 0    (5)

When \gamma = 1, investors only care about the absolute wealth or profit and can be viewed as risk neutral; \gamma > 1 indicates that people prefer to take more risk and are described as risk-seeking; \gamma < 1 corresponds to the risk-averse case, which means investors are more sensitive to losses than to gains. By the assumption of CAPM, rational investors fall into this last category. We take \gamma = 0.5 in our simulation.
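To make the bookkeeping concrete, here is a minimal Python sketch of the multiplicative wealth update in equation (2) and the power utility in equation (5). The return series, the weight series, and the transaction cost rate delta = 0.001 are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def cumulative_wealth(r, F, rf=0.0, delta=0.001, W0=1.0):
    """Multiplicative wealth of eq. (2): the risky weight F[t-1] earns r[t],
    the remainder earns the risk-free rate rf, and rebalancing from F[t-1]
    to F[t] pays a proportional transaction cost delta."""
    W = W0
    for t in range(1, len(r)):
        gross = 1.0 + (1.0 - F[t - 1]) * rf + F[t - 1] * r[t]
        cost = 1.0 - delta * abs(F[t] - F[t - 1])
        W *= gross * cost
    return W

def utility(W, gamma=0.5):
    """Power utility of eq. (5), with log utility as the gamma = 0 case."""
    return np.log(W) if gamma == 0 else W**gamma / gamma

# Illustrative example: ten periods of random returns and weights in [0, 1].
rng = np.random.default_rng(0)
r = rng.normal(0.001, 0.01, size=10)
F = np.clip(rng.normal(0.5, 0.2, size=10), 0.0, 1.0)
print(cumulative_wealth(r, F), utility(cumulative_wealth(r, F)))
```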
Besides cumulative wealth, risk-adjusted return is more widely used in the financial industry as the input for the value function. As suggested by modern portfolio theory, the Sharpe Ratio is the most widely accepted measure of risk-adjusted return. Denoting the time series of returns as R_t, we have the Sharpe Ratio definition:

S_T = \frac{\mathrm{Mean}(R_t)}{\mathrm{Std.Deviation}(R_t)} = \frac{\frac{1}{T}\sum_{t=1}^{T} R_t}{\sqrt{\frac{1}{T}\sum_{t=1}^{T} \left(R_t - \bar{R}\right)^2}}, \qquad \bar{R} = \frac{1}{T}\sum_{t=1}^{T} R_t    (6)

Notice that in both cases discussed above, the wealth and the Sharpe Ratio can be expressed as functions of R_t, so the performance criteria considered so far can all be written as:

U(R_T, \ldots, R_t, \ldots, R_1; W_0)    (7)
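As a quick sanity check on equation (6), here is a small Python helper; note that it uses the population (1/T) standard deviation, exactly as the formula is written.

```python
import numpy as np

def sharpe_ratio(R):
    """Sharpe Ratio of eq. (6): mean return divided by the population
    standard deviation of the return series."""
    R = np.asarray(R, dtype=float)
    return R.mean() / R.std()  # np.std defaults to the 1/T normalization

print(sharpe_ratio([0.01, -0.005, 0.02, 0.003]))
```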
More specifically, our recurrent reinforcement learning can be illustrated in Figure 2.

[Figure 2. Structure of the trading system: the input series and target price feed the trading system, whose trades/portfolio weights (fed back through a delay) generate profits/losses net of transaction cost; the reinforcement learning module evaluates them with U() and updates the system.]

The value function approach works well in many application cases, with convergence theorems existing under certain conditions [19]. However, the approach has many limitations. The most important shortcoming of value iteration is Bellman's curse of dimensionality, due to the discreteness of the state and action spaces [18,19]. Also, when Q-learning is extended to functional approximations, there are cases in which the Markov Decision Processes do not converge. Q-learning also suffers from instability in optimal policy selection even when tiny noise exists in the datasets [27-30].

In our simulation experiments, we find that the policy search algorithm is more stable and efficient in stochastic optimal control. A compelling advantage of RRL is that it produces real-valued actions (portfolio weights) naturally, without resorting to the discretization method needed in Q-learning. Our experimental results also indicate that RRL is more robust than value-based search on noisy datasets, and thus more adaptable to true market conditions.

Formally, the value of a policy \pi is its expected discounted cumulative reward:

V^{\pi}(x) = E_{\pi}\left[\sum_{t=0}^{\infty} r^{t} D(x_t, x_{t+1}, a_t) \,\middle|\, x_0 = x\right]    (8)

where \pi(x, a) is the probability of taking action a in state x, p_{xy}(a) is the transition probability from state x to state y under action a, D(x, y, a) is the intermediate reward, and r is the discount factor weighting the relative importance of future rewards against current rewards. The value iteration method is defined as:

V_{t+1}(x) = \max_{a} \sum_{y} p_{xy}(a)\left\{D(x, y, a) + r V_t(y)\right\}    (9)

This iteration converges to the optimal value function, defined as:

V^{*}(x) = \max_{\pi} V^{\pi}(x)    (10)

which satisfies Bellman's optimality equation:

V^{*}(x) = \max_{a} \sum_{y} p_{xy}(a)\left\{D(x, y, a) + r V^{*}(y)\right\}    (11)
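To illustrate the backup in equation (9), here is a minimal tabular value iteration sketch; the two-state, two-action MDP, with its transition probabilities and rewards, is a hypothetical toy example rather than the paper's calibration.

```python
import numpy as np

# Hypothetical MDP: states {rise, fall}, actions {buy, sell}.
# P[a, x, y] is the transition probability x -> y under action a.
P = np.array([[[0.6, 0.4], [0.5, 0.5]],
              [[0.6, 0.4], [0.5, 0.5]]])
# D[a, x, y] is the intermediate reward for that transition.
D = np.array([[[0.01, -0.01], [0.01, -0.01]],
              [[-0.01, 0.01], [-0.01, 0.01]]])
r = 0.95  # discount factor

V = np.zeros(2)
for _ in range(1000):
    # Eq. (9): V_{t+1}(x) = max_a sum_y P[a,x,y] * (D[a,x,y] + r * V[y])
    Q = (P * (D + r * V)).sum(axis=2)  # Q[a, x]
    V_new = Q.max(axis=0)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new
print(V)
```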
For Q-learning, the update rule for training a function approximator is based on the gradient of the squared error:

\frac{1}{2}\left(D(x, y, a) + r \max_{b} Q(y, b) - Q(x, a)\right)^2    (14)

which leads to the optimal choice of action:

a^{*} = \arg\max_{a} Q(x, a)    (15)

In Q-learning, we need to discretize the states: at each time step the price state is either rise or fall, i.e. {rise, fall}, and the corresponding action set is {buy, sell}. Due to the fixed combination at any time t of the relative prices of the securities and the risk-aversion assumption in the formalization of the problem, we have the transitional probability matrix:

\begin{pmatrix} \text{rise/buy} & \text{rise/sell} \\ \text{fall/buy} & \text{fall/sell} \end{pmatrix} = \begin{pmatrix} A & B \\ C & D \end{pmatrix}    (16)

where A = D = \sum_{b} F_{t-1}^{b} \left(z_t^{b} / z_{t-1}^{b} - 1\right) \mathbf{1}\{z_t^{b} / z_{t-1}^{b} \geq 1\}.
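A minimal tabular Q-learning sketch for this discretized {rise, fall} x {buy, sell} setting follows. The synthetic price path, the long/short reward, the learning rate, and the epsilon-greedy exploration are illustrative assumptions, not the paper's exact training setup.

```python
import numpy as np

rng = np.random.default_rng(1)
Q = np.zeros((2, 2))            # Q[state, action]: {rise, fall} x {buy, sell}
alpha, r, eps = 0.1, 0.95, 0.1  # learning rate, discount, exploration

z = np.cumprod(1 + rng.normal(0.0005, 0.01, 1200))  # synthetic price series
for t in range(1, len(z) - 1):
    x = 0 if z[t] >= z[t - 1] else 1       # current state: rise (0) / fall (1)
    a = rng.integers(2) if rng.random() < eps else int(Q[x].argmax())
    y = 0 if z[t + 1] >= z[t] else 1       # next state
    ret = z[t + 1] / z[t] - 1
    D = ret if a == 0 else -ret            # reward: long if buy, short if sell
    # Temporal-difference step on the squared error of eq. (14).
    Q[x, a] += alpha * (D + r * Q[y].max() - Q[x, a])

print(Q, Q.argmax(axis=1))                 # eq. (15): best action per state
```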
As we described above, the RRL algorithm makes the trading system F_t(\theta) real valued, and we can find the optimal parameter \theta that maximizes the criterion U_T after a sequence of T time steps [2,15]:

\frac{dU_T(\theta)}{d\theta} = \sum_{t=1}^{T} \frac{dU_T}{dR_t}\left\{\frac{dR_t}{dF_t}\frac{dF_t}{d\theta} + \frac{dR_t}{dF_{t-1}}\frac{dF_{t-1}}{d\theta}\right\}    (19)

Notice that dF_t / d\theta is a total derivative that is path dependent, and we use the approach of Back-Propagation Through Time (BPTT) to approximate the total derivative:

\frac{dF_t}{d\theta} = \frac{\partial F_t}{\partial \theta} + \frac{\partial F_t}{\partial F_{t-1}}\frac{dF_{t-1}}{d\theta}    (20)

For online learning, the corresponding per-period gradient is:

\frac{dU_t(\theta)}{d\theta} = \frac{dU_t}{dR_t}\left\{\frac{dR_t}{dF_t}\frac{dF_t}{d\theta_t} + \frac{dR_t}{dF_{t-1}}\frac{dF_{t-1}}{d\theta_t}\right\}    (21)

and the parameters are updated online using:

\Delta\theta_t = \rho \frac{dU_t(\theta_t)}{d\theta_t}    (22)
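Below is a minimal, self-contained sketch of this recurrent update for a single risky/riskless split, with F_t = tanh(theta . [r_{t-1}, F_{t-1}, 1]). The feature choice, the simplified additive per-period reward U_t = F_{t-1} r_t - delta |F_t - F_{t-1}|, and the learning rate rho are illustrative assumptions rather than the paper's exact specification.

```python
import numpy as np

rng = np.random.default_rng(2)
z = np.cumprod(1 + rng.normal(0.0005, 0.01, 1200))  # synthetic prices
ret = np.diff(z) / z[:-1]                           # one-period returns

theta = np.zeros(3)      # weights for the inputs [r_{t-1}, F_{t-1}, bias]
dF_dtheta = np.zeros(3)  # running total derivative dF_{t-1}/dtheta, eq. (20)
F_prev, rho, delta = 0.0, 0.1, 0.001

for t in range(1, len(ret)):
    x = np.array([ret[t - 1], F_prev, 1.0])
    F = np.tanh(theta @ x)                   # real-valued position in [-1, 1]
    # BPTT recursion of eq. (20); theta[1] multiplies F_{t-1} in the input.
    dF = (1 - F**2) * (x + theta[1] * dF_dtheta)
    # Simplified reward U_t = F_{t-1} r_t - delta |F_t - F_{t-1}| and partials.
    s = np.sign(F - F_prev)
    dU_dF, dU_dFprev = -delta * s, ret[t] + delta * s
    # Online gradient ascent, eqs. (21)-(22).
    theta += rho * (dU_dF * dF + dU_dFprev * dF_dtheta)
    dF_dtheta, F_prev = dF, F

print(theta)
```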
4. Results and Discussions

4.1. Risky Asset and Trading Signal

Simply put, the trading signal is the buying or selling decision of the investor, and it is determined in reverse by maximizing the expectation of the value function at each time step. In our trading model, the buy action is assigned the value 1 and the sell action the value -1.

[Figure 3. Trading signal (1: buying, -1: selling) and the weights of the risky and riskless assets over 1200 time steps.]
4.2. Q-learning with Different Value Functions

We conducted a group of simulations using different value functions. The resulting cumulative profits from Q-Learning are presented in Figure 4. Each graph shows the cumulative profit obtained using interval profit, Sharpe ratio, or derivative Sharpe ratio as the value function under one time-series simulation of the risky asset. We can see that the cumulative profit varies considerably when different value functions are applied. This is consistent with the unstable behavior of Q-learning. Empirical results indicate that a change of value function, or noise in the dataset, may greatly change the policy selection and thus affect the final performance.

[Figure 4. Trading performance using Q-Learning by maximizing derivative Sharpe ratio, Sharpe ratio, and interval profit.]

These simulations are carried out over 1200 time intervals, with the underlying risky asset being an artificial normalized market portfolio price produced by geometric Brownian motion. The simulation results indicate that the derivative Sharpe ratio outperforms the Sharpe ratio and interval profit in accumulating account wealth.
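A minimal sketch of such a synthetic price path follows; the drift and volatility values are illustrative assumptions, since the paper does not report its GBM parameters.

```python
import numpy as np

def gbm_path(n=1200, mu=0.05, sigma=0.2, dt=1/252, s0=1.0, seed=3):
    """Normalized price path from geometric Brownian motion:
    S_{t+1} = S_t * exp((mu - sigma^2/2) dt + sigma sqrt(dt) eps)."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n)
    log_steps = (mu - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * eps
    return s0 * np.exp(np.cumsum(log_steps))

prices = gbm_path()
print(prices[:5])
```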
Economically speaking, the derivative Sharpe ratio is analogous to marginal utility: it measures the willingness to bear risk per unit increment of the Sharpe ratio. This value function incorporates not only the variance of the dataset but also the risk aversion of the investor, and is thus more reasonable for probabilistic modeling.

When the Q-learning algorithm is applied to noisy datasets, properly choosing the value function plays a key role in the stability of performance. As we discussed, in the cases of the Sharpe ratio and interval profit, the algorithmic trading is not profitable and varies considerably when exposed to different value scenarios of the underlying asset. Another point worth noticing here is that transaction costs may accrue.
4.4. Recurrent Reinforcement Learning Using Different Lengths of Backward Steps

Formula (18) gives the stochastic batch gradient ascent algorithm. In practice, we do not include all the derivative terms up to time T; instead, we take only several previous temporal steps, for computational savings. We studied the effect of this omission on investment performance by taking different numbers of backward time steps in the optimization.
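One way to realize this truncation, as a hedged sketch building on the recurrent update above: unroll the chain in eq. (20) over only the last k steps, setting the sensitivity beyond the window to zero. The window lengths and the toy inputs below are assumptions for illustration.

```python
import numpy as np
from collections import deque

def truncated_sensitivity(history, theta, k):
    """Approximate dF_t/dtheta by unrolling eq. (20) over only the
    last k steps, with the sensitivity beyond the window set to zero."""
    dF = np.zeros_like(theta)
    for x, F in list(history)[-k:]:          # oldest -> newest
        dF = (1 - F**2) * (x + theta[1] * dF)
    return dF

# Toy demonstration: the estimate changes less and less as k grows.
rng = np.random.default_rng(4)
theta = rng.normal(size=3)
history, F_prev = deque(maxlen=50), 0.0
for r in rng.normal(0.0, 0.01, 200):
    x = np.array([r, F_prev, 1.0])
    F = np.tanh(theta @ x)
    history.append((x, F))
    F_prev = F
for k in (1, 2, 5, 10, 20):
    print(k, truncated_sensitivity(history, theta, k))
```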
[Figure 5. Investment performance of Recurrent Reinforcement Learning using stochastic batch gradient ascent.]

5. Conclusion

We have proposed and tested a trading system based on reinforcement learning. We used value iteration and policy iteration, respectively, to test algorithmic trading performance with Q-learning and Recurrent Reinforcement Learning. In the implementation of Q-learning, we used interval profit, Sharpe ratio, and derivative Sharpe ratio as the value functions to optimize at each time step, and tested their performance. Our empirical study indicates that value function selection is very important for achieving a stable learning algorithm under value iteration, especially when applied to noisy datasets. A value function with information on the noise (variance) and on dynamic properties (marginal utility) performs better in the learning algorithm than stationary and fixed value functions.
In general, RRL with policy iteration outperforms Q-Learning in terms of stability and computational convenience. The RRL algorithm provides more flexibility in choosing objective functions to optimize using stochastic batch gradient ascent. Including more backward information helps the learning process, but the increment in performance tends to converge as more previous temporal steps are incorporated.

References

[1] Moody, J., Saffell, M., Liao, Y. & Wu, L. (1998). Reinforcement learning for trading systems and portfolios: Immediate vs future rewards.
[2] V. S. Borkar and S. P. Meyn. The O.D.E. method for convergence of stochastic approximation and reinforcement learning. SIAM J. Control, 38(2):447-469, 2000.
[3] Michael Kearns and Satinder P. Singh. Finite-sample convergence rates for Q-learning and indirect algorithms.
[4] F. Beleznay, T. Grobler, and C. Szepesvari. Comparing value-function estimation algorithms in undiscounted problems. Technical Report TR-99-02, Mindmaker Ltd, 1999.
[5] F. Sehnke, C. Osendorfer, T. Rückstieß, A. Graves, J. Peters, and J. Schmidhuber. Policy gradients with parameter-based exploration for control. In V. Kurkova, R. Neruda, and J. Koutnik, editors, Proceedings of the International Conference on Artificial Neural Networks ICANN-2008.
[6] J. N. Tsitsiklis and B. Van Roy. Optimal stopping of Markov processes: Hilbert space theory, approximation algorithms, and an application to pricing high-dimensional financial derivatives. IEEE Transactions on Automatic Control, 44(10):1840-1851, October 1999.
[7] F. J. Gomez, J. Togelius, and J. Schmidhuber. Measuring and optimizing behavioral complexity for evolutionary reinforcement learning. Proceedings of the 19th International Conference on Artificial Neural Networks (ICANN-09), Cyprus, 2009.
[8] J. N. Tsitsiklis and B. Van Roy. Regression methods for pricing complex American-style options. IEEE Transactions on Neural Networks, 12(4), July 2001.
[9] D. Nawrocki. Optimal algorithms and lower partial moment: Ex post results. Applied Economics, 23:465-470, 1991.
[10] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley and Sons, Inc., New York, NY, 1994.
[11] Xu, X., Hu, D., & Lu, X. (2007). Kernel-based least squares policy iteration for reinforcement learning. IEEE Transactions on Neural Networks, pp. 973-992.
[12] Kang, J., Choey, M. & Weigend, A. (1997). Nonlinear trading models and asset allocation based on Sharpe ratio. Decision Technology for Financial Engineering, World Scientific, London.
[13] Darrell Duffie. Dynamic Asset Pricing Theory. Princeton University Press, 2nd edition, 1996.
[14] W. Brock, J. Lakonishok, and B. LeBaron. Simple technical trading rules and the stochastic properties of stock returns. Journal of Finance, 47:1731-1764, 1992.
[15] Moody, J. & Wu, L. (1996). Optimization of trading systems and portfolios. Neural Networks in the Capital Markets Conference Record, Caltech, Pasadena.
[16] D. Wierstra, A. Foerster, J. Peters, and J. Schmidhuber. Recurrent policy gradients. Journal of Algorithms, 2009.
[17] Neuneier, R. (1996). Optimal asset allocation using adaptive dynamic programming. Advances in Neural Information Processing Systems 8, MIT Press.
[18] John Moody, Lizhong Wu, Yuansong Liao, and Matthew Saffell. Performance functions and reinforcement learning for trading systems and portfolios. Journal of Forecasting, 17:441-470, 1998.
[19] R. Sutton and A. Barto. Reinforcement Learning. MIT Press, Cambridge, MA, 1998.
[20] C. J. Watkins and P. Dayan. Technical note: Q-Learning. Machine Learning, 8(3):279-292, 1992.
[21] Leslie Pack Kaelbling, Michael L. Littman, and Andrew W. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4, 1996.
[22] Parr, R., Li, L., Taylor, G., Painter-Wakefield, C., & Littman, M. (2008). An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning. International Conference on Machine Learning, pp. 752-759.
[23] Robert H. Crites and Andrew G. Barto. Improving elevator performance using reinforcement learning. In Advances in Neural Information Processing Systems.
[24] Peter Marbach and John N. Tsitsiklis. Simulation-based optimization of Markov reward processes. In IEEE Conference on Decision and Control, 1998.
[25] D. Wierstra, A. Foerster, J. Peters, and J. Schmidhuber. Recurrent policy gradients. Journal of Algorithms, 2009.
[26] Leemon Baird and Andrew Moore. Gradient descent for general reinforcement learning. In Advances in Neural Information Processing Systems, MIT Press, 1999.
[27] A. W. Moore and C. G. Atkeson. Prioritized sweeping: Reinforcement learning with less data and less real time. Machine Learning, 13:103-130, 1993.
[28] Mahadevan, S., & Maggioni, M. (2006). Proto-value functions: A Laplacian framework for learning representation and control in Markov decision processes. Technical Report, University of Massachusetts.
[29] Richard S. Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, vol. 12, pp. 1057-1063, MIT Press, 2000.
[30] John Moody and Matthew Saffell. Reinforcement learning for trading. In Advances in Neural Information Processing Systems, vol. 11, pp. 917-923, MIT Press, 1999.