We propose to train trading systems by optimizing financial objective functions via reinforcement learning. The performance functions that we consider as value functions are profit or wealth, the Sharpe ratio, and our recently proposed differential Sharpe ratio for online learning. In Moody & Wu (1997), we presented empirical results from controlled experiments that demonstrated the advantages of reinforcement learning relative to supervised learning. Here we extend our previous work to compare Q-Learning to a reinforcement learning technique based on real-time recurrent learning (RTRL) that maximizes immediate reward.
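For readers unfamiliar with the differential Sharpe ratio as an immediate reward signal, the following minimal Python sketch implements one common exponential-moving-average formulation of it; the function name, the adaptation rate eta, and the zero initialization are illustrative assumptions, not details taken from this abstract.

def differential_sharpe_ratio(returns, eta=0.01):
    # A and B are exponential moving estimates of E[R] and E[R^2]
    # (assumed initialization at zero; eta is an illustrative adaptation rate).
    A, B = 0.0, 0.0
    rewards = []
    for R in returns:
        dA = R - A                      # innovation in the first moment
        dB = R * R - B                  # innovation in the second moment
        denom = (B - A * A) ** 1.5      # (variance estimate)^(3/2)
        # Differential Sharpe ratio: sensitivity of the Sharpe ratio to the
        # latest return, usable as an immediate reward; zero until the
        # variance estimate becomes positive.
        D = (B * dA - 0.5 * A * dB) / denom if denom > 0 else 0.0
        rewards.append(D)
        A += eta * dA                   # update the moving averages
        B += eta * dB
    return rewards

# Example: per-period rewards for a short sequence of hypothetical returns.
print(differential_sharpe_ratio([0.01, -0.005, 0.02, 0.0, 0.015]))

Because each D depends only on the latest return and the running moments, it can be fed to an online learner such as the RTRL-based trader at every time step, rather than waiting for an end-of-period Sharpe ratio.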
Our simulation results include a spectacular demonstration of the presence of predictability in the monthly Standard & Poor's 500 stock index for the 25-year period 1970 through 1994. Our reinforcement trader achieves a simulated out-of-sample profit of over 4000% for this period, compared to a return of about 1300% for a buy-and-hold strategy (with dividends reinvested). This superior result is achieved with substantially lower risk.
[1] Gerald Tesauro, et al. Neurogammon Wins Computer Olympiad, Neural Computation, 1989.
[2] Ben J. A. Kröse, et al. Learning from delayed rewards, Robotics Auton. Syst., 1995.
[3] Thomas G. Dietterich, et al. High-Performance Job-Shop Scheduling With A Time-Delay TD(λ) Network, NIPS, 1995.
[4] Andrew G. Barto, et al. Improving Elevator Performance Using Reinforcement Learning, NIPS, 1995.
[5] Ralph Neuneier, et al. Optimal Asset Allocation using Adaptive Dynamic Programming, NIPS, 1995.
[6] Lizhong Wu, et al. Optimization of trading systems and portfolios, Proceedings of the IEEE/IAFE 1997 Computational Intelligence for Financial Engineering (CIFEr), 1997.
[7] J. Moody, et al. Performance functions and reinforcement learning for trading systems and portfolios, 1998.