A Cultural Algorithm for POMDPs from Stochastic Inventory Control

Reinforcement Learning algorithms such as SARSA with an eligibility trace, and Evolutionary Computation methods such as genetic algorithms, are competing approaches to solving Partially Observable Markov Decision Processes (POMDPs), which occur in many fields of Artificial Intelligence. A powerful form of evolutionary algorithm that has not previously been applied to POMDPs is the cultural algorithm, in which evolving agents share knowledge in a belief space that is used to guide their evolution. We describe a cultural algorithm for POMDPs that hybridises SARSA with a noisy genetic algorithm, and inherits the latter's convergence properties. Its belief space is a common set of state-action values that are updated during genetic exploration and, conversely, used to modify chromosomes. We use it to solve problems from stochastic inventory control by finding memoryless policies for nondeterministic POMDPs. Neither SARSA nor the genetic algorithm dominates the other on these problems, but the cultural algorithm outperforms the genetic algorithm, and on highly non-Markovian instances also outperforms SARSA.
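To make the hybrid concrete, the following is a minimal sketch of the loop the abstract describes: chromosomes encode memoryless policies (one action per observation), a shared Q-table plays the role of the belief space and is SARSA-updated during the noisy fitness rollouts, and an influence step occasionally overwrites genes with the greedy action from that Q-table. The toy environment, population size, learning rate, and influence probability are illustrative assumptions, not the paper's actual benchmark or parameters.

import random

# Assumed toy sizes and parameters, chosen only for illustration.
N_OBS, N_ACT = 5, 3           # observations and actions of a toy POMDP
POP, GENS, EVALS = 20, 30, 5  # noisy GA: each chromosome averaged over EVALS rollouts
ALPHA, GAMMA, INFLUENCE = 0.1, 0.95, 0.2

def step(obs, act):
    """Toy nondeterministic stand-in for a POMDP step: returns (next_obs, reward)."""
    next_obs = random.randrange(N_OBS)
    reward = 1.0 if act == obs % N_ACT else random.uniform(-0.2, 0.2)
    return next_obs, reward

def rollout(policy, q, horizon=50):
    """Evaluate a memoryless policy; SARSA-update the shared belief space q along the way."""
    obs, total = random.randrange(N_OBS), 0.0
    act = policy[obs]
    for _ in range(horizon):
        nxt, r = step(obs, act)
        nact = policy[nxt]
        q[obs][act] += ALPHA * (r + GAMMA * q[nxt][nact] - q[obs][act])  # SARSA update
        total += r
        obs, act = nxt, nact
    return total

def influence(policy, q):
    """Belief-space influence: nudge some genes toward the greedy action under q."""
    return [max(range(N_ACT), key=lambda a: q[o][a]) if random.random() < INFLUENCE
            else policy[o] for o in range(N_OBS)]

def cultural_algorithm():
    q = [[0.0] * N_ACT for _ in range(N_OBS)]  # shared belief space (state-action values)
    pop = [[random.randrange(N_ACT) for _ in range(N_OBS)] for _ in range(POP)]
    for _ in range(GENS):
        # Noisy fitness: average several stochastic rollouts per chromosome.
        fit = [sum(rollout(p, q) for _ in range(EVALS)) / EVALS for p in pop]
        ranked = [p for _, p in sorted(zip(fit, pop), key=lambda t: -t[0])]
        parents = ranked[:POP // 2]
        children = []
        while len(children) < POP - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, N_OBS)
            child = a[:cut] + b[cut:]                  # one-point crossover
            if random.random() < 0.3:                  # mutation
                child[random.randrange(N_OBS)] = random.randrange(N_ACT)
            children.append(influence(child, q))       # influence from the belief space
        pop = parents + children
    return ranked[0], q

if __name__ == "__main__":
    best, _ = cultural_algorithm()
    print("best memoryless policy found:", best)

In this sketch the belief space is read and written exactly as the abstract states: SARSA updates it during every fitness evaluation, and the influence step modifies offspring chromosomes toward its greedy policy, while fitness averaging over repeated rollouts preserves the noisy genetic algorithm's sampling scheme.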
