Optimal Policies for a Class of Restless Multiarmed Bandit Scheduling Problems with Applications to Sensor Management

Consider the Markov decision problems (MDPs) arising in intelligence, surveillance, and reconnaissance, where one selects among different targets for observation so as to track their positions and classify them from noisy data [9], [10]; in medicine, where one selects among different regimens to treat a patient [1]; and in computer network security, where one selects computer processes for observation so as to find those exhibiting malicious behavior [6]. These MDPs share a special structure. Specifically, they are discrete-time MDPs in which one controls the evolution of a set of Markov processes, and there are two possible transition probability functions for each process. The control at a given time selects a subset of processes, which then transition independently according to the controlled transition probability; the remaining processes transition independently according to the uncontrolled transition probability. Rewards are additive across processes and accumulate over time. The control problem is to determine a policy that selects controls so as to maximize expected rewards. MDPs with this structure have been termed restless bandit problems [15].

Our particular interest in such problems is in developing methods for deriving optimal solutions to them. Such solutions may be important in their own right as a control solution, or may be useful for analyzing a problem in the process of developing a good suboptimal controller. The restless bandit problem is a variation of a classical stochastic scheduling problem, the multiarmed bandit problem, and differs from the restless bandit problems considered here in two key respects. First, the states of the unselected processes in the multiarmed bandit problem do not change. Second, the rewards in a multiarmed bandit problem are accumulated over an infinite horizon, with discounting of future rewards. The latter is a significant difference because the time remaining in the horizon is essentially a component of the state: it does not change for the infinite-horizon multiarmed bandit problem, but it does change for the finite-horizon restless bandit problems considered here.

A number of techniques have previously been developed for computing solutions to restless bandit problems. For example, index rules have been shown to optimally solve classical multiarmed bandit scheduling problems [2], [4]. Generalizations of this result have been conjectured, and some of them have been proven to apply to other classes of restless bandit problems [14], [15]. Proofs establishing the optimality of controls for finite-horizon restless bandit problems with particular reward structures have also been presented [1], [3], [5]. Each of these results describes a set of conditions under which a control is optimal for a restless bandit problem. This paper introduces a set of novel conditions that are sufficient for a control policy to be optimal for a finite-horizon MDP, and the conditions are readily verified.
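To make the model concrete, the following sketch (ours, not from the paper) solves a toy finite-horizon restless bandit exactly by backward induction. The two-state processes, the controlled and uncontrolled transition kernels, and the per-process reward are all illustrative assumptions; the sketch only illustrates the structure described above, where the selected subset transitions under the controlled kernel, the rest under the uncontrolled kernel, and rewards are additive across processes.

```python
# A minimal finite-horizon restless bandit solved by backward induction.
# All numbers (kernels, reward, sizes) are assumptions for illustration.
from itertools import combinations, product

N, K, T = 3, 1, 4          # processes, selections per stage, horizon length
STATES = (0, 1)            # per-process state (e.g., 0 = lost, 1 = tracked)

# Assumed transition kernels p[a][s][s']: a = 1 if selected (controlled
# kernel), a = 0 otherwise (uncontrolled kernel).
p = {
    1: {0: {0: 0.2, 1: 0.8}, 1: {0: 0.1, 1: 0.9}},   # observing helps
    0: {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}},   # unobserved decays
}

def reward(x):
    """Additive stage reward across processes (assumed: 1 per tracked process)."""
    return float(sum(x))

def joint_next(x, sel):
    """Yield (prob, next_state): processes transition independently, with the
    selected subset using the controlled kernel."""
    dists = [p[1 if i in sel else 0][x[i]] for i in range(N)]
    for nxt in product(STATES, repeat=N):
        pr = 1.0
        for i in range(N):
            pr *= dists[i][nxt[i]]
        yield pr, nxt

# Backward induction: V_T = 0, then V_t(x) = max over K-subsets of
# stage reward plus expected continuation value.
V = {x: 0.0 for x in product(STATES, repeat=N)}
policy = {}
for t in reversed(range(T)):
    newV = {}
    for x in V:
        best, best_sel = None, None
        for sel in combinations(range(N), K):
            q = reward(x) + sum(pr * V[nxt] for pr, nxt in joint_next(x, sel))
            if best is None or q > best:
                best, best_sel = q, sel
        newV[x] = best
        policy[(t, x)] = best_sel
    V = newV
```

This brute-force enumeration is exponential in the number of processes, which is exactly why structural results of the kind discussed here, conditions under which a simple policy is provably optimal, are valuable: they replace the enumeration with a directly verifiable rule.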

[1] P. Whittle. Restless Bandits: Activity Allocation in a Changing World, 1988.

[2] D. Simon et al. Kalman filtering with state equality constraints, 2002.

[3] David Q. Mayne et al. Constrained state estimation for nonlinear discrete-time systems: stability and moving horizon approximations, 2003, IEEE Trans. Autom. Control.

[4] Robin J. Evans et al. Simulation-Based Optimal Sensor Scheduling with Application to Observer Trajectory Planning, 2005, Proceedings of the 44th IEEE Conference on Decision and Control.

[5] Christof Paar et al. Comparison of arithmetic architectures for Reed-Solomon decoders in reconfigurable hardware, 1997, Proceedings of the 5th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (Cat. No. 97TB100186).

[7] Robin J. Evans et al. Hidden Markov model multiarm bandits: a methodology for beam scheduling in multitarget tracking, 2001, IEEE Trans. Signal Process.

[8] David L. Kleinman et al. An Investigation of ISR Coordination and Information Presentation Strategies to Support Expeditionary Strike Groups, 2007.

[9] M. Ulmke et al. Multi hypothesis track extraction and maintenance of GMTI sensor data, 2005, 7th International Conference on Information Fusion.

[10] EWRD et al. Optimal Policy for Scheduling of Gauss-Markov Systems, 2004.

[11] Dimitri P. Bertsekas et al. An Auction Algorithm for Shortest Paths, 1991, SIAM J. Optim.

[12] Dimitri P. Bertsekas et al. Network Optimization: Continuous and Discrete Models, 1998.

[13] Ingemar J. Cox et al. On Finding Ranked Assignments With Application to Multi-Target Tracking and Motion Correspondence, 1995.

[14] Edsger W. Dijkstra et al. A note on two problems in connexion with graphs, 1959, Numerische Mathematik.

[15] Darryl Morrell et al. Sensor scheduling and efficient algorithm implementation for target tracking, 2006.

[16] M. K. Schneider et al. Closing the loop in sensor fusion systems: stochastic dynamic programming approaches, 2004, Proceedings of the 2004 American Control Conference.

[17] Dimitri P. Bertsekas et al. Dynamic Programming and Optimal Control, Two Volume Set, 1995.

[18] S. Ferrari et al. Demining sensor modeling and feature-level fusion by Bayesian networks, 2006, IEEE Sensors Journal.

[19] K. G. Murty. An Algorithm for Ranking All the Assignments in Order of Increasing Cost, 1968.

[20] Donald A. Berry et al. Bandit Problems: Sequential Allocation of Experiments, 1986.

[21] Erik Blasch et al. GRoup IMM Tracking utilizing Track and Identification Fusion, 2001.

[22] Branko Ristic et al. A variable structure multiple model particle filter for GMTI tracking, 2002, Proceedings of the Fifth International Conference on Information Fusion (FUSION 2002).

[23] M. L. Miller et al. Optimizing Murty's ranked assignment method, 1997, IEEE Transactions on Aerospace and Electronic Systems.

[24] J. Gittins. Bandit processes and dynamic allocation indices, 1979.

[25] David M. Lin et al. Constrained Optimization for Joint Estimation of Channel Biases and Angles of Arrival for Small GPS Antenna Arrays, 2004.

[26] Brian J. Noe et al. Variable structure interacting multiple-model filter (VS-IMM) for tracking targets with transportation network constraints, 2000, SPIE Defense + Commercial Sensing.

[27] Erik Blasch et al. Constrained Estimation for GPS/Digital Map Integration, 2007.

[28] S. Musick et al. Chasing the elusive sensor manager, 1994, Proceedings of the National Aerospace and Electronics Conference (NAECON '94).

[29] Krishna R. Pattipati et al. Anomaly Detection via Feature-Aided Tracking and Hidden Markov Models, 2007, IEEE Aerospace Conference.

[30] O. Patrick Kreidl et al. Feedback control applied to survivability: a host-based autonomic defense system, 2004, IEEE Transactions on Reliability.

[31] Chandrika Kamath et al. Robust techniques for background subtraction in urban traffic video, 2004, IS&T/SPIE Electronic Imaging.

[32] Lawrence A. Klein et al. Sensor Technologies and Data Requirements for ITS, 2001.

[33] Stuart J. Russell et al. Dynamic Bayesian networks: representation, inference and learning, 2002.

[34] Fabio Gagliardi Cozman et al. Adaptive Online Learning of Bayesian Network Parameters, 2001.

[35] J. L. Massey et al. Theory and practice of error control codes, 1986, Proceedings of the IEEE.

[36] David Heckerman et al. A Tutorial on Learning with Bayesian Networks, 1999, Innovations in Bayesian Networks.

[37] Peter J. Shea et al. Improved state estimation through use of roads in ground tracking, 2000, SPIE Defense + Commercial Sensing.

[38] R. Weber et al. On an Index Policy for Restless Bandits, 1990.

[39] David A. Castañón. Optimal search strategies in dynamic hypothesis testing, 1995, IEEE Trans. Syst. Man Cybern.

[40] Krishna R. Pattipati et al. Rollout strategies for sequential fault diagnosis, 2002, Proceedings, IEEE AUTOTESTCON.

[41] Krishna R. Pattipati et al. Ground target tracking with variable structure IMM estimator, 2000, IEEE Trans. Aerosp. Electron. Syst.

[42] Parag K. Lala et al. Fault Tolerant and Fault Testable Hardware Design, 1985.

[43] D. Kleinman et al. Optimal measurement scheduling for state estimation, 1992, Proceedings of the 1992 IEEE International Conference on Systems, Man, and Cybernetics.

[44] Stig K. Andersen et al. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, 1991.

[45] Dimitri P. Bertsekas et al. Auction algorithms for network flow problems: A tutorial introduction, 1992, Comput. Optim. Appl.

[46] Chun Yang et al. Kalman Filtering with Nonlinear State Constraints, 2009.

[47] Vikram Krishnamurthy et al. Algorithms for optimal scheduling and management of hidden Markov model sensors, 2002, IEEE Trans. Signal Process.

[48] A. Volgenant et al. A shortest augmenting path algorithm for dense and sparse linear assignment problems, 1987, Computing.

[49] Subhash Challa et al. An Introduction to Bayesian and Dempster-Shafer Data Fusion, 2003.

[50] L. Baum et al. A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains, 1970.

[51] Josep Roure Alcobé et al. Incremental methods for Bayesian network learning, 1999.

[52] Chee Chong et al. A rollout algorithm to coordinate multiple sensor resources to track and discriminate targets, 2006, SPIE Defense + Commercial Sensing.

[53] R. B. Washburn et al. Stochastic dynamic programming based approaches to sensor resource management, 2002, Proceedings of the Fifth International Conference on Information Fusion (FUSION 2002).

[54] Alf Isaksson et al. On sensor scheduling via information theoretic criteria, 1999, Proceedings of the 1999 American Control Conference.

[55] D. A. Castanon et al. Rollout Algorithms for Stochastic Scheduling Problems, 1998, Proceedings of the 37th IEEE Conference on Decision and Control.

[56] C. Yang et al. Nonlinear constrained tracking of targets on roads, 2005, 7th International Conference on Information Fusion.

[57] Y. Zhang et al. Active and dynamic information fusion for multisensor systems with dynamic Bayesian networks, 2006, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics).

[58] Robert R. Bitmead et al. State estimation for linear systems with state equality constraints, 2007, Automatica.

[59] Dan Simon et al. A game theory approach to constrained minimax state estimation, 2006, IEEE Transactions on Signal Processing.