Reward-based Monte Carlo-Bayesian reinforcement learning for cyber preventive maintenance

Abstract This article considers a preventive-maintenance problem related to cyber security in universities. A Bayesian reinforcement learning (BRL) problem is formulated using limited data from scan results and intrusion detection system warnings. The median estimated learning time (MELT) measure is introduced to evaluate how quickly a control system effectively eliminates parametric uncertainty, i.e., concentrates posterior probability on a single scenario. A numerical study demonstrates that Monte Carlo BRL, enhanced with Latin hypercube sampling (LHS) for scenario generation, identical-systems multi-task learning, and reward-based learning, achieves shorter MELT values ("faster" learning) and improved objective values compared with alternatives. Rigorous results establish the optimality of the derived control strategies and show that optimal learning is possible under steady-state assumptions. In addition, a real-world case study of policies for patching critical vulnerabilities on Linux servers generates insights, including the potential to reduce expenditure per host by mandating compensating controls for critical vulnerabilities.
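The MELT idea described above can be illustrated with a minimal simulation sketch. This is not the authors' formulation: it assumes a hypothetical Bernoulli observation model, a hypothetical posterior-mass threshold of 0.95 as the criterion for "probability concentrated on a single scenario", and helper names (`learning_time`, `melt`) invented here for illustration. MELT is then the median, over replications, of the first time step at which the posterior exceeds that threshold.

```python
import random

def learning_time(true_idx, scenario_probs, threshold=0.95, max_t=10_000, rng=None):
    """Simulate Bayesian updating over a discrete set of candidate scenarios.

    Returns the first time step at which posterior mass on any one scenario
    exceeds `threshold` (a stand-in for eliminating parametric uncertainty).
    Observations are hypothetical Bernoulli outcomes drawn from the true
    scenario's probability.
    """
    rng = rng or random.Random()
    k = len(scenario_probs)
    posterior = [1.0 / k] * k                      # uniform prior over scenarios
    for t in range(1, max_t + 1):
        y = 1 if rng.random() < scenario_probs[true_idx] else 0
        # Bayes update: weight each scenario by its likelihood of y, renormalize.
        lik = [p if y else 1.0 - p for p in scenario_probs]
        posterior = [pr * l for pr, l in zip(posterior, lik)]
        z = sum(posterior)
        posterior = [pr / z for pr in posterior]
        if max(posterior) > threshold:
            return t
    return max_t

def melt(scenario_probs, true_idx=0, reps=200, seed=1):
    """Median estimated learning time over independent replications."""
    master = random.Random(seed)
    times = sorted(
        learning_time(true_idx, scenario_probs,
                      rng=random.Random(master.random()))
        for _ in range(reps)
    )
    n = len(times)
    return (times[n // 2 - 1] + times[n // 2]) / 2 if n % 2 == 0 else times[n // 2]

# Example: three candidate scenarios for an observable failure/compromise rate.
print(melt([0.1, 0.4, 0.7], true_idx=1))
```

Under this sketch, a smaller MELT corresponds to the "faster" learning the abstract refers to; well-separated scenario parameters or richer observations shorten it.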
