Commentary: Perspectives on Stochastic Optimization Over Time

Motivated by the discussion in Powell (2010), I offer a few comments on the interactions and merging of stochastic optimization research in artificial intelligence (AI) and operations research (OR), a process that has been ongoing for more than a decade. In a broad sense, decision making over time and under uncertainty is a core subject in several fields that can perhaps be described collectively as the “information and decision sciences,” which include operations research, systems and control theory, and artificial intelligence. These different fields and communities have much to offer to each other. Operations research and systems and control theory have been close for a long time. In both fields, the predominant description of uncertainty involves probabilistic models, and the goal is usually one of optimizing an objective function subject to constraints. Any differences between these two fields are due to “culture” (different departments and conferences), motivating applications (physics-based versus service-oriented systems), and technical taste (e.g., discrete versus continuous state and time), and yet the legacy of Bellman is equally strong on both sides.

AI is a little different. Originally driven by the lofty goal of understanding and reproducing “intelligence,” AI involves an eclectic mix of logic, discrete mathematics, heuristics, and computation, with a focus on problems too complex to be amenable to mainstream methods such as linear or dynamic programming. Today, however, there is a notable convergence of the “modern approach” to AI (as exemplified by Russell and Norvig 1995) and the more traditional methodologies of applied mathematics. Quite often, the clever heuristic approaches developed in AI to deal with complex problems are best understood, and then enhanced, by deploying suitably adapted classical tools. Decision making over time and under uncertainty is a prominent example of such convergence: indeed, the methods of “reinforcement learning” are best understood as methods of approximate dynamic programming (ADP). This connection is certainly intellectually satisfying. More important, it is valuable because insights and approaches developed in one field or community can be (and have been) transferred to another.

A central idea connecting the two fields is the “heuristic evaluation function,” initially introduced in AI game-playing programs. The ideal evaluation function, leading to optimal play, is nothing but Bellman’s optimal value function, in principle computable by dynamic programming (DP) algorithms and their extensions to the context of Markov games. For difficult problems where the optimal value function is practically impossible to compute, value function approximations become useful, potentially leading to near-optimal performance. Such approximations can be developed in an ad hoc manner or through suitable approximate dynamic programming methods. The latter approach has opened up a vast range of possibilities and an active research area.

With this common foundation identified, it is worth elaborating on some differences of emphasis among the different communities. One key distinction concerns “online” and “offline” methods. Reinforcement learning has been motivated in terms of agents that act over time, observe the consequences of their decisions, and try to improve their decision-making rule (or “policy,” in DP language) on the basis of accumulated experience.
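To make this online viewpoint concrete, the sketch below shows one standard reinforcement learning method of this kind, tabular Q-learning, improving its decision rule purely from observed transitions on a small chain problem. The toy environment, parameter values, and variable names are illustrative assumptions introduced here, not anything drawn from the commentary itself.

```python
# A minimal sketch of online, model-free learning: tabular Q-learning on a
# hypothetical five-state chain. The agent only observes transitions and
# rewards; it never inspects the transition model directly.
import random

N_STATES = 5            # states 0..4; reaching state 4 ends an episode
ACTIONS = (0, 1)        # 0 = step left, 1 = step right
ALPHA, GAMMA, EPS = 0.1, 0.95, 0.3

def step(state, action):
    """Simulate one transition of the toy chain (plays the role of the unknown system)."""
    nxt = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward, nxt == N_STATES - 1

Q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]

for episode in range(500):
    s, done = 0, False
    while not done:
        # Epsilon-greedy choice: mostly exploit current estimates, sometimes explore.
        a = random.choice(ACTIONS) if random.random() < EPS else max(ACTIONS, key=lambda act: Q[s][act])
        s_next, r, done = step(s, a)
        # Online update: move Q(s, a) toward the sampled Bellman target.
        target = r if done else r + GAMMA * max(Q[s_next])
        Q[s][a] += ALPHA * (target - Q[s][a])
        s = s_next

# Greedy value estimates; for nonterminal states these approach GAMMA ** (steps to goal - 1).
print([round(max(q), 3) for q in Q])
```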
Viewed this way, reinforcement learning is also closely related to the problem of adaptive control in systems and control theory. A typical example is provided by a poorly modeled robot operating in a poorly modeled environment that “learns” online and incrementally improves its policy and performance. Learning online is unavoidable in “model-free” problems, where an analytical or simulation model is absent. On the other hand, most operations research applications of ADP are not of the online or model-free