Reinforcement learning for job-shop scheduling

This dissertation studies the application of reinforcement learning algorithms to automatically discover good domain-specific heuristics for job-shop scheduling, focusing on the NASA space shuttle payload processing problem. The problem requires scheduling a set of tasks to satisfy a set of temporal and resource constraints while minimizing the total length (makespan) of the schedule. The approach employs a repair-based scheduling problem space: it starts with a critical-path schedule and incrementally repairs constraint violations, with the goal of finding a short conflict-free schedule. The temporal difference learning algorithm TD($\lambda$) is applied to train a neural network that learns a heuristic evaluation function over schedules. This learned evaluation function is then used by a one-step lookahead search procedure to solve new scheduling problems.

Several important issues affecting the success and efficiency of learning have been identified and studied in depth, including schedule representation, network architecture, and learning strategy. A number of modifications to the TD($\lambda$) algorithm are developed to improve learning performance. Learning is investigated with both hand-engineered features and raw features; for the latter, a time-delay neural network architecture is developed to extract features from schedules of irregular length.

The learning approach is evaluated on synthetic problems and on problems from a NASA space shuttle payload processing task. The evaluation function is learned on small problems and then applied to solve larger ones. Both learning-based schedulers, one using hand-engineered features and one using raw features, outperform the best previous algorithm for this task, Zweben's iterative repair method.

To understand why TD learning works in this application, several performance measures are employed to investigate learning behavior. These measures verify that TD learning properly captures the evaluation function. We conclude that the combination of TD learning, a good set of features, and an appropriate neural network is the key to this success, and that reinforcement learning methods have the potential to quickly find high-quality solutions to other combinatorial optimization problems.
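To make the learning setup concrete, the following is a minimal sketch of TD($\lambda$) with eligibility traces for training a schedule evaluation function. It uses a linear value function and a hypothetical feature map `phi`; the dissertation trains a neural network on richer features, but the update has the same structure. The learning rate, trace decay, and reward scheme shown are illustrative assumptions, not the dissertation's settings.

```python
import numpy as np

N_FEATURES = 8   # assumed size of the feature vector (illustrative)
ALPHA = 0.01     # learning rate (assumption)
LAMBDA = 0.7     # eligibility-trace decay (assumption)
GAMMA = 1.0      # undiscounted, episodic setting

def phi(schedule):
    """Hypothetical feature map; here a 'schedule' is already a vector."""
    return np.asarray(schedule, dtype=float)

def td_lambda_episode(w, trajectory, rewards):
    """Run one TD(lambda) episode and return the updated weight vector.

    trajectory: schedules s_0 .. s_T, where s_T is conflict-free (terminal)
    rewards:    r_1 .. r_T, e.g. a small cost per repair step plus a final
                penalty tied to the makespan of s_T (an assumption; the
                dissertation's reward structure differs in detail)
    """
    w = w.copy()
    e = np.zeros_like(w)                              # eligibility trace
    for t in range(len(trajectory) - 1):
        x = phi(trajectory[t])
        terminal = (t + 1 == len(trajectory) - 1)
        v_next = 0.0 if terminal else w @ phi(trajectory[t + 1])
        delta = rewards[t] + GAMMA * v_next - w @ x   # TD error
        e = GAMMA * LAMBDA * e + x                    # decay and accumulate
        w = w + ALPHA * delta * e                     # TD(lambda) update
    return w

# Toy usage: one three-step episode over random "schedules".
rng = np.random.default_rng(0)
w = np.zeros(N_FEATURES)
traj = [rng.normal(size=N_FEATURES) for _ in range(4)]
w = td_lambda_episode(w, traj, rewards=[-0.1, -0.1, -5.0])
```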
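Once a value function is available, scheduling reduces to repeated one-step lookahead over candidate repairs. The sketch below assumes a hypothetical repair interface (`repair_actions`, `apply_repair`) supplied by the caller; the toy usage only illustrates the control flow, not the actual repair operators for the payload processing task.

```python
def greedy_repair(schedule, value_fn, repair_actions, apply_repair,
                  max_steps=1000):
    """Repair constraint violations by always taking the repair whose
    successor schedule the learned evaluation function rates highest."""
    for _ in range(max_steps):
        actions = repair_actions(schedule)   # candidate repairs for the
        if not actions:                      # current violations; empty
            return schedule                  # means conflict-free
        successors = [apply_repair(schedule, a) for a in actions]
        schedule = max(successors, key=value_fn)
    return schedule

# Toy usage: "schedules" are integers, a repair moves one step up or down,
# and the value function prefers states closer to conflict-free (zero).
result = greedy_repair(
    schedule=7,
    value_fn=lambda s: -abs(s),
    repair_actions=lambda s: [-1, 1] if s != 0 else [],
    apply_repair=lambda s, a: s + a,
)
assert result == 0
```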
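For learning from raw features, the time-delay idea can be sketched as a shared window network slid across the schedule's time steps, with its outputs averaged into a fixed-length vector so that schedules of different lengths map to inputs of the same size. The window size, hidden width, and per-time-step inputs below are illustrative assumptions, not the architecture developed in the dissertation.

```python
import numpy as np

rng = np.random.default_rng(1)

IN_DIM, WINDOW, HIDDEN = 5, 3, 4               # illustrative sizes
W1 = rng.normal(scale=0.1, size=(HIDDEN, WINDOW * IN_DIM))
b1 = np.zeros(HIDDEN)

def tdnn_features(timesteps):
    """Map a (T, IN_DIM) array of per-time-step features, for any
    T >= WINDOW, to a fixed HIDDEN-length vector by sliding the shared
    window network over time and averaging its activations."""
    T = timesteps.shape[0]
    outs = []
    for t in range(T - WINDOW + 1):
        window = timesteps[t:t + WINDOW].ravel()   # concatenate the window
        outs.append(np.tanh(W1 @ window + b1))     # shared weights per slot
    return np.mean(outs, axis=0)                   # length-invariant pooling

# Schedules of different lengths map to feature vectors of the same size.
short = tdnn_features(rng.normal(size=(6, IN_DIM)))
long_ = tdnn_features(rng.normal(size=(20, IN_DIM)))
assert short.shape == long_.shape == (HIDDEN,)
```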