A Better Model for Task Assignment in Server Farms: How Replication can Help

An age-old problem in the design of server farms is the choice of the task assignment policy. This is the algorithm that determines how to assign incoming jobs to servers. Popular policies include Round-Robin assignment, Join-the-Shortest-Queue, Join-Queue-with-Least-Work, and so on. While much research has studied assignment policies, little has taken into account server-side variability -- the fact that the server we choose might be temporarily and unpredictably slow. We show that when server-side variability dominates runtime, replication of jobs can be very beneficial. We introduce the Replication-d algorithm that replicates each arrival to d servers chosen at random, where the job is considered "done" as soon as the first replica completes. We provide an exact closed-form analysis of Replication-d. We next introduce a much more general model, one which takes both the inherent job size distribution and the server-side variability into account. This is a departure from traditional queueing models which only allow for one "size" distribution. We propose and analyze a new task assignment policy, Replicate-Idle-Queue (RIQ), which is designed to perform well given these dual sources of variability.
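The Replication-d idea described above can be illustrated with a small simulation. This is a minimal sketch and not the paper's analysis: it assumes Poisson arrivals with rate `lam`, replica runtimes that are independent and exponentially distributed with rate `mu` (a stand-in for runtime dominated by server-side variability), first-come-first-served queues, d distinct servers per job, and immediate cancellation of a job's remaining replicas once the first one completes. The function name `simulate_replication_d` and all parameter names are hypothetical.

```python
import random
from collections import deque


def simulate_replication_d(n, d, lam, mu, num_jobs, seed=0):
    """Estimate mean response time under a simplified Replication-d model.

    Assumptions (not taken from the source): Poisson arrivals (rate lam),
    i.i.d. exponential replica runtimes (rate mu), FCFS queues, and instant
    cancellation of the other replicas when the first one finishes.
    """
    rng = random.Random(seed)
    queues = [deque() for _ in range(n)]  # FCFS queue of job ids per server
    arrival_time = {}                     # job id -> arrival time
    placed = {}                           # job id -> servers holding a replica
    now, next_id, done, total = 0.0, 0, 0, 0.0

    while done < num_jobs:
        # Memorylessness lets us treat the system as a Markov chain: the next
        # event is an arrival (rate lam) or a completion at a busy server
        # (rate mu each).
        busy = [s for s in range(n) if queues[s]]
        rate = lam + mu * len(busy)
        now += rng.expovariate(rate)

        if rng.random() < lam / rate:
            # Arrival: send one replica to each of d servers chosen at random.
            servers = rng.sample(range(n), d)
            for s in servers:
                queues[s].append(next_id)
            arrival_time[next_id] = now
            placed[next_id] = servers
            next_id += 1
        else:
            # Completion: the job at the head of a random busy server is done.
            s = rng.choice(busy)
            job = queues[s][0]
            # The job counts as done at its first completion, so drop every
            # replica of it, whether queued or in service elsewhere.
            for t in placed.pop(job):
                queues[t].remove(job)
            total += now - arrival_time.pop(job)
            done += 1

    return total / done


if __name__ == "__main__":
    for d in (1, 2, 3):
        t = simulate_replication_d(n=10, d=d, lam=5.0, mu=1.0, num_jobs=20000)
        print(f"d={d}: mean response time ~ {t:.3f}")
```

With d = 1 the sketch reduces to random assignment with no replication, which gives a baseline for judging how much replication helps under these assumptions. Replacing the exponential runtimes with another distribution would require a full event-driven simulation, since the Markov-chain shortcut used here relies on memorylessness.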