Resource management for large-scale unreliable distributed systems

Parallel processing uses resources efficiently by executing several jobs simultaneously on different servers. In a well-controlled environment, where the status of the servers and the jobs is well known, the system behaves nearly deterministically, and replicating a job on several servers is clearly a waste of resources. In a poorly controlled environment, however, where servers are unreliable and/or their capacity is highly variable, it is desirable to design a system that is robust in the sense that its performance is not dominated by the poorly performing servers. By replicating jobs and assigning the copies to several different servers simultaneously, we not only achieve this robustness but, under certain conditions, also make the system more efficient, so that jobs are completed at a faster overall rate. When managing a large pool of unreliable resources, such as a computational Grid, the scheduling mechanism must also be efficient and scalable. In this thesis, we propose stochastic scheduling policies for Grid computing that exploit the random behavior of resources through job replication, and we study how different “degrees” of replication, ranging from no replication to full replication, affect the performance of a system of parallel servers. We find that, in general, more variability in service time favors more replication.
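The last claim can be made concrete with a small Monte Carlo sketch. The simulation below is purely illustrative and is not the model analyzed in the thesis: the two service-time distributions (a near-deterministic uniform law and a heavy-tailed Pareto law), the replication factors, and the "server-time" cost metric are all assumptions made for the example. The key observation it encodes is that a job replicated on c servers completes as soon as its fastest copy does, so its latency is the minimum of c independent service times.

```python
import random
import statistics

def mean_completion_time(draw, replication, num_jobs=100_000, seed=1):
    """Mean time until the fastest of `replication` independent copies
    of a job finishes, estimated by Monte Carlo over `num_jobs` jobs."""
    rng = random.Random(seed)
    return statistics.mean(
        min(draw(rng) for _ in range(replication)) for _ in range(num_jobs)
    )

# Two illustrative service-time distributions (assumptions, not the
# thesis's model):
#   low variability  -- uniform near 1.0 (nearly deterministic servers)
#   high variability -- heavy-tailed Pareto(alpha=1.5), a stand-in for
#                       stragglers on unreliable servers
distributions = {
    "low variability": lambda rng: rng.uniform(0.9, 1.1),
    "high variability": lambda rng: rng.paretovariate(1.5),
}

for label, draw in distributions.items():
    base = mean_completion_time(draw, replication=1)
    for c in (1, 2, 4):
        t = mean_completion_time(draw, replication=c)
        # Latency gain from replication, and t * c as a rough proxy for
        # total server-time consumed per job (the resource cost paid).
        print(f"{label:17s} c={c}: latency={t:5.2f} "
              f"(x{base / t:4.2f} faster), server-time~{t * c:5.2f}")
```

Under these assumptions, doubling the replication factor in the heavy-tailed case roughly halves latency at nearly constant total server-time (the minimum of c independent Pareto(alpha) variables is Pareto(c*alpha), with mean c*alpha/(c*alpha - 1)), whereas in the near-deterministic case the same doubling of resources yields almost no latency gain, consistent with the observation that higher service-time variability favors more replication.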