A probabilistic approach to distributed system management

Large-scale distributed systems are playing an increasing role in computational research, production operations, information processing, and application hosting. The continuous management of such systems is a critical consideration when focusing on reliability, availability, and security. As the number of commodity components within these systems continue to grow, it becomes increasingly difficult to track the multitude of parameters required to ensure optimal performance from the system, especially in those systems that have been built through expansion and not as an initial purchase of identical nodes. In this paper, we discuss the use of statistical inference, specifically Markov Logic Networks, in a distributed multi-agent system to provide the most effective means of managing these parameters. We showcase an architecture that provides services to manage a system's configuration throughout its life-cycle, and is capable of resolving differences after identifying potential mis-configurations using conflict discovery and resolution modules.