Automated agents for management and control of the ALICE Computing Grid
A complex software environment such as the ALICE Computing Grid infrastructure requires permanent control and management of the large set of services involved. Automating control procedures reduces human interaction with the various components of the system and yields better overall availability. In this paper we present how we used the MonALISA framework to gather, store and display the relevant metrics for the entire system, from both central and remote site services. We also show the automatic local and global procedures that are triggered by the monitored values. Decision-taking agents are used to restart remote services, alert the operators to problems that cannot be solved automatically, submit production jobs, replicate and analyze raw data, balance load across resources, and run other control mechanisms that optimize the overall workflow and simplify day-to-day operations. Synthetic graphical views of all operational parameters, correlations, and the state of services and applications, as well as the full history of all monitoring metrics, are available for the entire system, which now encompasses 85 sites all over the world, more than 14000 CPU cores and 10 PB of storage.
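As an illustration of the kind of decision-taking agent described above, the following is a minimal sketch in Python, not the MonALISA API or the authors' implementation. The class `ServiceAgent`, its `restart` and `alert` hooks, the `max_restarts` limit and the service name are all hypothetical; the sketch only assumes that a monitored up/down status arrives per service, that a restart is tried first, and that operators are alerted once restarts have not helped.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ServiceAgent:
    """Hypothetical decision agent: restart a failing service, then escalate."""
    restart: Callable[[str], None]   # assumed hook that restarts a service
    alert: Callable[[str], None]     # assumed hook that notifies an operator
    max_restarts: int = 3            # illustrative limit, not taken from the paper
    failures: Dict[str, int] = field(default_factory=dict)

    def on_metric(self, service: str, is_up: bool) -> str:
        """React to one monitored status value and return the action taken."""
        if is_up:
            self.failures.pop(service, None)  # healthy again: reset the counter
            return "ok"
        count = self.failures.get(service, 0) + 1
        self.failures[service] = count
        if count <= self.max_restarts:
            self.restart(service)             # try the automatic fix first
            return "restart"
        # Restarts did not help: hand the problem over to a human operator.
        self.alert(f"{service} still down after {self.max_restarts} restarts")
        return "alert"


# Usage with in-memory hooks standing in for real restart and alert channels.
restarted: List[str] = []
alerts: List[str] = []
agent = ServiceAgent(restart=restarted.append, alert=alerts.append, max_restarts=2)
statuses = (True, False, False, False, True)
actions = [agent.on_metric("site-A/storage", up) for up in statuses]
```

Passing the restart and alert channels in as callables keeps the decision logic separate from how a service is actually restarted or how an operator is reached, so the same rule could in principle run locally at a site or centrally.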