Experiment Dashboard for Monitoring of the LHC Distributed Computing Systems

The LHC experiments are currently taking collision data. The distributed computing model chosen by the four main LHC experiments allows physicists to benefit from resources spread all over the world. The distributed model and the scale of LHC computing activities increase the complexity of the middleware and, with it, the likelihood of failures or inefficiencies in the components involved. Monitoring the status of the distributed sites and services, together with monitoring the LHC computing activities, is therefore among the key factors in ensuring the required performance and functionality of the LHC computing system. Over the last years, the Experiment Dashboard team has been working on a number of applications that facilitate the monitoring of different activities, including following up on jobs and transfers as well as site and service availability. This presentation describes the Experiment Dashboard applications used by the LHC experiments and the experience gained during the first months of data taking.
