JACoW: Monitoring the new ALICE Online-Offline computing system

ALICE (A Large Ion Collider Experiment) is a particle detector at the CERN LHC (Large Hadron Collider), designed to study heavy-ion collisions and the physics of strongly interacting matter and the quark–gluon plasma. ALICE has been successfully collecting physics data since 2010. It is currently preparing for a major upgrade of its computing system, called O2 (Online-Offline), which is scheduled to be deployed during Long Shutdown 2 in 2019–2020. The O2 system will consist of 268 FLPs (First Level Processors) equipped with readout cards and 1500 EPNs (Event Processing Nodes) performing data aggregation, calibration, reconstruction and event building. The system will read out 27 Tb/s of raw data and record tens of PB of reconstructed data per year. To allow efficient operation of the upgraded experiment, a new Monitoring subsystem will provide a complete overview of the O2 computing system status and detect performance degradation and component failures. The ALICE O2 Monitoring subsystem will collect metrics at a rate of up to 600 kHz. It will consist of a custom monitoring library and a toolset covering four main functional tasks: metric collection, metric processing, storage, and visualization and alarming. This paper describes the Monitoring subsystem architecture and the feature set of the monitoring library. It also presents the results of multiple benchmarks, which are essential to ensure that the processing and storage performance requirements are met. In addition, it presents the evaluation of preselected tools for each of the functional tasks, including Collectd, Apache Flume, Apache Spark, InfluxDB and Grafana. It concludes by describing the next steps towards the final subsystem.
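
To give a feel for the scale implied by the figures above, the following is a minimal back-of-envelope sketch. It uses only the numbers quoted in the abstract (268 FLPs, 1500 EPNs, 27 Tb/s, 600 kHz). The even split of load across nodes, the assumption that both FLPs and EPNs emit metrics, and all function names are illustrative assumptions and are not taken from the paper.

```python
# Back-of-envelope figures derived from the numbers quoted in the abstract.
# Assumptions (NOT from the source): load is spread evenly across nodes,
# and both FLPs and EPNs contribute to the metric stream.

FLP_COUNT = 268             # First Level Processors
EPN_COUNT = 1500            # Event Processing Nodes
RAW_RATE_TBIT_S = 27        # raw data read out, in terabits per second
METRIC_RATE_HZ = 600_000    # peak metric rate, in metrics per second


def raw_rate_per_flp_gbit_s() -> float:
    """Average raw input per FLP in Gb/s, assuming an even split."""
    return RAW_RATE_TBIT_S * 1000 / FLP_COUNT


def raw_rate_tbyte_s() -> float:
    """Total raw data rate converted from terabits to terabytes per second."""
    return RAW_RATE_TBIT_S / 8


def metrics_per_node_per_s() -> float:
    """Average metric rate per node, assuming every FLP and EPN reports."""
    return METRIC_RATE_HZ / (FLP_COUNT + EPN_COUNT)


def metrics_per_day() -> int:
    """Metrics produced in 24 hours if the peak rate were sustained."""
    return METRIC_RATE_HZ * 86_400


if __name__ == "__main__":
    print(f"Raw rate per FLP:     {raw_rate_per_flp_gbit_s():.1f} Gb/s")
    print(f"Total raw rate:       {raw_rate_tbyte_s():.3f} TB/s")
    print(f"Metrics per node:     {metrics_per_node_per_s():.0f} /s")
    print(f"Metrics per day:      {metrics_per_day():.3e}")
```

Under these assumptions each FLP handles roughly 100 Gb/s of raw data, and a sustained peak metric rate would amount to about 5 × 10^10 metrics per day, which illustrates why the processing and storage benchmarks matter.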