Novel functional and distributed approaches to data analysis available in ROOT

The bright future of particle physics at the Energy and Intensity frontiers poses exciting challenges to the scientific software community. The traditional strategies for processing and analysing data are evolving in order to (i) offer higher-level programming models and (ii) exploit parallelism to cope with the ever increasing complexity and size of the datasets. This contribution describes how the ROOT framework, a cornerstone of software stacks dedicated to particle physics, is preparing to provide adequate solutions for the analysis of large amounts of scientific data on parallel architectures. The functional approach to parallel data analysis provided by the ROOT TDataFrame interface is then characterised. The design choices behind this new interface are described and compared with those of other widely adopted tools such as Pandas and Apache Spark. The programming model is illustrated, highlighting the reduction of boilerplate code, the composability of actions and data transformations, and the ability to deal with different data sources such as ROOT, JSON, CSV or databases. Details are given about how the functional approach allows transparent implicit parallelisation of the chain of operations specified by the user. The progress made in the field of distributed analysis is examined; in particular, the power of integrating ROOT with Apache Spark via the PyROOT interface is shown. In addition, the building blocks for the expression of parallelism in ROOT are briefly characterised, together with the structural changes to the build and test infrastructure that were necessary to put them in production.

1. Future challenges for analysis at the Energy and Intensity frontiers

The full exploitation of the LHC is the highest priority of the European Strategy for Particle Physics adopted as part of the ESFRI Roadmap [1]. A major upgrade of the LHC will take place in about three years from now: the luminosity delivered by the machine will be about ten times higher than the present nominal one. Such an evolution poses a series of challenges to all steps of the HEP data processing chain: triggering, reconstruction, simulation, digitisation and analysis. Assuming the performance of the software presently in use for HEP data processing and a reasonable evolution of hardware technologies, the amount of resources the LHC community will need in order to cope with these computational requirements will be about ten times larger than it is today [2]. This is clearly not realistic. In addition, a sizable fraction of the required computing resources is to be accounted for by the High Intensity Frontier programme: about 10% of the budget presently required by an LHC experiment [3].
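As a minimal sketch of the declarative programming model summarised in the abstract, the following C++ snippet composes two transformations (a filter and a defined column) with a histogramming action and opts in to ROOT's implicit multi-threading. It assumes a ROOT release in which TDataFrame is available (the interface later evolved into ROOT::RDataFrame); the tree, file and branch names are hypothetical placeholders.

// Minimal TDataFrame sketch; tree/file/branch names are placeholders.
#include <ROOT/TDataFrame.hxx>
#include <TCanvas.h>

int main() {
   // Opt in to implicit multi-threading: the event loop below is
   // transparently parallelised over the entries of the input tree.
   ROOT::EnableImplicitMT();

   // Build a data frame from a tree called "events" in "data.root".
   ROOT::Experimental::TDataFrame df("events", "data.root");

   // Declarative chain: transformations (Filter, Define) followed by
   // an action (Histo1D). No explicit event loop or boilerplate code.
   auto h = df.Filter("nMuons >= 2")
              .Define("pt2", "px*px + py*py")
              .Histo1D("pt2");

   TCanvas c;
   h->Draw();           // the event loop is triggered lazily, here, once
   c.SaveAs("pt2.png");
   return 0;
}

The chain is evaluated lazily: transformations and actions are only booked when declared, and the single event loop that produces all booked results runs when a result is first accessed, which is what allows the same user code to be parallelised transparently.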

[1] M. Lamanna et al., SWAN: A service for interactive analysis in the cloud, Future Gener. Comput. Syst. (2018).