Stability and scalability of the CMS Global Pool: Pushing HTCondor and glideinWMS to new limits

The CMS Global Pool, based on HTCondor and glideinWMS, is the main computing resource provisioning system for all CMS workflows, including analysis, Monte Carlo production, and detector data reprocessing activities. The total resources at Tier-1 and Tier-2 grid sites pledged to CMS exceed 100,000 CPU cores, while another 50,000 to 100,000 CPU cores are available opportunistically, pushing the needs of the Global Pool to higher scales each year. These resources are becoming more diverse in their accessibility and configuration over time. Furthermore, the challenge of running stably at ever higher scales while introducing new modes of operation such as multi-core pilots, together with the chaotic nature of physics analysis workflows, places huge strains on the submission infrastructure. This paper details some of the most important challenges to scalability and stability that the CMS Global Pool has faced since the beginning of the LHC Run II and how they were overcome.

1. The CMS global pool

The CMS Global Pool is a single HTCondor [1] pool covering all Grid computing processing resources pledged to CMS plus significant Cloud and opportunistic resources. Resource provisioning is performed by a glideinWMS [2] frontend, which contacts several glideinWMS factories in order to submit pilot jobs to sites. Payload jobs are then matched to pilots by the HTCondor Negotiator, which runs as part of the Central Manager of the pool, as can be seen in Figure 1. The other main element of the HTCondor Central Manager is the Collector, which maintains information about the various HTCondor pool daemons described below.

The main components of this Global Pool include a glideinWMS frontend and factories, the HTCondor Central Manager, and the Condor Connection Broker (CCB), deployed in 24-core virtual machines (VMs) with 48 GB of RAM running on hypervisors with 10 Gbps Ethernet connectivity. Such a setup is deployed at CERN with an analogous infrastructure for High Availability (HA)