The CMS analysis computing model has always relied on jobs running near the data, with data allocation between CMS compute centers organized at the management level based on the expected needs of the CMS community. While this model provided high CPU utilization during job run times, there were periods when a large fraction of CPUs at certain sites sat idle due to lack of demand, while terabytes of data were never accessed. To improve the utilization of both CPU and disk, CMS is moving toward controlled overflowing of jobs from sites that have the data but are oversubscribed to sites with spare CPU and network capacity, with those jobs accessing the data through real-time Xrootd streaming over the WAN. The major limiting factor for remote data access is the ability of the source storage system to serve the data, so the number of jobs accessing it must be carefully controlled. The CMS approach is to implement the overflowing by means of glideinWMS, a Condor-based pilot system, providing the WMS with the known storage limits and letting it schedule jobs within those limits. This paper presents the detailed architecture of the overflow-enabled glideinWMS system, together with operational experience from the past six months.
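The core scheduling idea, letting the WMS admit overflow jobs only while each source storage system stays within its known serving capacity, can be sketched as follows. This is an illustrative model only; the class and site names are hypothetical and do not reflect the actual glideinWMS implementation.

```python
# Hypothetical sketch of limit-aware overflow scheduling: each source site
# advertises how many concurrent WAN streams its storage can serve, and the
# scheduler admits an overflow job only while that limit is not saturated.
from collections import defaultdict


class OverflowScheduler:
    def __init__(self, stream_limits):
        # stream_limits: per-site cap on concurrent remote Xrootd reads
        self.stream_limits = stream_limits
        self.active = defaultdict(int)  # currently running streams per site

    def try_start(self, source_site):
        """Admit an overflow job streaming from source_site if capacity allows."""
        if self.active[source_site] < self.stream_limits.get(source_site, 0):
            self.active[source_site] += 1
            return True
        return False  # storage at the source site is already saturated

    def finish(self, source_site):
        """Release a streaming slot when an overflow job completes."""
        self.active[source_site] -= 1


# Example: a site whose storage can serve two concurrent WAN streams
sched = OverflowScheduler({"T2_SOURCE_SITE": 2})
started = [sched.try_start("T2_SOURCE_SITE") for _ in range(3)]
# started == [True, True, False]: the third job waits until a slot frees up
```

In the real system the same effect is achieved inside Condor's matchmaking rather than in a separate component, but the invariant is the one shown: jobs never exceed the advertised serving capacity of the source storage.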