Massive-scale data management using standards-based solutions

In common with many large institutes, CERN has traditionally developed and maintained its own data management solutions. Recently, a significant change of direction has taken place and we have now adopted commercial tools, together with a small amount of site-specific code, for this task. The solutions chosen were originally studied as part of research and development projects oriented towards the Large Hadron Collider (LHC), which is currently under construction at CERN. They have since been adopted not only by the LHC collaborations, which are due to take production data starting in 2005, but also by numerous current experiments, both at CERN and at other High Energy Physics laboratories. Previous experiments, which used data management tools developed in-house, are also studying a possible move to the new environment. To meet the needs of today's experiments, data rates of up to 35 MB/second and data volumes of many hundreds of TB per experiment must be supported. Data distribution to multiple sites must be provided, together with concurrent access by some 100 users. Meeting these requirements provides a convenient stepping stone towards those of the LHC, where data rates of 100-1500 MB/second must be supported, together with data volumes of up to 20 PB per experiment. We describe the current status of the production systems, the database administration tools that have been developed (typically using the database's Java binding), the data distribution techniques employed and the performance characteristics of the system. Mechanisms that permit users to find and navigate through their data are discussed, including issues of naming and meta-data and the associated browsers. Coupled to the data management solution is a complete data analysis environment. This too is largely based on commercial, standards-conforming components, with application-specific extensions added as required. We discuss our experience with these solutions, including issues related to the integration of several commercial packages. Costs and risk factors are described, as well as the responsiveness of the vendors to enhancement requests and their support for additional platforms. Finally, we discuss the extent to which we have succeeded in our goal of using commodity solutions, and the advantages of a database approach, as opposed to a file-based one, for the management of vast volumes of data.
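
The abstract mentions administration tools built on the database's Java binding. The product, schema and API in use at CERN are not specified in this excerpt, so the sketch below is only illustrative: it uses generic JDBC, a hypothetical "datasets" catalogue table and a placeholder connection URL to convey the style of tool described, not the actual implementation.

    // Illustrative sketch only: generic JDBC, hypothetical schema and connection details.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class DatasetSpaceReport {
        public static void main(String[] args) throws Exception {
            // Connection URL and credentials would be site-specific; these are placeholders.
            String url = args.length > 0 ? args[0] : "jdbc:oracle:thin:@//dbhost:1521/exptdb";

            try (Connection conn = DriverManager.getConnection(url, "admin_user", "admin_password");
                 PreparedStatement stmt = conn.prepareStatement(
                         // Hypothetical catalogue table: one row per dataset with its size in bytes.
                         "SELECT experiment, SUM(size_bytes) AS total_bytes " +
                         "FROM datasets GROUP BY experiment ORDER BY total_bytes DESC");
                 ResultSet rs = stmt.executeQuery()) {

                // Report per-experiment storage usage in terabytes.
                while (rs.next()) {
                    double tb = rs.getDouble("total_bytes") / 1e12;
                    System.out.printf("%-12s %10.2f TB%n", rs.getString("experiment"), tb);
                }
            }
        }
    }

Such a report-style tool, run against the experiment catalogue, is one example of the routine administration tasks the abstract alludes to; the real tools would target whatever schema and binding the chosen commercial database provides.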