Analysis of the ECMWF Storage Landscape

Although domain-specific digital archives are growing in number and size, few studies describe their architectures and runtime characteristics. This paper investigates the storage landscape of the European Centre for Medium-Range Weather Forecasts (ECMWF), whose storage capacity has reached 100 PB and is growing by about 45% per year. Out of this storage, we examine a 14.8 PB user archive and a 37.9 PB object database for meteorological data over periods of 29 and 50 months, respectively. We analyze the systems' log files to characterize traffic and user behavior, metadata snapshots to identify the current content of the storage systems, and logs of tape libraries to investigate cartridge movements. We have built a caching simulator to examine the efficiency of disk caches for various cache sizes and algorithms, and we investigate the potential of tape prefetching strategies. While our findings for the user archive resemble those of previous studies on digital archives, our study of the object database is the first one in the field of large-scale active archives.
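
To make the idea of a trace-driven caching simulator concrete, the following is a minimal sketch, not the simulator built for this study. It assumes a hypothetical trace format of (object_id, size_bytes) request pairs and uses LRU as one example replacement algorithm; the function name `simulate_lru` and the byte-based capacity are illustrative assumptions.

```python
from collections import OrderedDict


def simulate_lru(trace, capacity_bytes):
    """Replay a request trace through an LRU cache and return the hit ratio.

    `trace` is an iterable of (object_id, size_bytes) pairs. This format is
    an assumption made for illustration, not the log format of the study.
    """
    cache = OrderedDict()  # object_id -> size, ordered oldest to newest use
    used = 0               # bytes currently held in the cache
    hits = 0
    total = 0
    for obj, size in trace:
        total += 1
        if obj in cache:
            hits += 1
            cache.move_to_end(obj)  # mark as most recently used
            continue
        if size > capacity_bytes:
            continue  # object can never fit; serve it without caching
        # Evict least recently used objects until the new one fits.
        while used + size > capacity_bytes:
            _, evicted_size = cache.popitem(last=False)
            used -= evicted_size
        cache[obj] = size
        used += size
    return hits / total if total else 0.0


if __name__ == "__main__":
    # Tiny synthetic trace; real traces would come from system logs.
    demo = [("a", 1), ("b", 1), ("a", 1), ("c", 1), ("b", 1)]
    for capacity in (2, 3):
        print(capacity, simulate_lru(demo, capacity))
```

Running the same trace against several capacities, as in the loop above, is the basic pattern for comparing cache sizes; other replacement algorithms could be compared by swapping the eviction rule.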
