Scientific User Behavior and Data-Sharing Trends in a Petascale File System

The Oak Ridge Leadership Computing Facility (OLCF) runs the No. 4 supercomputer in the world, supported by a petascale file system, to facilitate scientific discovery. In this paper, using daily file system metadata snapshots collected over 500 days, we study the behavioral trends of 1,362 active users and 380 projects across 35 science domains. In particular, we analyze both the individual and collective behavior of users and projects, highlighting the needs of individual communities as well as the overall requirements for operating the file system. We analyze the metadata across three dimensions: (i) the projects' file generation and usage trends, using quantitative file system-centric metrics, (ii) scientific user behavior on the file system, and (iii) the data-sharing trends of users and projects. To the best of our knowledge, our work is the first of its kind to provide comprehensive insights into user behavior from multiple science domains through metadata analysis of a large-scale shared file system. We envision that this OLCF case study will provide valuable insights for the design, operation, and management of storage systems at scale, and will encourage other HPC centers to undertake similar efforts.