Paper Information - Dynamic Non-Hierarchical File Systems for Exascale Storage

Dynamic Non-Hierarchical File Systems for Exascale Storage

Modern high-end computing (HEC) systems must manage petabytes of data stored in billions of files, yet current techniques for naming and managing files were developed 40 years ago for collections of thousands of files. HEC users are therefore forced to adapt their usage to fit an outdated file system model and interface, unsuitable for exascale systems. Attempts to enrich the interface, such as augmentation or replacement with databases, or the layering of additional interfaces and semantic extensions atop existing filesystems, result in performance-limited systems that do not adequately scale. Parallels exist between HEC systems and the web, where locating and browsing data sets has rapidly become dominated by search. The strengths and weaknesses of the web provide several useful lessons from which we have learned: 1) Although the web implements a hierarchical namespace, search has become the dominant navigation tool in the face of the massive volume of data that is accessible; 2) While finding some information is easy, finding the right or good information is not; 3) The easier it is for people to contribute information to a repository, the more critical it becomes to determine the veracity of that data; 4) The links that relate documents provide valuable insight into the importance of documents. From these observations we can see that simply modifying existing high-performance filesystems to support search, and the requisite storage of additional semantic metadata, would be woefully inadequate. We propose to develop a radically different filesystem structure that addresses these problems directly, and which will leverage provenance (a record of the data and processes that contributed to its creation), file content, and rich semantic metadata to provide a scalable and searchable file namespace. Such a namespace would allow the tracking of data as it moves through the scientific workflow.
This allows scientists to better find and utilize the data they need, using both content and data history to identify and manage stored information. We take advantage of the familiar search-based metaphor to provide an initial easy-to-use interface that enables users to find the files they need and evaluate the authenticity and quality of those files. Realizing this vision requires research success in dynamic, non-hierarchical file systems design and implementation, large-scale metadata management, efficient scalable indexing, and automatic provenance capture.

Darrell D. E. Long | Ethan L. Miller