Investigation on distributed file system for scientific big data storage

With the rapid development of new-generation scientific instruments, storing an unprecedented amount of scientific data (e.g., more than 14 exabytes per day in the SKA project) is becoming a critical problem. The amount of scientific data is growing exponentially, doubling approximately every year. Traditional data storage techniques such as DAS, NAS, and SAN cannot fully meet data access requirements because of their low access performance, limited scalability, and weak availability. After decades of development, distributed storage technology is regarded as the most appropriate technology for storing and archiving massive scientific data. To further promote the advancement of distributed storage technology, in this paper we investigate the distributed file system (DFS) in depth. We discuss the general architecture of a DFS and describe in detail the implementation of four current mainstream distributed file systems. We then compare several important features of each DFS and discuss the application scenarios of each. Finally, we present a series of research directions for future studies of scientific data storage. The study provides a reference for other studies and contributes to current and future data-intensive storage.