Toward Democratizing Access to Facilities Data: A Framework for Intelligent Data Discovery and Delivery

Data collected by large-scale instruments, observatories, and sensor networks are key enablers of scientific discoveries in many disciplines. However, ensuring that these data can be accessed, integrated, and analyzed in a democratized and timely manner remains a challenge. In this article, we explore how state-of-the-art techniques for data discovery and access can be adapted to facility data and develop a conceptual framework for intelligent data access and discovery.

Science in the 21st century is being transformed by our unprecedented ability to collect and process data from a variety of sources. For example, large-scale multiuser scientific observatories, instruments, and experimental platforms provide a broad community of researchers and educators with open access to shared-use infrastructure and data products generated from geodistributed instruments and equipment [1]. These large facilities (LFs) have recently enabled significant scientific discoveries such as the detection of gravitational waves [2] and the imaging of the event horizon of a black hole [3]. However, as the number and scale of such LFs increase, along with corresponding growth in the number, distribution, and diversity of users, ensuring that LF data can be discovered, accessed, integrated, and analyzed in a timely manner is a growing challenge that places significant demands on LF cyberinfrastructure (CI) [4]. For example, the Ocean Observatories Initiative (OOI) [5] integrates over 1,250 instruments, producing over 25,000 data items and over 100,000 data products. Similarly, each antenna of the Square Kilometre Array (SKA), the world's largest radio telescope project, produces raw data at a rate of approximately 0.5 to 1 TB per second, and each telescope yields approximately 300 PB of pre-processed data per year (https://www.skatelescope.org/the-skaproject/).

[1] The LIGO Scientific Collaboration, et al. Observation of Gravitational Waves from a Binary Black Hole Merger, 2016, arXiv:1602.03837.

[2] Kevin Fauvel, et al. A Distributed Multi-Sensor Machine Learning Approach to Earthquake Early Warning, 2020, AAAI.

[3] Manish Parashar, et al. Facilitating Data Discovery for Large-scale Science Facilities using Knowledge Networks, 2021, 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS).

[4] Zhiyuan Liu, et al. Learning Entity and Relation Embeddings for Knowledge Graph Completion, 2015, AAAI.

[5] Vasant Honavar, et al. The Virtual Data Collaboratory: A Regional Cyberinfrastructure for Collaborative Data-Driven Research, 2020, Computing in Science & Engineering.

[6] Larry Smarr, et al. The Pacific Research Platform: Making High-Speed Networking a Reality for the Scientist, 2018, PEARC.

[7] Ivan Rodero, et al. Leveraging User Access Patterns and Advanced Cyberinfrastructure to Accelerate Data Delivery from Shared-use Scientific Observatories, 2020, Future Gener. Comput. Syst.

[8] Prabhat, et al. An Assessment of Data Transfer Performance for Large-Scale Climate Data Analysis and Recommendations for the Data Infrastructure for CMIP6, 2017, ArXiv.

[9] Chih-Wei L. Huang, et al. First M87 Event Horizon Telescope Results. IV. Imaging the Central Supermassive Black Hole, 2019, The Astrophysical Journal.

[10] Yixin Cao, et al. KGAT: Knowledge Graph Attention Network for Recommendation, 2019, KDD.

[11] Manish Parashar, et al. Data Cyberinfrastructure for End-to-End Science, 2020, Computing in Science & Engineering.