Identification and characterization of information-networks in long-tail data collections

Scientists' ability to synthesize and reuse long-tail scientific data lags far behind their ability to collect and produce these data. Many Earth science cyberinfrastructures enable scientists to share and publish their data over the web using metadata standards. While profiling data attributes advances the Linked Data approach, it has become clear that building information-networks among distributed data silos is essential to increasing their integration and reusability. In this research, we developed a Long-Tail Information-Network (LTIN) model, which uses a metadata-driven approach to build semantic information-networks among datasets published over the web and to aggregate them around environmental events. The model identifies and characterizes the spatial and temporal contextual association links and dependencies among datasets. This paper presents the design and application of the LTIN model and an evaluation of its performance. The model's capabilities were demonstrated by inferring the information-network of a stream discharge dataset located at the downstream end of the Illinois River.
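To make the idea of metadata-driven spatial and temporal association links more concrete, the following is a minimal sketch and not the LTIN implementation itself. It assumes each dataset's metadata exposes a bounding box and a time interval, and it links two datasets when both their extents and their intervals overlap. The `DatasetMeta` class, the function names, and all coordinates and dates are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date
from itertools import combinations


@dataclass(frozen=True)
class DatasetMeta:
    """Hypothetical metadata record; field names are illustrative assumptions."""
    name: str
    bbox: tuple  # (min_lon, min_lat, max_lon, max_lat)
    start: date
    end: date


def bbox_intersects(a, b):
    """True when two (min_lon, min_lat, max_lon, max_lat) boxes overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def intervals_overlap(a, b):
    """True when the time intervals of two datasets share at least one day."""
    return a.start <= b.end and b.start <= a.end


def build_links(datasets):
    """Return name pairs for datasets associated in both space and time."""
    links = []
    for a, b in combinations(datasets, 2):
        if bbox_intersects(a.bbox, b.bbox) and intervals_overlap(a, b):
            links.append((a.name, b.name))
    return links


# Made-up example records (not taken from the paper).
datasets = [
    DatasetMeta("discharge", (-90.7, 38.9, -90.4, 39.1),
                date(2013, 4, 1), date(2013, 5, 31)),
    DatasetMeta("rainfall", (-91.0, 38.5, -88.0, 41.5),
                date(2013, 4, 15), date(2013, 4, 20)),
    DatasetMeta("soil", (-89.0, 40.0, -88.5, 40.5),
                date(2010, 1, 1), date(2010, 12, 31)),
]

links = build_links(datasets)
print(links)
```

In this sketch only the discharge and rainfall records are linked: the soil record overlaps the rainfall extent in space but not in time, and it does not overlap the discharge extent at all. A fuller model would presumably distinguish kinds of spatial and temporal relationships rather than using a single overlap test.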