Artificial intelligence for ocean science data integration: current state, gaps, and way forward

Oceanographic research is a multidisciplinary endeavor that involves the acquisition of an increasing amount of in-situ and remotely sensed data. A large and growing number of studies and data repositories are now available online. However, manually integrating different datasets is a tedious and grueling process, which has created a rising need for automated integration tools. A key challenge in oceanographic data integration is to map between data sources that have no common schema and that were collected, processed, and analyzed using different methodologies. Concurrently, artificial agents are becoming increasingly adept at extracting knowledge from text and using domain ontologies to integrate and align data. Here, we deconstruct the process of ocean science data integration, providing a detailed description of its three phases: discover, merge, and evaluate/correct. In addition, we identify the key missing tools and underutilized information sources currently limiting the automation of the integration process.
Efforts to address these limitations should focus on (i) development of artificial intelligence-based tools for assisting ocean scientists in aligning their schema with existing ontologies when organizing their measurements in datasets; (ii) extension and refinement of the conceptual coverage of, and conceptual alignment between, existing ontologies, to better fit the diverse and multidisciplinary nature of ocean science; (iii) creation of ocean-science-specific entity resolution benchmarks to accelerate the development of tools utilizing ocean science terminology and nomenclature; (iv) creation of ocean-science-specific schema matching and mapping benchmarks to accelerate the development of matching and mapping tools utilizing semantics encoded in existing vocabularies and ontologies; (v) annotation of datasets, and development of tools and benchmarks for the extraction and categorization of data quality and preprocessing descriptions from scientific text; and (vi) creation of large-scale word embeddings trained on ocean science literature to accelerate the development of information extraction and matching tools based on artificial intelligence.
