Google Dataset Search: Building a search engine for datasets in an open Web ecosystem

There are thousands of data repositories on the Web, providing access to millions of datasets. National and regional governments, scientific publishers and consortia, commercial data providers, and others publish data in fields ranging from social science and life science to high-energy physics and climate science. Access to this data is critical to facilitating reproducibility of research results, enabling scientists to build on others' work, and giving data journalists easier access to information and its provenance. In this paper, we discuss Google Dataset Search, a dataset-discovery tool that provides search capabilities over potentially all datasets published on the Web. The approach relies on an open ecosystem, where dataset owners and providers publish semantically enhanced metadata on their own sites. We then aggregate, normalize, and reconcile this metadata, providing a search engine that lets users find datasets in the “long tail” of the Web. We discuss both the social and the technical challenges in building this type of tool, and the lessons we learned from this experience.
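To make the aggregate, normalize, and reconcile steps concrete, here is a minimal sketch in Python. It assumes the "semantically enhanced metadata" is schema.org `Dataset` markup embedded in pages as JSON-LD, which the abstract itself does not specify. The function names (`extract_datasets`, `normalize`, `reconcile`) and the merge key (a lowercased `identifier`, falling back to the dataset name) are hypothetical choices made for illustration, not the system's actual pipeline.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def extract_datasets(html):
    """Aggregate: return all JSON-LD objects whose @type is the string 'Dataset'."""
    parser = JsonLdExtractor()
    parser.feed(html)
    datasets = []
    for block in parser.blocks:
        try:
            doc = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed markup instead of failing the whole page
        items = doc if isinstance(doc, list) else [doc]
        for item in items:
            if isinstance(item, dict) and item.get("@type") == "Dataset":
                datasets.append(item)
    return datasets


def normalize(record):
    """Normalize: collapse whitespace in the name and lowercase the identifier."""
    name = " ".join(str(record.get("name", "")).split())
    identifier = str(record.get("identifier", "")).strip().lower()
    urls = {record["url"]} if record.get("url") else set()
    return {"name": name, "identifier": identifier, "urls": urls}


def reconcile(records):
    """Reconcile: merge records that share a key (assumed: identifier, else name)."""
    merged = {}
    for rec in map(normalize, records):
        key = rec["identifier"] or rec["name"].lower()
        if key in merged:
            merged[key]["urls"] |= rec["urls"]
        else:
            merged[key] = rec
    return list(merged.values())
```

In this sketch, two pages that describe the same dataset under the same identifier collapse into a single record whose `urls` set lists both locations. A real system would need far more robust matching than an exact key comparison.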
