Delve: A Dataset-Driven Scholarly Search and Analysis System

Research and experimentation in many scientific fields rest on the observation, analysis, and benchmarking of datasets. The advancement of research and development has thus strengthened the importance of dataset access. However, without sufficient knowledge of relevant datasets, researchers usually have to go through a process we term "manual dataset retrieval". With the accelerating rate of scholarly publication, manually finding the relevant datasets for a given research area based on their usage or popularity is becoming increasingly difficult and tedious. In this paper, we present Delve, a web-based dataset retrieval and document analysis system. Unlike traditional academic search engines and dataset repositories, Delve is dataset-driven and provides a medium for retrieving datasets based on their suitability or usage in a given field. It also visualizes dataset and document citation relationships, and enables users to analyze a scientific document by uploading its full PDF. We first discuss why the scientific community needs a system like Delve. We then introduce its internal design and explain how Delve works and how it benefits researchers at all levels.
