A semi-automatic approach for detecting dataset references in social science texts

Today, the full texts of scientific articles are often stored in different locations than the datasets they use. Dataset registries aim at closer integration by making datasets citable, but authors typically refer to datasets using inconsistent abbreviations and heterogeneous metadata (e.g. title, publication year). It is thus hard to reproduce research results, to access datasets for further analysis, and to determine the impact of a dataset. Manually detecting references to datasets in scientific articles is time-consuming and requires expert knowledge of the underlying research domain. We propose and evaluate a semi-automatic three-step approach for finding explicit references to datasets in social science articles. We first extract pre-defined special features from dataset titles in the da|ra registry, then detect references to datasets using the extracted features, and finally match the references found with the corresponding dataset titles. The approach does not require a corpus of articles (avoiding the cold start problem) and performs well on a test corpus: we achieved an F-measure of 0.84 for detecting references in full texts and an F-measure of 0.83 for finding correct matches of detected references in the da|ra dataset registry.
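
The three steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the example titles are invented, the choice of features (upper-case abbreviations and phrases ending in cue words such as "Survey" or "Study") is an assumption about what "special features" might look like, and plain token-count cosine similarity stands in for whatever matching function the approach actually uses.

```python
import math
import re
from collections import Counter

# Hypothetical registry titles, invented for illustration only.
TITLES = [
    "German General Social Survey (ALLBUS) 2010",
    "Eurobarometer 71.1 (2009)",
    "European Values Study 2008 (EVS)",
]

ABBREVIATION = re.compile(r"\b[A-Z]{2,}\b")   # assumed feature: all-caps tokens
CUE_WORDS = {"survey", "study", "panel"}      # assumed cue words, not from the paper


def tokens(text):
    """Lower-cased alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def extract_features(titles):
    """Step 1: collect abbreviations and two-word cue phrases from titles."""
    features = set()
    for title in titles:
        features.update(ABBREVIATION.findall(title))
        words = re.findall(r"[A-Za-z]+", title)
        for i, word in enumerate(words):
            if word.lower() in CUE_WORDS and i > 0:
                features.add(f"{words[i - 1]} {word}")
    return features


def detect_references(text, features):
    """Step 2: keep sentences that mention at least one extracted feature."""
    hits = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        for feature in features:
            # Abbreviations are matched case-sensitively, phrases are not.
            flags = 0 if feature.isupper() else re.IGNORECASE
            if re.search(rf"\b{re.escape(feature)}\b", sentence, flags):
                hits.append(sentence)
                break
    return hits


def cosine(a, b):
    """Cosine similarity over raw token counts (a simple stand-in)."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def match_title(reference, titles):
    """Step 3: pick the registry title most similar to the detected reference."""
    return max(titles, key=lambda title: cosine(reference, title))
```

Because the features come from the registry titles alone, the sketch needs no annotated article corpus, which mirrors the cold-start argument above. A caller would run `extract_features` once over the registry, then apply `detect_references` and `match_title` to each article's text and hand the candidate matches to a human for confirmation.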
