Identifying References to Datasets in Publications

Research data and publications are usually stored in separate and structurally distinct information systems. Links between these resources are often not explicitly available, which complicates the search for previous research. In this paper, we propose a pattern induction method for the detection of study references in full texts. Since these references are not specified in a standardized way and may occur in a variety of contexts (e.g., captions, footnotes, or continuous text), our algorithm must induce very flexible patterns. To overcome the sparse distribution of training instances, we induce patterns iteratively using a bootstrapping approach. We show that our method achieves promising results for the automatic identification of data references and is a first step towards building an integrated information system.
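
The iterative idea can be illustrated with a minimal sketch. It is not the method evaluated in the paper: the toy sentences, the two-word left-context window, the capitalized-token slot, and the function names `induce_patterns`, `apply_patterns`, and `bootstrap` are all assumptions made for illustration. Starting from one seed study name, the loop turns the words preceding each known name into a pattern, applies all patterns to the corpus, and adds newly matched names to the pool until nothing new is found.

```python
import re

# Toy corpus invented for illustration; it is not data from the paper.
CORPUS = [
    "We analyse data from the ALLBUS 2008 survey.",
    "Estimates are based on data from the Eurobarometer 2010 study.",
    "Source: Eurobarometer 2010, own calculations.",
    "Source: SOEP 2005, weighted.",
    "The weather was nice.",
]

# Assumed slot shape: one capitalized token stands in for a study name.
SLOT = r"([A-Z][\w-]+)"


def induce_patterns(sentences, names, window=2):
    """Build one regex per occurrence of a known name from its left context."""
    patterns = set()
    for sentence in sentences:
        for name in names:
            match = re.search(r"\b" + re.escape(name) + r"\b", sentence)
            if not match:
                continue
            left = sentence[:match.start()].split()[-window:]
            if not left:
                continue  # no context to generalize from
            left_re = r"\s+".join(re.escape(word) for word in left)
            patterns.add(left_re + r"\s+" + SLOT)
    return patterns


def apply_patterns(sentences, patterns):
    """Collect every string captured by any pattern in any sentence."""
    found = set()
    for sentence in sentences:
        for pattern in patterns:
            found.update(re.findall(pattern, sentence))
    return found


def bootstrap(sentences, seeds, max_rounds=5):
    """Alternate pattern induction and extraction until no new names appear."""
    names = set(seeds)
    for _ in range(max_rounds):
        patterns = induce_patterns(sentences, names)
        new_names = apply_patterns(sentences, patterns) - names
        if not new_names:
            break
        names |= new_names
    return names


print(sorted(bootstrap(CORPUS, {"ALLBUS"})))
# ['ALLBUS', 'Eurobarometer', 'SOEP']
```

In this toy run the seed yields a continuous-text pattern that finds a second name, and that name in turn yields a caption-style pattern ("Source: ...") that finds a third. A realistic system would also need to score or filter patterns, since an overly general context would otherwise add noise in every round.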
