The Automatic Detection of Dataset Names in Scientific Articles

We study the task of recognizing named datasets in scientific articles as a Named Entity Recognition (NER) problem. Because the available annotated datasets were not adequate for our goals, we annotated 6000 sentences extracted from four major AI conferences, roughly half of which contain one or more named datasets. A distinguishing feature of this set is its many sentences with enumerations, conjunctions and ellipses, which result in long BI+ tag sequences. On all measures, the SciBERT NER tagger performed best and most robustly. Our baseline rule-based tagger performed remarkably well, and better than several state-of-the-art methods. The gold standard dataset, with links and offsets from each sentence to the (open-access) articles, together with the annotation guidelines and all code used in the experiments, is available on GitHub.
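To make the tagging formulation concrete, the following is a minimal sketch of how a B/I/O tag sequence could be decoded into dataset-name spans. It is not the authors' code: the function name `bio_spans`, the example sentence, and its tags are all invented for illustration. In particular, whether an enumeration is annotated as one long span or as several short ones is a matter for the annotation guidelines, and this sketch does not settle it.

```python
from typing import List, Tuple


def bio_spans(tokens: List[str], tags: List[str]) -> List[Tuple[int, int, str]]:
    """Collect (start, end, text) spans from a B/I/O tag sequence.

    `end` is exclusive. A stray "I" with no open span is treated like "O".
    """
    spans = []
    start = None  # index where the currently open span began, if any
    for i, tag in enumerate(tags):
        if tag == "I" and start is not None:
            continue  # extend the open span
        if start is not None:
            # A "B" or "O" (or anything else) closes the open span.
            spans.append((start, i, " ".join(tokens[start:i])))
            start = None
        if tag == "B":
            start = i  # open a new span
    if start is not None:
        # Close a span that runs to the end of the sentence.
        spans.append((start, len(tags), " ".join(tokens[start:])))
    return spans


# Hypothetical sentence and tags, chosen only to show an enumeration.
tokens = "We evaluate on MNIST , CIFAR-10 and the Penn Treebank .".split()
tags = ["O", "O", "O", "B", "O", "B", "O", "O", "B", "I", "O"]

for start, end, text in bio_spans(tokens, tags):
    print(start, end, text)
```

Under this assumed tagging, each dataset in the enumeration is its own span, and the multi-word name produces a B followed by an I.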
