Identifying entities from scientific publications: A comparison of vocabulary- and model-based methods

This study evaluates five entity extraction methods on the task of identifying entities in scientific publications: two vocabulary-based methods (one keyword-based and one Wikipedia-based) and three model-based methods (conditional random fields (CRF), CRF with a keyword-based dictionary, and CRF with a Wikipedia-based dictionary). The methods are applied to an annotated test set of computer science publications. Precision, recall, accuracy, area under the ROC curve, and area under the precision-recall curve serve as the evaluation measures. The results show that the model-based methods outperform the vocabulary-based ones, and that CRF with a keyword-based dictionary performs best among them. Of the two vocabulary-based methods, the keyword-based one has higher recall and the Wikipedia-based one has higher precision. These findings help inform informetric research at a more granular level.
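To make the vocabulary-based setting and three of the evaluation measures concrete, the following is a minimal sketch and not the study's implementation. It assumes token-level binary labels (1 for a token inside an entity, 0 otherwise) and a greedy longest-match lookup against a small phrase vocabulary. The function names `tag_with_vocabulary` and `precision_recall_accuracy`, the `max_len` parameter, and the toy sentence are all hypothetical; the abstract does not state the matching strategy or the granularity at which the measures were computed.

```python
from typing import List, Set, Tuple


def tag_with_vocabulary(
    tokens: List[str],
    vocabulary: Set[Tuple[str, ...]],
    max_len: int = 3,
) -> List[int]:
    """Label each token 1 if it falls inside a vocabulary phrase, else 0.

    Uses greedy longest-match lookup (an assumption, not the paper's method).
    """
    lowered = [t.lower() for t in tokens]
    labels = [0] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(lowered[i:i + n]) in vocabulary:
                matched = n
                break
        if matched:
            for j in range(i, i + matched):
                labels[j] = 1
            i += matched
        else:
            i += 1
    return labels


def precision_recall_accuracy(
    gold: List[int], pred: List[int]
) -> Tuple[float, float, float]:
    """Compute token-level precision, recall, and accuracy for binary labels."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)
    tn = sum(1 for g, p in pairs if g == 0 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return precision, recall, accuracy


# Toy example (hypothetical data, for illustration only).
tokens = "We train conditional random fields on a small corpus".split()
vocabulary = {("conditional", "random", "fields"), ("small",)}
gold = [0, 0, 1, 1, 1, 0, 0, 0, 1]  # assumed annotation: "conditional random fields", "corpus"

pred = tag_with_vocabulary(tokens, vocabulary)
precision, recall, accuracy = precision_recall_accuracy(gold, pred)
print(pred, precision, recall, round(accuracy, 3))
```

In this toy case the vocabulary wrongly tags "small" and misses "corpus", so one false positive and one false negative each pull precision and recall below 1. A broader vocabulary would tend to reduce misses at the cost of more spurious matches, which is the kind of trade-off the abstract reports between the keyword-based and Wikipedia-based methods.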
