Identifying entities from scientific publications: A comparison of vocabulary- and model-based methods

This study evaluates five entity extraction methods on the task of identifying entities in scientific publications: two vocabulary-based methods (one keyword-based and one Wikipedia-based) and three model-based methods (conditional random fields (CRF), CRF with a keyword-based dictionary, and CRF with a Wikipedia-based dictionary). The methods are applied to an annotated test set of computer science publications. Precision, recall, accuracy, area under the ROC curve, and area under the precision-recall curve serve as the evaluation metrics. Results show that the model-based methods outperform the vocabulary-based ones, and that CRF with a keyword-based dictionary performs best among the model-based methods. Of the two vocabulary-based methods, the keyword-based one has higher recall and the Wikipedia-based one has higher precision. These findings help inform the understanding of informetric research at a more granular level.
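To make the vocabulary-based setting and two of the metrics concrete, the following is a minimal sketch, not the authors' implementation. It assumes a greedy longest-match lookup against a small phrase vocabulary and scores the output against a gold annotation with precision and recall. The function names (`extract_entities`, `precision_recall`), the toy vocabulary, and the example sentence are all hypothetical and chosen only for illustration.

```python
def extract_entities(text, vocabulary, max_len=4):
    """Greedy longest-match lookup of vocabulary phrases.

    Assumption: text is lowercased and split on whitespace; a real system
    would need proper tokenization and punctuation handling.
    """
    tokens = text.lower().split()
    found = []
    i = 0
    while i < len(tokens):
        matched_len = 0
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in vocabulary:
                found.append(candidate)
                matched_len = n
                break
        # Skip past a match, otherwise advance by one token.
        i += matched_len if matched_len else 1
    return found


def precision_recall(predicted, gold):
    """Set-based precision and recall over extracted entity strings."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical toy data, not taken from the study.
vocabulary = {"conditional random fields", "random fields", "entity extraction"}
text = "We apply conditional random fields to entity extraction in computer science"
gold = {"conditional random fields", "entity extraction", "computer science"}

predicted = extract_entities(text, vocabulary)
precision, recall = precision_recall(predicted, gold)
print(predicted, precision, recall)
```

In this toy case every extracted phrase is correct, but "computer science" is missing from the vocabulary, so precision is high while recall is lower. This is the kind of trade-off the abstract reports between the two vocabulary-based methods.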

Erjia Yan | Yongjun Zhu
