What Does This Acronym Mean? Introducing a New Dataset for Acronym Identification and Disambiguation

Acronyms are short forms of phrases that make lengthy sentences easier to convey in documents, and they serve as one of the mainstays of writing. Because of their importance, identifying acronyms and their corresponding phrases (i.e., acronym identification (AI)) and finding the correct meaning of each acronym (i.e., acronym disambiguation (AD)) are crucial for text understanding. Despite recent progress on these tasks, limitations in the existing datasets hinder further improvement. More specifically, the limited size of manually annotated AI datasets and the noise in automatically created AI datasets obstruct the design of advanced, high-performing acronym identification models. Moreover, the existing datasets are mostly limited to the medical domain and ignore other domains. To address these two limitations, we first create a large, manually annotated AI dataset for the scientific domain. This dataset contains 17,506 sentences, which is substantially larger than previous scientific AI datasets. Next, we prepare an AD dataset for the scientific domain with 62,441 samples, which is significantly larger than the previous scientific AD dataset. Our experiments show that the existing state-of-the-art models fall far behind human-level performance on both datasets proposed in this work. In addition, we propose a new deep learning model that uses the syntactic structure of the sentence to expand an ambiguous acronym in that sentence. The proposed model outperforms the state-of-the-art models on the new AD dataset, providing a strong baseline for future research on this dataset.
