Predicting Chinese Abbreviations with Minimum Semantic Unit and Global Constraints

We propose a new Chinese abbreviation prediction method that incorporates rich local information while generating the abbreviation globally. Unlike previous character tagging methods, we introduce the minimum semantic unit, whose granularity lies between the character and the word, to capture word-level information in the sequence labeling framework. To solve the “character duplication” problem in Chinese abbreviation prediction, we also use a substring tagging strategy to generate local substring tagging candidates. We then use an integer linear programming (ILP) formulation with various constraints to globally decode the final abbreviation from the generated candidates. Experiments show that our method outperforms state-of-the-art systems without using any external resources.
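The global decoding step can be illustrated with a toy sketch. It is not the paper's formulation: the candidate format, the scores, and the two constraints (every character of the full form is covered by exactly one selected candidate, and the result is non-empty and shorter than the full form) are assumptions made for illustration. The 0-1 program is solved by exhaustive enumeration rather than by an ILP solver, which is only practical for a handful of candidates.

```python
from itertools import product

# Hypothetical candidates for the full form "北京大学" (Peking University).
# Each tuple is (start, end, kept_characters, score), with `end` exclusive.
# The scores are made up; in practice they would come from a trained tagger.
CANDIDATES = [
    (0, 2, "北", 1.25),    # substring "北京", keep "北"
    (0, 2, "北京", 0.5),   # substring "北京", keep both characters
    (2, 4, "大", 1.0),     # substring "大学", keep "大"
    (2, 4, "大学", 0.25),  # substring "大学", keep both characters
    (0, 4, "北大", 1.5),   # whole string tagged at once
]

def decode(full_form, candidates):
    """Pick the 0/1 selection of candidates with the highest total score.

    Assumed constraints (illustrative only):
      1. every character position is covered by exactly one selected candidate;
      2. the abbreviation is non-empty and shorter than the full form.
    Returns (abbreviation, score), or (None, -inf) if nothing is feasible.
    """
    n = len(full_form)
    best_abbr, best_score = None, float("-inf")
    # Enumerate every assignment of the binary selection variables.
    for choice in product((0, 1), repeat=len(candidates)):
        chosen = [c for c, x in zip(candidates, choice) if x]
        # Constraint 1: each position covered exactly once.
        cover = [0] * n
        for start, end, _, _ in chosen:
            for i in range(start, end):
                cover[i] += 1
        if any(count != 1 for count in cover):
            continue
        # Assemble the abbreviation in left-to-right order.
        chosen.sort(key=lambda c: c[0])
        abbr = "".join(c[2] for c in chosen)
        # Constraint 2: non-empty and strictly shorter than the full form.
        if not 0 < len(abbr) < n:
            continue
        score = sum(c[3] for c in chosen)
        if score > best_score:
            best_abbr, best_score = abbr, score
    return best_abbr, best_score

print(decode("北京大学", CANDIDATES))
```

With these made-up scores, the two local candidates that keep "北" and "大" together outscore the single whole-string candidate, so the decoder returns "北大". A realistic system would hand the same binary variables and linear constraints to an ILP solver instead of enumerating all selections.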
