Paper information - Building a large dictionary of abbreviations for named entity recognition in Portuguese historical corpora

Abbreviated forms pose a particular challenge in historical corpora: they are frequent, ambiguous, and subject to graphic variation. This paper presents the process of building a large dictionary of historical Portuguese abbreviations, in which each entry records the abbreviation and its expansion together with morphosyntactic and semantic information (the latter drawn from a predefined set of named entities, NEs). The process is hybrid: it combines existing linguistic resources (such as a printed dictionary and lists of abbreviations) with abbreviations extracted from the Historical Dictionary of Brazilian Portuguese (HDBP) corpus via finite-state automata and regular expressions. Besides helping to disambiguate the abbreviations found in the HDBP corpus, the dictionary can be used in other projects and tasks, chiefly NE recognition.
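
To make the entry structure and the regular-expression extraction step concrete, here is a minimal Python sketch. It is not the authors' implementation: the field names, the tag labels (`N`, `PERSON_TITLE`), the sample entries, and the candidate pattern are all illustrative assumptions, and the pattern is deliberately simplified (ASCII letters only, one capitalised token of at most four letters followed by a period).

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class AbbreviationEntry:
    abbreviation: str
    expansion: str
    pos: str          # morphosyntactic tag (hypothetical label)
    ne_category: str  # named-entity category (hypothetical label)


# Illustrative sample entries only; not taken from the paper's dictionary.
# "D." has two expansions to show why an entry list is kept per abbreviation.
DICTIONARY = {
    "Sr.": [AbbreviationEntry("Sr.", "Senhor", "N", "PERSON_TITLE")],
    "Cap.": [AbbreviationEntry("Cap.", "Capitão", "N", "PERSON_TITLE")],
    "D.": [
        AbbreviationEntry("D.", "Dom", "N", "PERSON_TITLE"),
        AbbreviationEntry("D.", "Dona", "N", "PERSON_TITLE"),
    ],
}

# Simplified candidate pattern (an assumption): a capital letter, up to three
# lowercase letters, then a period. A real pattern would also need to handle
# accented letters, superscripts, and the graphic variants of historical text.
CANDIDATE = re.compile(r"\b[A-Z][a-z]{0,3}\.")


def extract_candidates(text):
    """Return candidate abbreviations in order of appearance."""
    return [match.group(0) for match in CANDIDATE.finditer(text)]


def lookup(abbreviation):
    """Return all known entries for an abbreviation (empty list if unknown)."""
    return DICTIONARY.get(abbreviation, [])
```

Calling `extract_candidates` on a sentence and passing each result to `lookup` yields every candidate expansion; an ambiguous form such as `D.` returns more than one entry, which is where the morphosyntactic and NE information would be needed to choose among them.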
