Building a large dictionary of abbreviations for named entity recognition in Portuguese historical corpora

Abbreviated forms pose a particular challenge in historical corpora: they are frequent, ambiguous, and subject to graphic variation. This paper presents the process of building a large dictionary of historical Portuguese abbreviations, in which each entry records the abbreviation, its expansion, and morphosyntactic and semantic information (a predefined set of named entities, NEs). The process is hybrid: it combines linguistic resources (such as a printed dictionary and lists of abbreviations) with abbreviations extracted from the Historical Dictionary of Brazilian Portuguese (HDBP) corpus via finite-state automata and regular expressions. Besides helping to disambiguate the abbreviations found in the HDBP corpus, the dictionary can be used in other projects and tasks, chiefly NE recognition.
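As a rough illustration of the entry structure and the regular-expression extraction step described above, the following Python sketch may help. It is not the tooling used in the project: the class `AbbreviationEntry`, the field names, the tag values, the pattern `CANDIDATE`, and the sample sentence are all hypothetical and chosen only for illustration.

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AbbreviationEntry:
    """One dictionary entry (field names and tag values are hypothetical)."""
    abbreviation: str
    expansion: str
    pos: str           # morphosyntactic information, e.g. "N" for noun
    ne_category: str   # semantic information: a named-entity label


# Hypothetical candidate pattern: a token of 1 to 5 letters followed by a
# period. It over-generates (a short word ending a sentence also matches),
# so candidates still need to be checked against the dictionary.
CANDIDATE = re.compile(r"\b[^\W\d_]{1,5}\.")


def extract_candidates(text):
    """Return abbreviation candidates in the order they appear in text."""
    return CANDIDATE.findall(text)


def build_index(entries):
    """Map each abbreviation to all of its entries; one form may be ambiguous."""
    index = defaultdict(list)
    for entry in entries:
        index[entry.abbreviation].append(entry)
    return index


entries = [
    AbbreviationEntry("Sr.", "Senhor", "N", "PERSON_TITLE"),
    AbbreviationEntry("D.", "Dom", "N", "PERSON_TITLE"),
    AbbreviationEntry("D.", "Dona", "N", "PERSON_TITLE"),
]
index = build_index(entries)

for candidate in extract_candidates("O Sr. Joao e D. Pedro chegaram."):
    expansions = [entry.expansion for entry in index.get(candidate, [])]
    print(candidate, expansions)
```

Keeping a list of entries per abbreviation reflects the ambiguity noted above: a single form such as "D." can have more than one expansion, and choosing among them requires context that this sketch does not model.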
