Building a Corpus-based Historical Portuguese Dictionary: Challenges and Opportunities

Historical corpora are important resources for several fields, including Philology, Human Language Technology, Literary Studies, History, and Lexicography. Compiling them, however, differs from compiling contemporary corpora: corpus designers must deal with characteristics inherent in historical texts, such as the absence of a spelling standard, pervasive use of abbreviations and their spelling variants, missing spaces between words, irregular hyphenation, and non-standard typographical symbols. This paper addresses the challenges of processing the corpus designed for the Historical Dictionary of Brazilian Portuguese (HDBP) project, which comprises texts from the sixteenth century through the beginning of the nineteenth century, and presents the solutions found to support the compilation of a historical Portuguese dictionary based on this corpus.

[1]  Yaakov HaCohen-Kerner,et al.  Baseline Methods for Automatic Disambiguation of Abbreviations in Jewish Law Documents , 2004, EsTAL.

[2]  Maria Helena Ochi Flexor,et al.  Abreviaturas : manuscritos dos séculos XVI ao XIX , 2008 .

[3]  Uta Grothkopf,et al.  Historical Astrolexicography and Old Publications , 1998 .

[4]  Dawn Archer,et al.  The Identification of Spelling Variants in English and German Historical Texts: Manual or Automatic? , 2008, Lit. Linguistic Comput..

[5]  Henrik Køhler Simonsen CorpLex: Blueprints of a corporate dictionary and editing system , 2006 .

[6]  Jón Hilmar Jónsson,et al.  Using a Computer Corpus to Supplement a Citation Collection for a Historical Dictionary , 1993 .

[7]  V. Skretkowicz Dictionary of the Scots language , 2004 .

[8]  Diana Santos,et al.  Ambientes de processamento de corpora em português: Comparação entre dois sistemas , 1999 .

[9]  Rafael Giusti,et al.  Automatic detection of spelling variation in historical corpus An application to build a Brazilian Portuguese spelling variants dictionary , 2007 .

[10]  Klaus U. Schulz,et al.  Information Access to Historical Documents from the Early New High German Period , 2006, Digital Historical Corpora.

[11]  Alan Johnson,et al.  Preface , 2021, Journal of Antimicrobial Chemotherapy.

[12]  Marcelo Finger,et al.  Aprendizado de regras de substituição para normatização de textos históricos , 2004 .

[13]  S. Lukas Challenges in Modelling a Richly Annotated Diachronic Corpus of German , 2004 .

[14]  Marti A. Hearst,et al.  A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text , 2002, Pacific Symposium on Biocomputing.

[15]  Sandra Maria Aluísio,et al.  A terminologia na era da informática , 2006 .

[16]  Serguei V. S. Pakhomov Semi-Supervised Maximum Entropy Based Approach to Acronym and Abbreviation Normalization in Medical Texts , 2002, ACL.

[17]  Sandra Maria Aluísio,et al.  Procorph: um sistema de apoio à criação de dicionários históricos , 2008 .

[18]  Udi Manber,et al.  Fast text searching: allowing errors , 1992, CACM.

[19]  Dawn Archer,et al.  VARD versus WORD: A comparison of the UCREL variant detector and modern spellcheckers on English historical corpora , 2005 .

[20]  J. E. Lighter,et al.  Random House historical dictionary of American slang , 1997 .

[21]  J. R. Ruiz,et al.  Recopilación y estructuración del vocabulario de especialidad en el "Nuevo Diccionario Histórico del Español" (RAE) , 2008 .

[22]  Dana Dannélls,et al.  Automatic Acronym Recognition , 2006, EACL.

[23]  Arnaldo Candido Junior Criação de um ambiente para o processamento de córpus de Português Histórico , 2008 .

[24]  George Hripcsak,et al.  Mapping abbreviations to full forms in biomedical articles. , 2002, Journal of the American Medical Informatics Association : JAMIA.

[25]  Sandra M. Aluísio,et al.  The Lácio-Web: Corpora and Tools to Advance Brazilian Portuguese Language Investigations and Computational Linguistic Tools , 2004, LREC.

[26]  Alexander M. Robertson,et al.  Word Variant Identification in Old French , 1997, Inf. Res..

[27]  Gregory R. Crane,et al.  The challenge of virginia banks: an evaluation of named entity analysis in a 19th-century newspaper collection , 2006, Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '06).

[28]  Margaret King,et al.  Evaluation of natural language processing systems , 1991 .

[29]  Thorsten Trippel,et al.  Building a historical corpus for Classical Portuguese: some technological aspects , 2006, LREC.

[30]  Paul Edward Rayson,et al.  Matrix : a statistical method and software tool for linguistic analysis through corpus comparison , 2003 .

[31]  Jeffrey A. Rydberg-Cox Automatic disambiguation of Latin abbreviations in early modern texts for humanities digital libraries , 2003, 2003 Joint Conference on Digital Libraries, 2003. Proceedings..

[32]  Stefan Th. Gries,et al.  What is Corpus Linguistics? , 2009, Lang. Linguistics Compass.

[33]  Maria das Graças Volpe Nunes,et al.  DIADORIM - A Lexical Database for Brazilian Portuguese , 2002, LREC.

[34]  Weiss,et al.  Text Mining , 2010 .

[35]  Takenobu Tokunaga,et al.  Automatic expansion of abbreviations by using context and character information , 2004, Inf. Process. Manag..

[36]  Oto Vale,et al.  Building a large dictionary of abbreviations for named entity recognition in Portuguese historical corpora , 2008, LREC 2008.