ATALA 59 AUTOMATED CREATION OF A PARTIALLY SYNTACTICALLY ANNOTATED CORPUS OF MEDIEVAL PORTUGUESE USING CONTEMPORARY PORTUGUESE RESOURCES

Our ultimate goal was to syntactically annotate Medieval Portuguese (MP) texts from the 13th and 14th centuries. However, few language-specific computational resources were available for the task: no grammar existed (or exists today), and only a very small body of lexical knowledge had been prepared. We therefore took the innovative stance of reusing existing resources for another language, namely a wide-coverage grammar of Contemporary Portuguese (CP), on the hypothesis that the grammars of the two languages overlap substantially. This hypothesis proved correct, and a good-quality, partially tree-annotated corpus of MP was produced. In fact, MP and CP differ mainly in their lexicons, especially in spelling: MP spelling is much less standardized, with many words having two, three or more different spellings. MP also uses grammatical constructions that are no longer current, such as subject-object-verb word order (in CP the canonical order is subject-verb-object). We were able to implement this cross-language reuse strategy because we developed and applied methods of non-deterministic lexical analysis and partial syntactic analysis, which are described in detail in section 2. Whereas other authors (Marcus M. et al. 1993, Hobbs J. et al. 1997) use resources that are language-specific and even domain-specific (law, technical, news, etc.), our approach is domain-independent and robust to language variation. The system works better if the input text is POS-tagged, and POS tagging is relatively easy to obtain in any language with current tagging technology. Our method is therefore appropriate for bootstrapping in languages for which few or no computational resources exist.
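To make the idea of non-deterministic lexical analysis over variable spellings more concrete, the following is a minimal sketch and not the system described in this paper. It assumes a small set of hypothetical spelling alternations (`RULES`) and a toy CP lexicon (`CP_LEXICON`), both invented here for illustration. Each MP word form is expanded into all spellings reachable through the alternations, and every candidate found in the CP lexicon is kept rather than committing to a single analysis.

```python
from itertools import product

# Hypothetical spelling alternations (illustrative only, not taken from the paper).
# Each key is a substring of the MP form; the value lists what it may stand for.
RULES = {
    "y": ("y", "i"),
    "ff": ("ff", "f"),
    "ll": ("ll", "l"),
}

# Toy CP lexicon (assumed for illustration): word form -> set of POS tags.
CP_LEXICON = {
    "rei": {"N"},
    "fazer": {"V"},
    "ele": {"PRON"},
}


def candidate_spellings(word):
    """Return all spellings obtained by choosing one alternative at each rule site."""
    segments = []
    i = 0
    while i < len(word):
        # Try longer rule keys first so "ff" is matched before any single letter.
        for src in sorted(RULES, key=len, reverse=True):
            if word.startswith(src, i):
                segments.append(RULES[src])
                i += len(src)
                break
        else:
            # No rule applies here: the character is kept unchanged.
            segments.append((word[i],))
            i += 1
    return {"".join(choice) for choice in product(*segments)}


def analyse(word):
    """Non-deterministic lookup: keep every lexicon entry reachable from the word."""
    return {
        form: CP_LEXICON[form]
        for form in candidate_spellings(word.lower())
        if form in CP_LEXICON
    }
```

Under these assumptions, `analyse("rey")` and `analyse("ffazer")` each reach one CP entry, while a form with no reachable entry yields an empty result and could be left for partial syntactic analysis to skip over. A real system would need alternations derived from attested MP spellings and a full CP lexicon.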