Paper information - ATALA 59 AUTOMATED CREATION OF A PARTIALLY SYNTACTICALLY ANNOTATED CORPUS OF MEDIEVAL PORTUGUESE USING CONTEMPORARY PORTUGUESE RESOURCES


Our ultimate goal was to syntactically annotate Medieval Portuguese (MP) texts of the 13th and 14th centuries. However, very few language-specific computational resources were available for the task: no grammar existed (or exists today), and only a very small body of lexical knowledge had been prepared. We therefore took the innovative approach of reusing existing resources of another language, namely a wide-coverage grammar of Contemporary Portuguese (CP), on the hypothesis that the grammars of the two languages overlap substantially. This hypothesis has proved correct, and a good-quality, partially tree-annotated corpus of MP was produced. In fact, MP and CP differ mainly in their lexicons, especially in word spelling: MP takes more liberties with spelling, and many words have two, three or more different spellings. MP also uses grammatical constructions that are now obsolete, such as subject-object-verb word order (in CP the canonical order is subject-verb-object). We were able to implement this cross-language reuse strategy because we developed and applied methods of non-deterministic lexical analysis and partial syntactic analysis. In section 2 we describe these methods in detail. Whereas other authors (Marcus et al. 1993; Hobbs et al. 1997) use resources that are language-specific and even domain-specific (law, technical, news, etc.), our approach is domain-independent and robust to language variation. The system works better when the input text is POS-tagged, and POS tagging is relatively easy to obtain in any language with current tagging technology. Our method is therefore appropriate for bootstrapping in languages for which few or no computational resources exist.

Vitor ROCIO | Mário A. ALVES | José Gabriel P. LOPES | Maria F. XAVIER | Graça VICENTE

[1] José Gabriel Pereira Lopes, et al. Learning Verbal Transitivity Using LogLinear Models, 1998, ECML.

[2] Yves Schabes, et al. The Lexical Analysis of Natural Languages, 1997.

[3] Stuart M. Shieber, et al. An Introduction to Unification-Based Approaches to Grammar, 1986, CSLI Lecture Notes.

[4] Douglas E. Appelt, et al. FASTUS: A Cascaded Finite-State Transducer for Extracting Information from Natural-Language Text, 1997, ArXiv.

[5] Éric Villemonte de la Clergerie, et al. LPDA: Another look at Tabulation in Logic Programming, 1994, ICLP.

[6] José Gabriel Pereira Lopes, et al. Partial Parsing, Deduction and Tabling, 1998, TAPD.

[7] Beatrice Santorini, et al. Building a Large Annotated Corpus of English: The Penn Treebank, 1993, CL.