Very high accuracy rule-based nominal lemmatization with a minimal lexicon

In natural language processing, lemmatization is the procedure by which an inflectionally normalized form (the lemma) is automatically assigned to each word form. This paper addresses the specific task of Nominal Lemmatization, targeting only Adjectives and Common Nouns. Tokens from the other nominal categories (Articles, Demonstratives, etc.) form a closed list and can therefore be lemmatized by a simple list look-up. Our concern is thus the tokens from the open nominal categories. The paper describes a shallow-processing, rule-based algorithm for Nominal Lemmatization in Portuguese that relies on minimal word lists, and it presents evaluation results obtained with an efficient implementation of this algorithm.

Section 2 describes the lemmatization task in greater detail, along with the issues it raises. Section 3 outlines the shallow processing algorithm, and Section 4 covers the methods used to minimize the required lexicon. Section 5 presents some of the harder cases, which are caused by ambiguity. Section 6 reports evaluation results together with details on the performance of the implementation. Section 7 provides links to on-line demos and services that use the lemmatizer. Finally, Section 8 discusses the results and presents some prospects for future work.
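To make the general idea of rule-based lemmatization with a small exception list more concrete, the following is a minimal illustrative sketch. It is not the algorithm described in the paper: the rule set, the exception entries, and the names `RULES`, `EXCEPTIONS`, and `lemmatize` are all assumptions chosen for illustration, and the sketch assumes the lemma is the masculine singular form where one exists.

```python
# Hypothetical suffix rules (not taken from the paper): each pair is
# (suffix to strip, replacement). Rules are tried in order, so longer
# suffixes are listed before shorter ones.
RULES = [
    ("ões", "ão"),  # plural in -ões -> singular in -ão
    ("as", "o"),    # feminine plural -> masculine singular
    ("os", "o"),    # masculine plural -> masculine singular
    ("a", "o"),     # feminine singular -> masculine singular
    ("s", ""),      # remaining plurals: drop the final -s
]

# Hypothetical minimal exception list: forms for which the rules above
# would produce a wrong lemma, mapped directly to the correct one.
EXCEPTIONS = {
    "casa": "casa",    # feminine noun with no masculine counterpart
    "casas": "casa",
    "lápis": "lápis",  # invariant noun ending in -s
}


def lemmatize(token):
    """Return an assumed lemma for a Portuguese noun or adjective form."""
    word = token.lower()
    # The exception list takes priority over the rules.
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    # Apply the first rule whose suffix matches, leaving a non-empty stem.
    for suffix, replacement in RULES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)] + replacement
    # No rule applies: the form is taken to be its own lemma.
    return word


for form in ["gatas", "altos", "leões", "casas", "livro"]:
    print(form, "->", lemmatize(form))
```

In a sketch like this, the rules cover the regular inflection patterns, so only the forms that the rules would get wrong need to be listed, which is what keeps the word list small.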

A. Branco | J. Silva