Unified Lexicon and Unified Morphosyntactic Specifications for Written and Spoken Italian

This paper has two goals: (1) to illustrate a specific procedure for merging different monolingual lexicons, focusing on techniques for detecting and mapping equivalent lexical entries, and (2) to sketch a production model for obtaining lexical resources by unifying existing data. We describe the creation of a Unified Lexicon (UL) from a common sample of the Italian PAROLE/SIMPLE/CLIPS phonological lexicon and the Italian LCSTAR pronunciation lexicon. We extend previous experiments carried out at ILC-CNR: building on a detailed mechanism for mapping the grammatical classifications of candidate UL entries, we propose a consensual set of Unified Morphosyntactic Specifications (UMS) shared by lexicons for the written and spoken areas. We also analyse the impact of the UL on cross-validation: examining the conflicts that arise during unification makes it possible to detect mismatches and diverging classifications in both resources. The work is in line with the activities promoted by ELRA towards developing methods for packaging new language resources by combining independently created ones, and was carried out as part of the ELRA Production Committee activities. ELRA aims to build on the UL experience to carry out similar merging activities for resources available in the ELRA catalogue, in order to meet users' needs.
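The merging idea summarised above can be illustrated with a minimal sketch. It is not the authors' procedure: the tag labels, the two mapping tables, the `Entry` type, and the `unify` function are all hypothetical, and entries are treated as equivalent simply when they share a lemma. Each resource's part-of-speech label is first mapped onto a shared label set (standing in for the UMS), entries agreeing on a shared label enter the unified lexicon, and lemmas whose classifications diverge are reported as conflicts for cross-validation.

```python
from dataclasses import dataclass

# Hypothetical mappings from resource-specific tags to a shared label set.
# Neither the source tags nor the shared labels are taken from the paper.
WRITTEN_TO_UMS = {"N": "NOUN", "V": "VERB", "A": "ADJ"}
SPOKEN_TO_UMS = {"NOM": "NOUN", "VER": "VERB", "ADJ": "ADJ"}


@dataclass(frozen=True)
class Entry:
    lemma: str
    pos: str  # resource-specific part-of-speech label


def index_by_lemma(entries, mapping):
    """Group the mapped (shared) labels of each lemma into a set."""
    index = {}
    for entry in entries:
        index.setdefault(entry.lemma, set()).add(mapping[entry.pos])
    return index


def unify(written, spoken):
    """Return (unified entries, conflicts) for lemmas present in both lexicons."""
    written_index = index_by_lemma(written, WRITTEN_TO_UMS)
    spoken_index = index_by_lemma(spoken, SPOKEN_TO_UMS)

    unified, conflicts = [], []
    for lemma in sorted(written_index.keys() & spoken_index.keys()):
        w_labels, s_labels = written_index[lemma], spoken_index[lemma]
        # Labels both resources agree on become unified entries.
        for label in sorted(w_labels & s_labels):
            unified.append((lemma, label))
        # Any difference in classification is flagged for manual review.
        if w_labels != s_labels:
            conflicts.append((lemma, sorted(w_labels), sorted(s_labels)))
    return unified, conflicts


written = [Entry("casa", "N"), Entry("bello", "A"), Entry("parlare", "V")]
spoken = [Entry("casa", "NOM"), Entry("bello", "NOM"), Entry("parlare", "VER")]

unified, conflicts = unify(written, spoken)
print(unified)    # entries on which both resources agree
print(conflicts)  # lemmas with diverging classifications
```

In this toy sample, "casa" and "parlare" are unified, while "bello" is reported as a conflict because the two hypothetical resources classify it differently. A real merge would need a richer equivalence test than lemma identity, for example one that also compares morphological features.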
