TRANSCRIBING NAMES WITH FOREIGN ORIGIN IN THE ONOMASTICA PROJECT

This paper studies the problem of transcribing foreign names. The transcriptions of first names in five languages have been studied to show examples of how this problem has been dealt with in the Onomastica Multi-Lingual Pronunciation Dictionary of European names. The paper describes this dictionary and the methods used to do the automatic transcriptions for the Swedish part.

INTRODUCTION

Names have a different morphology and phonology compared to ordinary words. This is the reason why the normal letter-to-sound rules used in general text-to-speech systems are inadequate for the transcription of proper names. To deal with the name pronunciation problem, name transcription rules and a name dictionary have to be developed. The objective of the Onomastica project is to produce such rules and a dictionary of European names that will be published on a CD-ROM. This paper will present the problems encountered in the work on this project, and how these have been solved. The transcriptions of first names in five languages are examined to illustrate the problem. The Swedish name transcription system will be presented as well.

THE ONOMASTICA DATABASE

The objective of the ONOMASTICA project, funded by the LRE programme, is to build a quality-controlled, multilingual pronunciation dictionary of proper names in Europe. The project covers eleven languages: Danish, Dutch, English, French, German, Greek, Italian, Norwegian, Portuguese, Spanish and Swedish. Transcriptions of up to 1,000,000 names per language will be produced in a semi-automatic way. The ultimate pronunciation dictionary should include a carefully verified transcription of each name, but due to the limited resources only a subset of the name list can be transcribed and verified manually. The names are transcribed in three different quality bands, where the first band includes transcriptions judged to be correct for some owners of the name.
The second band gives transcriptions that are acceptable to a native speaker/listener. The third band contains names that have been transcribed automatically, without manual checking. The names in bands I and II were chosen according to their frequency in the telephone directory, so that a cumulative coverage of at least 80% was obtained. From the Swedish database, described in Table 1, the names that occurred more than five times were selected for transcription in band I, giving a cumulative coverage ranging from close to 95% for surnames to 100% for town names (almost all places have more than five subscribers).

Table 1. The Swedish Name Database

Name category    # of names    Names with frequency >5
Surnames             228048                      46859
Place names            6373                       6120
Titles                27055                       5370
Street names          65196                      39822
First names            6085
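The frequency-based selection described above can be illustrated with a minimal sketch. It assumes that cumulative coverage means the share of all directory entries whose name falls in the selected set; the function names and the toy directory are hypothetical and are not taken from the Onomastica tooling.

```python
from collections import Counter


def select_by_frequency(counts, min_occurrences=5):
    """Return the names that occur more than `min_occurrences` times."""
    return {name for name, n in counts.items() if n > min_occurrences}


def cumulative_coverage(counts, selected):
    """Fraction of all directory entries whose name is in `selected`."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(n for name, n in counts.items() if name in selected)
    return covered / total


# Toy, made-up directory entries (one list item per subscriber).
directory = (
    ["Andersson"] * 12
    + ["Johansson"] * 8
    + ["Lindqvist"] * 6
    + ["Zetterholm"] * 2
    + ["Qvarnström"]
)
counts = Counter(directory)

selected = select_by_frequency(counts)            # names occurring > 5 times
coverage = cumulative_coverage(counts, selected)  # 26 of 29 entries

print(sorted(selected), round(coverage, 3))
# prints: ['Andersson', 'Johansson', 'Lindqvist'] 0.897
```

In this toy example three of the five distinct names cover about 90% of the entries, which mirrors how a small share of frequent names can account for most subscribers.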