Enhanced Japanese Electronic Dictionary Look-up

This paper describes the process of data preparation and reading generation for an ongoing project aimed at improving the accessibility of unknown words for learners of foreign languages, focusing initially on Japanese. Rather than requiring absolute knowledge of the readings of words in the foreign language, we allow look-up of dictionary entries by readings that learners can predictably be expected to associate with them. We automatically extract an exhaustive set of phonemic readings for each grapheme segment and learn basic morpho-phonological rules governing compound word formation, associating a probability with each reading and rule. We then apply a naive Bayes model to generate a set of candidate readings for each entry and assign each a likelihood score based on the previously extracted evidence and corpus frequencies.
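
As a rough illustration of the scoring step, the following Python sketch combines per-segment reading probabilities under a naive independence assumption, so that the score of a candidate reading is the product of the probabilities of its parts. The table `SEGMENT_READINGS`, its probability values, and the function `generate_readings` are hypothetical names and toy numbers introduced here for illustration; the sketch leaves out the morpho-phonological rules and corpus frequencies mentioned above and should not be read as the actual implementation.

```python
from itertools import product

# Hypothetical toy data: P(reading | grapheme segment).
# The values are invented for illustration, not taken from the paper.
SEGMENT_READINGS = {
    "発": {"hatsu": 0.75, "hotsu": 0.25},
    "表": {"hyou": 0.5, "omote": 0.5},
}


def generate_readings(segments, table=SEGMENT_READINGS):
    """Return (reading, score) pairs sorted by descending score.

    Each segment is assumed to contribute its reading independently,
    so a candidate's score is the product of its segment probabilities.
    """
    options = [table[segment].items() for segment in segments]
    candidates = []
    for combo in product(*options):
        reading = "".join(r for r, _ in combo)
        score = 1.0
        for _, probability in combo:
            score *= probability
        candidates.append((reading, score))
    # Highest-scoring candidates first; ties keep generation order.
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates


if __name__ == "__main__":
    for reading, score in generate_readings(["発", "表"]):
        print(f"{reading}\t{score:.3f}")
```

In this simplified form, every combination of segment readings becomes a searchable candidate, including ones a learner might plausibly guess but that are not the correct reading of the word. A fuller version would presumably rescale these scores using the rule probabilities and corpus frequencies.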