Hybrid word sense disambiguation using language resources for transliteration of Arabic numerals in Korean

Arabic numerals occur frequently in informative texts, and their multiple senses and readings degrade the accuracy of text-to-speech (TTS) systems. This paper presents a hybrid word sense disambiguation method that exploits a tagged corpus and a Korean wordnet, KorLex 1.0, to convert Arabic numerals into Korean phonemes correctly and efficiently according to their senses. Individual contextual features are extracted from the tagged corpus and grouped in order to determine the sense of each Arabic numeral. The least upper bound synsets among the common hypernyms of the contextual features are obtained from the KorLex hierarchy and used as the semantic categories of those features. A decision tree is trained on these semantic classes to classify the meaning and the reading of Arabic numerals and to compose grapheme-to-phoneme rules for an automatic transliteration system for Arabic numerals. The proposed system outperforms the customized TTS systems by 3.9% to 20.3%.
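
As a rough illustration of the "least upper bound synset" step, the sketch below finds the most specific common hypernym of a group of contextual features. The hierarchy, the synset names, and the function names are hypothetical stand-ins invented for this example; they are not taken from KorLex 1.0. For simplicity the sketch also assumes each synset has a single hypernym, which a real wordnet does not guarantee.

```python
# Toy hypernym hierarchy (child -> parent). Hypothetical; not KorLex data.
HYPERNYM = {
    "hour": "time_unit",
    "minute": "time_unit",
    "time_unit": "unit_of_measurement",
    "kilogram": "mass_unit",
    "mass_unit": "unit_of_measurement",
    "unit_of_measurement": "entity",
    "floor": "structure_part",
    "structure_part": "entity",
}


def hypernym_path(synset):
    """Return the chain from a synset up to the root, starting with the synset itself."""
    path = [synset]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path


def least_upper_bound(synsets):
    """Return the most specific synset that is an ancestor (or self) of every input."""
    paths = [hypernym_path(s) for s in synsets]
    common = set(paths[0]).intersection(*paths[1:])
    # Walking the first path bottom-up, the first shared node is the deepest one.
    for node in paths[0]:
        if node in common:
            return node
    return None


if __name__ == "__main__":
    print(least_upper_bound(["hour", "minute"]))
    print(least_upper_bound(["hour", "kilogram"]))
    print(least_upper_bound(["hour", "floor"]))
```

Under these assumptions, features such as "hour" and "minute" collapse into one semantic category, which could then serve as a single input attribute for a decision tree learner instead of many sparse word-level features.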
