Validating Transliteration Hypotheses Using the Web: Web Counts vs. Web Mining

We describe a novel approach for validating transliteration hypotheses based on a Web mining technique. We implemented a machine transliteration system and generated Chinese, Japanese, and Korean transliteration hypotheses for given English words. We then mined the Web for features relevant to validating these hypotheses. Finally, we validated the hypotheses using machine learning algorithms trained on the mined features. In a comparison against validation based on Web counts, our Web mining method consistently performed better, regardless of the language.
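The contrast between the two validation strategies can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the hypothesis names, the hit counts, the two feature values, and the choice of a perceptron as the learner are hypothetical and are not taken from the system described above. The Web-count baseline simply picks the hypothesis with the most page hits, while the Web-mining variant trains a classifier on mined features and accepts only hypotheses that it scores as valid.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    text: str
    web_count: int         # hypothetical page-hit count for the hypothesis alone
    features: list[float]  # hypothetical mined features, scaled to [0, 1]


def rank_by_web_count(hyps):
    """Web-count baseline: return the hypothesis with the most hits."""
    return max(hyps, key=lambda h: h.web_count)


def linear_score(w, b, x):
    """Dot product of weights and features, plus bias."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_perceptron(examples, epochs=20, lr=1.0):
    """Train a perceptron on (features, label) pairs with labels +1 / -1.

    The perceptron is a stand-in for whichever learner is actually used.
    """
    w = [0.0] * len(examples[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, y in examples:
            if y * linear_score(w, b, x) <= 0:  # misclassified: update
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def validate(hyps, w, b):
    """Web-mining variant: keep hypotheses the classifier scores as valid."""
    return [h for h in hyps if linear_score(w, b, h.features) > 0]


# Made-up training data: [co-occurrence ratio with the source word,
# ratio of pages showing the pair in a parenthetical pattern].
training = [
    ([0.9, 0.8], +1),
    ([0.7, 0.6], +1),
    ([0.1, 0.0], -1),
    ([0.2, 0.1], -1),
]
w, b = train_perceptron(training)

# Made-up candidates: hyp_A is frequent on the Web but rarely appears
# alongside the source word; hyp_B is rarer but co-occurs with it often.
candidates = [
    Hypothesis("hyp_A", web_count=120_000, features=[0.05, 0.0]),
    Hypothesis("hyp_B", web_count=3_400, features=[0.8, 0.7]),
]

print("Web-count choice:", rank_by_web_count(candidates).text)
print("Web-mining choice:", [h.text for h in validate(candidates, w, b)])
```

In this toy setting the two strategies disagree because raw frequency says nothing about whether a string is actually used as a transliteration of the source word. The mined features are meant to capture that relationship.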
