Multi-View Co-Training of Transliteration Model

This paper presents a new approach to training a transliteration model from unlabeled data for transliteration extraction. We first formulate the transliteration model as a multi-view problem, in which different transliteration strategies serve as views and each view exploits a natural division of transliteration features, such as phoneme-based, grapheme-based, or hybrid features. We then introduce a multi-view co-training algorithm that leverages compatible and partially uncorrelated information across the views to boost the model using unlabeled data. Applied to transliteration extraction, the algorithm not only circumvents the need for data labeling but also achieves performance close to that of supervised learning, which requires manual labeling of all training samples.
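The general shape of a two-view co-training loop can be sketched as follows. This is a minimal illustration, not the paper's algorithm: the abstract does not specify the classifier, the confidence measure, or how training is bootstrapped. The sketch assumes a small labeled seed set, a nearest-centroid classifier per view, and candidate pairs represented by two hypothetical scores (view 0 as a phoneme-based similarity, view 1 as a grapheme-based similarity). All function names and data values are invented for illustration.

```python
from statistics import mean

def train_centroids(labeled, view):
    """Fit a 1-D nearest-centroid classifier on one view of the labeled set."""
    pos = [x[view] for x, y in labeled if y == 1]
    neg = [x[view] for x, y in labeled if y == 0]
    return mean(pos), mean(neg)

def predict(model, value):
    """Return (label, confidence); confidence is the gap between centroid distances."""
    pos_c, neg_c = model
    d_pos, d_neg = abs(value - pos_c), abs(value - neg_c)
    label = 1 if d_pos < d_neg else 0
    return label, abs(d_pos - d_neg)

def co_train(seed, unlabeled, rounds=5, per_round=1):
    """Each view in turn labels its most confident pool items for the shared training set."""
    labeled = list(seed)
    pool = list(unlabeled)
    for _ in range(rounds):
        for view in (0, 1):
            if not pool:
                return labeled, pool
            # Retrain this view on everything labeled so far, including
            # examples that the other view contributed.
            model = train_centroids(labeled, view)
            ranked = sorted(pool, key=lambda x: predict(model, x[view])[1], reverse=True)
            for x in ranked[:per_round]:
                label, _ = predict(model, x[view])
                labeled.append((x, label))
                pool.remove(x)
    return labeled, pool

# Hypothetical data: (phoneme-based score, grapheme-based score) per candidate pair.
seed = [((0.9, 0.8), 1), ((0.1, 0.2), 0)]          # assumed small labeled seed
unlabeled = [(0.8, 0.9), (0.2, 0.1), (0.7, 0.6), (0.3, 0.35)]

labeled, pool = co_train(seed, unlabeled)
for features, label in labeled:
    print(features, label)
```

Because each view is retrained on examples the other view labeled, information flows between the views, which is the mechanism the abstract describes as leveraging compatible and partially uncorrelated information.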
