Probabilistic N-gram language model for SMS Lingo

This paper presents a pioneering step in designing a bigram-based decoder for SMS lingo. In the last few years, significant increases in the computational power and storage capacity of computers, together with the availability of large volumes of bilingual data, have made Statistical Machine Translation (SMT) a practical technology. The paper combines a bigram language model (LM) with an SMT decoder that translates a sentence written with SMS short forms into its long-form equivalent. Results on a development set and a test set are analyzed and discussed. The main objective of the project is to analyze how efficiency improves as the size of the bilingual corpus increases.
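To make the idea of a bigram LM guiding short-form expansion concrete, the following is a minimal sketch and not the paper's actual system. The toy corpus, the short-form lexicon, the add-one smoothing, and the function names `train_bigrams`, `bigram_prob`, and `decode` are all illustrative assumptions; the paper's decoder, training data, and smoothing method are not described in the abstract.

```python
import math
from collections import Counter

# Hypothetical toy corpus of long-form sentences (an assumption, not the paper's data).
CORPUS = [
    "see you tomorrow",
    "see you later",
    "are you coming tomorrow",
    "thank you for coming",
]

# Hypothetical lexicon mapping each SMS short form to candidate long forms.
LEXICON = {
    "c": ["see", "sea"],
    "u": ["you"],
    "2moro": ["tomorrow"],
    "l8r": ["later"],
}


def train_bigrams(sentences):
    """Count unigrams and bigrams, with sentence boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_prob(prev, word, unigrams, bigrams):
    """P(word | prev) with add-one smoothing (assumed here for simplicity)."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)


def decode(sms, unigrams, bigrams):
    """Pick the expansion sequence with the highest bigram log-probability.

    Dynamic programming over candidates: for each candidate word, keep only
    the best-scoring path that ends in it.
    """
    best = {"<s>": (0.0, [])}  # last word -> (log-probability, path so far)
    for token in sms.split():
        # Tokens missing from the lexicon are passed through unchanged.
        candidates = LEXICON.get(token, [token])
        step = {}
        for cand in candidates:
            score, path = max(
                (
                    (s + math.log(bigram_prob(prev, cand, unigrams, bigrams)), p)
                    for prev, (s, p) in best.items()
                ),
                key=lambda item: item[0],
            )
            step[cand] = (score, path + [cand])
        best = step
    # Include the end-of-sentence transition when choosing the final path.
    _, path = max(
        (
            (s + math.log(bigram_prob(prev, "</s>", unigrams, bigrams)), p)
            for prev, (s, p) in best.items()
        ),
        key=lambda item: item[0],
    )
    return " ".join(path)


unigrams, bigrams = train_bigrams(CORPUS)
print(decode("c u 2moro", unigrams, bigrams))
```

In this sketch the bigram counts favor "see you" over "sea you", so the ambiguous short form "c" is resolved by its context. A larger corpus would supply more such contexts, which is the kind of effect the corpus-size analysis in the paper is concerned with.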
