Machine Translation and Divergence Study for English–Maithili

In terms of language technology, Maithili is a resource-poor language with almost no language technology resources. It is spoken in India and Nepal and is one of the 22 scheduled languages of India. English is a dominant language in India in terms of content and usage. However, since more than 90% of Indians do not use English, translation from English to Maithili becomes critical. The absence of basic tools for Maithili has hindered the creation of resources for machine translation (MT). The present work discusses efforts toward language technology resource (LTR) creation and a divergence study for a statistical English-Maithili MT (EMMT) system. Building any statistical MT (SMT) system requires sizeable parallel, aligned corpora for training and testing. Creating general-purpose source corpora for English and producing their translation equivalents with possible alignments require time and effort. The paper focuses on data collection methods, cleaning, the size and structure of the text corpora, alignment and parallelization strategies, training, testing, and a study of divergence between the two languages.