Building language resources for a Multi-Engine English-Filipino machine translation system

In this paper, we present the building of various language resources for a multi-engine, bi-directional English-Filipino Machine Translation (MT) system. Linguistic information on Philippine languages is available, but the focus to date has been on theoretical linguistics, and little has been done on the computational aspects of these languages. We therefore report on the manual construction of language resources such as the grammar, lexicon, morphological information, and corpora, which were built from almost non-existent digital forms. Due to the inherent difficulties of manual construction, we also discuss our experiments with various technologies for the automatic extraction of these resources, designed to handle the intricacies of the Filipino language and intended for use in the MT system. To implement the different MT engines and to improve translation quality, other language tools (such as a morphological analyzer and generator, and a part-of-speech tagger) were developed.
