Automatic Acquisition of Machine Translation Resources in the Abu-MaTran Project

This paper provides an overview of the research and development activities carried out within the Abu-MaTran project to alleviate the language-resource bottleneck in machine translation. We have developed a range of tools for acquiring the main resources required by the two most popular approaches to machine translation, i.e. statistical models (corpora) and rule-based models (dictionaries and rules). All these tools have been released under open-source licenses and have been developed with the aim of being useful for industrial exploitation.
