Extraction of Indonesian and English parallel sentences from movie subtitles

A parallel corpus is an essential resource for developing a machine-learning-based statistical translation engine. The size and coverage of the parallel corpus available for training directly affect the engine's translation accuracy. To make more training data available for developing translation engines in the conversational domain, we propose a method to extract parallel data from movie subtitles using dynamic time warping, cosine similarity, and beam search. The proposed method is capable of extracting 30% parallel sentences from a set of Indonesian-English movie subtitles with a precision of 98%.
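The abstract names the building blocks but not how they are combined, so the following is only a minimal sketch of two of them under stated assumptions, not the authors' implementation. It assumes dynamic time warping is run over subtitle start times (in seconds) and that cosine similarity is computed over bag-of-words token counts; the function names `dtw_align` and `cosine_similarity` are hypothetical, and the beam search step is not covered.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two token lists, treated as bags of words."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def dtw_align(times_a, times_b):
    """Dynamic time warping over two lists of subtitle start times.

    Assumption: the local cost is the absolute difference between start times.
    Returns the (i, j) index pairs on the minimum-cost warping path.
    """
    n, m = len(times_a), len(times_b)
    inf = float("inf")
    # cost[i][j] = minimum cumulative cost of aligning the first i and j items
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(times_a[i - 1] - times_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])

    # Backtrack from the end of both sequences to recover the path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        moves = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(moves)
    path.reverse()
    return path


if __name__ == "__main__":
    # Toy start times (seconds) for an Indonesian and an English subtitle file.
    print(dtw_align([1, 4, 9], [1, 4, 6, 9]))
```

In a pipeline of this kind, the warping path would propose candidate sentence pairs and a similarity score (for example, cosine similarity after mapping both sides into a shared vocabulary) would filter them; that combination is an assumption here, since the abstract does not spell it out.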
