Training Schemes for the Transliteration of the Balinese Script Into the Latin Script on Palm Leaf Manuscript Images

Considering the importance of the contents of the Balinese palm leaf manuscripts, a transliteration system has to be developed so that these manuscripts can be read easily. The challenge comes from the fact that the Balinese script is a syllabic script and the mapping between linguistic symbols and images of symbols is not straightforward. In addition, because very little training data is available, some adaptations of LSTM in the transliteration training scheme need to be designed, analyzed, and evaluated. This paper proposes and evaluates several adapted segmentation-free training schemes for the transliteration of the Balinese script into the Latin script from palm leaf manuscript images. We describe the generated synthetic dataset and the proposed training schemes at two different levels (word level and text line level) to transliterate real words and text lines from palm leaf manuscript images. For word transliteration, training schemes at word level generally perform better than training schemes at text line level. By comparison, the segmentation-based transliteration method gives a very promising result. For text line transliteration, the segmentation-based transliteration method outperforms all segmentation-free training schemes for the less degraded collections, while the segmentation-free training schemes contribute to transliterating the text lines of more degraded manuscripts. Training at text line level with a model pre-trained at word level could give a better result in word transliteration while still keeping optimal performance for text line transliteration.
