A method for solving scriptio continua in Javanese manuscript transliteration

Many Javanese manuscripts in Indonesia are held in museums and libraries. Most were written in local scripts that are rarely used in everyday life, so software that helps people read them is valuable. An essential step in the automatic transliteration of manuscript images is post-processing, which edits and concatenates syllables into words. The main difficulty in post-processing is that the script has no symbol for the space between words in a sentence, a situation known as the scriptio continua problem. This paper proposes methods based on the backtracking algorithm to solve scriptio continua in the post-processing step of Javanese manuscript image transliteration. The proposed methods use a depth-first search over relevant candidate words to decide whether or not to merge a new syllable. The methods were used to concatenate 17,687 syllables from the Hamong Tani book using a dictionary of 49,801 words, and the results were found to be satisfactory in terms of both computation and accuracy. The greedy and brute-force implementations both reach an accuracy of 81.64%; the greedy method, however, is more efficient than the brute-force method.
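The idea of a depth-first search with backtracking over candidate words can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `segment`, the `max_syllables` limit, the longest-candidate-first ordering, and the toy dictionary are all assumptions made for illustration, and the sketch simply gives up (returns `None`) when no complete segmentation exists.

```python
from typing import List, Optional, Set


def segment(syllables: List[str], dictionary: Set[str],
            max_syllables: int = 4) -> Optional[List[str]]:
    """Group syllables into dictionary words with depth-first backtracking.

    Hypothetical sketch: candidates are tried longest first (a greedy
    ordering), and the search backtracks to a shorter candidate when the
    remaining syllables cannot be segmented.
    """
    dead_ends = set()  # start positions already known to fail

    def dfs(start: int) -> Optional[List[str]]:
        if start == len(syllables):
            return []  # every syllable has been consumed
        if start in dead_ends:
            return None
        longest = min(max_syllables, len(syllables) - start)
        for k in range(longest, 0, -1):
            candidate = "".join(syllables[start:start + k])
            if candidate in dictionary:
                rest = dfs(start + k)
                if rest is not None:
                    return [candidate] + rest
                # otherwise backtrack and try a shorter candidate
        dead_ends.add(start)
        return None

    return dfs(0)


# Toy data (illustrative only, not taken from the paper's dictionary).
toy_dictionary = {"hamong", "tani"}
print(segment(["ha", "mong", "ta", "ni"], toy_dictionary))
```

Because the sketch is recursive, it is only suitable for short syllable sequences; a long text would need to be processed in chunks or with an explicit stack.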
