A Comparison Between Traditional Machine Learning Approaches And Deep Neural Networks For Text Processing In Romanian

This paper compares traditional machine learning approaches (decision trees and multilayer perceptrons) with deep neural networks, the latest trend in artificial intelligence, on three separate text processing tasks for Romanian: lexical stress assignment, syllabification, and phonetic transcription. The evaluation is performed on large, manually transcribed lexicons and uses simple input features derived strictly from the orthographic form of the words. The results show that the relative performance of the algorithms varies with the task and that, in a limited number of cases, decision trees can outperform deep neural networks.
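As an illustration of what features "derived strictly from the orthographic form" might look like, the following is a minimal sketch that turns each letter of a word into a fixed-size window of neighbouring letters. The window size, the padding symbol, and the function names are assumptions made for this example; the abstract does not specify the actual feature set used in the paper.

```python
from typing import List

PAD = "_"  # hypothetical padding symbol for positions outside the word


def window_features(word: str, index: int, size: int = 2) -> List[str]:
    """Return the letters in a window of +/- `size` around position `index`."""
    padded = PAD * size + word.lower() + PAD * size
    center = index + size  # position of the target letter in the padded string
    return list(padded[center - size : center + size + 1])


def word_to_instances(word: str, size: int = 2) -> List[List[str]]:
    """Build one feature window per letter, i.e. one classification instance each."""
    return [window_features(word, i, size) for i in range(len(word))]


if __name__ == "__main__":
    for letter, features in zip("masa", word_to_instances("masa")):
        print(letter, features)
```

Under this framing, each of the three tasks could be cast as per-letter classification (for example, whether a letter carries stress, whether a syllable boundary follows it, or which phoneme it maps to), so the same windows could feed either a decision tree or a neural network.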
