A unified lexical processing framework based on the Margin Infused Relaxed Algorithm. A case study on the Romanian Language

General natural language processing (NLP) and text-to-speech (TTS) applications rely on lexical-level processing steps such as lemmatization, syllabification, lexical stress prediction and phonetic transcription. These steps usually require knowledge of a word’s lexical composition (derivational morphology, inflectional affixes, etc.). For known words, applications typically consult lexicons, but out-of-vocabulary (OOV) words are always encountered and degrade the performance of NLP and speech synthesis applications. In such cases, either rule-based or data-driven techniques are used to process the OOV words automatically and generate the desired results. In this paper we describe how the above-mentioned tasks can be performed using a perceptron trained with the Margin Infused Relaxed Algorithm (MIRA) and sequence labeling.
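To make the idea concrete, the following is a minimal sketch of syllabification cast as letter-level labeling with a MIRA-style update. It is an illustration under stated assumptions, not the system described in the paper: the two-label scheme (`B`/`N`), the feature set, the toy training words and all function names are hypothetical, and each letter is classified independently rather than decoded as a full sequence.

```python
from collections import defaultdict

LABELS = ("N", "B")  # B: a syllable boundary follows this letter; N: no boundary


def features(word, i):
    """Sparse indicator features for letter i (hypothetical feature set)."""
    prev_ch = word[i - 1] if i > 0 else "^"
    next_ch = word[i + 1] if i + 1 < len(word) else "$"
    return ["bias", "cur=" + word[i], "prev=" + prev_ch, "next=" + next_ch]


def score(weights, feats, label):
    return sum(weights[(f, label)] for f in feats)


def mira_update(weights, feats, gold, C=1.0):
    """Single-constraint MIRA step: the smallest weight change that puts the
    gold label a margin of 1 above the best wrong label, capped at C."""
    rival = max((l for l in LABELS if l != gold),
                key=lambda l: score(weights, feats, l))
    hinge = 1.0 - (score(weights, feats, gold) - score(weights, feats, rival))
    if hinge <= 0:
        return  # margin already satisfied, leave the weights untouched
    # For distinct 0/1 features, the squared feature norm equals len(feats).
    tau = min(C, hinge / (2.0 * len(feats)))
    for f in feats:
        weights[(f, gold)] += tau
        weights[(f, rival)] -= tau


def to_labels(hyphenated):
    """Turn 'ma-re' into ('mare', ['N', 'B', 'N', 'N'])."""
    word, labels = "", []
    for ch in hyphenated:
        if ch == "-":
            labels[-1] = "B"
        else:
            word += ch
            labels.append("N")
    return word, labels


def train(examples, epochs=20):
    weights = defaultdict(float)
    for _ in range(epochs):
        for hyphenated in examples:
            word, labels = to_labels(hyphenated)
            for i, gold in enumerate(labels):
                mira_update(weights, features(word, i), gold)
    return weights


def syllabify(weights, word):
    out = []
    for i, ch in enumerate(word):
        out.append(ch)
        feats = features(word, i)
        pred = max(LABELS, key=lambda l: score(weights, feats, l))
        if pred == "B" and i < len(word) - 1:
            out.append("-")
    return "".join(out)


# Toy training set, for illustration only.
TRAIN = ["ma-re", "ta-re", "sa-re", "ca-le"]
weights = train(TRAIN)
print(syllabify(weights, "tare"))
```

The same update could in principle serve the other tasks named above by changing the label set (for example, stress or phoneme labels per letter) and the features. A fuller sequence labeler would add label-transition features and a decoder such as Viterbi, which this sketch omits.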
