Creating Data in Icelandic for Text Normalization

There is no natural source of normalized text data, so we set out to create data of sufficient quality to support more advanced text normalization methods. We manually annotated the first normalized corpus for Icelandic, comprising 40,000 sentences, and developed Regína, a rule-based text normalization system. On non-standard words, Regína reaches 90.83% accuracy against the manually annotated corpus, a significant improvement over an older normalization system for Icelandic. The normalized corpus and Regína will be released as open source.
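As a rough illustration of what accuracy on non-standard words could mean, the sketch below scores a system only on the tokens that the manual annotation changed. This is an assumption about the metric, not the evaluation procedure described by the authors: the function name `nsw_accuracy`, the one-to-one token alignment, the rule that a token counts as non-standard when its gold form differs from the source, and the toy sentence are all hypothetical.

```python
from typing import Optional, Sequence


def nsw_accuracy(
    source: Sequence[str],
    gold: Sequence[str],
    predicted: Sequence[str],
) -> Optional[float]:
    """Accuracy restricted to non-standard words (NSWs).

    Assumption: the three sequences are aligned token by token, and a token
    is treated as an NSW when the gold normalization differs from the source.
    """
    if not (len(source) == len(gold) == len(predicted)):
        raise ValueError("token sequences must be aligned one-to-one")

    # Keep only the positions the annotator actually normalized.
    nsw_pairs = [(g, p) for s, g, p in zip(source, gold, predicted) if s != g]
    if not nsw_pairs:
        return None  # no NSWs, so the accuracy is undefined

    correct = sum(1 for g, p in nsw_pairs if g == p)
    return correct / len(nsw_pairs)


# Toy example (invented for illustration, not taken from the corpus).
source = ["hann", "á", "3", "km", "eftir"]
gold = ["hann", "á", "þrjá", "kílómetra", "eftir"]
predicted = ["hann", "á", "þrjá", "kílómetrar", "eftir"]

score = nsw_accuracy(source, gold, predicted)
```

Restricting the score to changed tokens keeps the many words that need no normalization from inflating the result, which is one plausible reason to report accuracy on non-standard words separately.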
