Massively Multilingual Pronunciation Modeling with WikiPron

We introduce WikiPron, an open-source command-line tool for extracting pronunciation data from Wiktionary, a collaborative multilingual online dictionary. We first describe the design and use of WikiPron. We then discuss the challenges faced in scaling this tool to create an automatically generated database of 1.7 million pronunciations from 165 languages. Finally, we validate the pronunciation database by using it to train and evaluate a collection of generic grapheme-to-phoneme models. The software, pronunciation data, and models are all made available under permissive open-source licenses.
