CASR: A Corpus for Albanian Speech Recognition

This paper introduces the Corpus for Albanian Speech Recognition (CASR), intended for training and evaluating automatic speech recognition (ASR) models for the Albanian language. The corpus is based on a Bible audiobook and comprises 20 hours of transcribed audio, with both the transcripts and the audio in standard Albanian. To test and evaluate the corpus, an end-to-end speech recognition model based on deep learning is implemented. The results show that acoustic models trained on CASR give satisfactory results. The corpus will be freely available for independent research and provides a valuable resource for research on Albanian ASR.
