QASR: QCRI Aljazeera Speech Resource - A Large Scale Annotated Arabic Speech Corpus

We introduce the largest transcribed Arabic speech corpus, QASR, collected from the broadcast domain. This multi-dialect speech dataset contains 2,000 hours of speech sampled at 16 kHz, crawled from the Aljazeera news channel. The dataset is released with lightly supervised transcriptions, aligned with the audio segments. Unlike previous datasets, QASR contains linguistically motivated segmentation, punctuation, and speaker information, among other annotations. QASR is suitable for training and evaluating speech recognition systems, acoustics- and/or linguistics-based Arabic dialect identification, punctuation restoration, speaker identification, speaker linking, and potentially other NLP modules for spoken data. In addition to the QASR transcriptions, we release a dataset of 130M words to aid in designing and training a better language model. We show that an end-to-end automatic speech recognition system trained on QASR achieves a competitive word error rate compared to one trained on the previous MGB-2 corpus. We report baseline results for downstream natural language processing tasks, such as named entity recognition, using speech transcripts. We also report the first baseline for Arabic punctuation restoration. We make the corpus available to the research community.
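For readers unfamiliar with the metric, word error rate is conventionally defined as the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and a system hypothesis, divided by the number of reference words. The following is a minimal sketch under stated assumptions: whitespace tokenization, no text normalization, and a hypothetical function name `word_error_rate`. It is not the scoring pipeline used for the QASR experiments.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and compared verbatim;
    real evaluations usually normalize the text before scoring.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion against a four-word reference.
    print(word_error_rate("a b c d", "a x c"))
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.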
