Evaluating Transformer Models for Punctuation Restoration in Italian

In this paper, we present an evaluation of a Transformer-based punctuation restoration model for the Italian language. Experimenting with a BERT-base model, we perform several fine-tuning runs with different training data and sizes and test the resulting models in in-domain and cross-domain scenarios. Moreover, we offer a comparison in a multilingual setting with the same model fine-tuned on English transcriptions. Finally, we conclude with an error analysis of the model's main weaknesses related to specific punctuation marks.
