Automatic Truecasing of Video Subtitles Using BERT: A Multilingual Adaptable Approach

This paper describes an approach for automatic capitalization of text without case information, such as spoken transcripts of video subtitles, produced by automatic speech recognition systems. Our approach is based on pre-trained contextualized word embeddings, requires only a small portion of data for training when compared with traditional approaches, and is able to achieve state-of-the-art results. The paper reports experiments both on general written data from the European Parliament, and on video subtitles, revealing that the proposed approach is suitable for performing capitalization, not only in each one of the domains, but also in a cross-domain scenario. We have also created a versatile multilingual model, and the conducted experiments show that good results can be achieved both for monolingual and multilingual data. Finally, we applied domain adaptation by finetuning models, initially trained on general written data, on video subtitles, revealing gains over other approaches not only in performance but also in terms of computational cost.

[1]  Arnav Khare Joint Learning for Named Entity Recognition and Capitalization Generation , 2006 .

[2]  Ji-Hwan Kim,et al.  Automatic capitalisation generation for speech input , 2004, Comput. Speech Lang..

[3]  Luke S. Zettlemoyer,et al.  Deep Contextualized Word Representations , 2018, NAACL.

[4]  Helena Moniz,et al.  Bilingual Experiments on Automatic Recovery of Capitalization and Punctuation of Automatic Speech Transcripts , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[5]  Virginia Teller Review of Speech and language processing: an introduction to natural language processing, computational linguistics, and speech recognition by Daniel Jurafsky and James H. Martin. Prentice Hall 2000. , 2000 .

[6]  Binh Nguyen Thai,et al.  Recovering Capitalization for Automatic Speech Recognition of Vietnamese using Transformer and Chunk Merging , 2019, 2019 11th International Conference on Knowledge and Systems Engineering (KSE).

[7]  Binh Nguyen,et al.  Fast and Accurate Capitalization and Punctuation for Automatic Speech Recognition Using Transformer and Chunk Merging , 2019, 2019 22nd Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA).

[8]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[9]  Lucian Vlad Lita,et al.  tRuEcasIng , 2003, ACL.

[10]  Daniel Kondratyuk,et al.  75 Languages, 1 Model: Parsing Universal Dependencies Universally , 2019, EMNLP.

[11]  Sebastian Ruder,et al.  Universal Language Model Fine-tuning for Text Classification , 2018, ACL.

[12]  Alex Acero,et al.  Adaptation of Maximum Entropy Capitalizer: Little Data Can Help a Lo , 2006, Comput. Speech Lang..

[13]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[14]  George F. Foster,et al.  Truecasing For The Portage System , 2005 .

[15]  Ming-Wei Chang,et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[16]  Michiel Bacchiani,et al.  Restoring punctuation and capitalization in transcribed speech , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[17]  Fernando Batista,et al.  The impact of language dynamics on the capitalization of broadcast news , 2008, INTERSPEECH.

[18]  Steven Bethard,et al.  A Survey on Recent Advances in Named Entity Recognition from Deep Learning models , 2018, COLING.

[19]  Ralph Weischedel,et al.  PERFORMANCE MEASURES FOR INFORMATION EXTRACTION , 2007 .

[20]  David Yarowsky,et al.  DECISION LISTS FOR LEXICAL AMBIGUITY RESOLUTION: Application to Accent Restoration in Spanish and French , 1994, ACL.

[21]  Dipanjan Das,et al.  BERT Rediscovers the Classical NLP Pipeline , 2019, ACL.

[22]  Omer Levy,et al.  RoBERTa: A Robustly Optimized BERT Pretraining Approach , 2019, ArXiv.

[23]  Veselin Stoyanov,et al.  Unsupervised Cross-lingual Representation Learning at Scale , 2019, ACL.

[24]  Fernando Batista,et al.  Language Dynamics and Capitalization using Maximum Entropy , 2008, ACL.

[25]  Fernando Batista,et al.  Recovering capitalization and punctuation marks for automatic speech recognition: Case study for Portuguese broadcast news , 2008, Speech Commun..

[26]  Douglas A. Reynolds,et al.  Measuring the readability of automatic speech-to-text transcripts , 2003, INTERSPEECH.

[27]  Martin Raab,et al.  The ISL TC-STAR Spring 2006 ASR Evaluation Systems , 2006 .

[28]  R'emi Louf,et al.  HuggingFace's Transformers: State-of-the-art Natural Language Processing , 2019, ArXiv.

[29]  Daniel Marcu,et al.  Capitalizing Machine Translation , 2006, NAACL.