Should we use movie subtitles to study linguistic patterns of conversational speech? A study based on French, English and Taiwan Mandarin

Linguistic research benefits from the wide range of resources and software tools developed for natural language processing (NLP) tasks. However, NLP has a strong historical bias towards written language, which often makes these resources and tools inadequate for research questions about the linguistic patterns of spontaneous speech. In this preliminary study, we investigate whether corpora of movie and TV subtitles can be used to estimate data-driven NLP models adapted to conversational speech. In particular, the presented work explores lexical and syntactic distributional aspects across three genres (conversational, written and subtitles) and three languages (French, English and Taiwan Mandarin). Ongoing work focuses on comparing these three genres on the basis of deeper syntactic conversational patterns, using graph-based modelling and visualisation.
