Tale following: Real-time speech recognition applied to live performance

This paper describes a system for tale following, that is, speaker-independent but text-dependent speech recognition followed by automatic alignment. The aim of the system is to follow, in real time, the progress of actors reading a text in order to trigger audio events automatically. The speech recognition engine is the well-known Sphinx from CMU. We used its real-time implementation, pocketsphinx, which is based on Sphinx II, with the French acoustic models developed at LIUM. Extensive testing with 21 speakers from the PFC corpus (excerpts in "standard French") shows that the system achieves reasonable performance, with a Word Error Rate (WER) of around 30\%. However, testing on a recording made during rehearsals shows that performance is somewhat worse in real conditions, with a WER of 40\%. The strategy we devised for the final application therefore includes a constrained automatic alignment algorithm. The aligner is derived from an algorithm used to analyse biological DNA sequences. With the whole system, the experiments show that events are triggered with an average delay of 9 s ($\pm$ 8 s). The system is integrated into Max/MSP, a widely used real-time sound processing environment, which is used here to trigger audio events but could also trigger other kinds of events such as lights or videos.
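The alignment step can be illustrated with a minimal sketch, assuming a Needleman-Wunsch-style dynamic programme over words. It is not the authors' constrained aligner, whose details are not given here. The function name `follow_position` and the scores (`match`, `mismatch`, `gap`) are hypothetical choices made for illustration. The recognised words are aligned against the script from its beginning, the alignment may end anywhere in the script, and the best end point is taken as the reader's current position.

```python
def follow_position(script, heard, match=2, mismatch=-1, gap=-1):
    """Estimate how many script words have been read so far.

    Aligns all recognised words (`heard`) against a prefix of `script`
    and returns the length of the best-scoring prefix.
    The scoring values are illustrative assumptions.
    """
    n, m = len(heard), len(script)
    # Row 0: no word heard yet, so each skipped script word costs one gap.
    prev = [j * gap for j in range(m + 1)]
    for i in range(1, n + 1):
        # Column 0: heard words with no script counterpart (insertions).
        cur = [i * gap] + [0] * m
        for j in range(1, m + 1):
            sub = match if heard[i - 1] == script[j - 1] else mismatch
            cur[j] = max(
                prev[j - 1] + sub,  # word matched or misrecognised
                prev[j] + gap,      # recogniser inserted an extra word
                cur[j - 1] + gap,   # script word was missed
            )
        prev = cur
    # The end of the script is free: pick the best-scoring prefix length.
    return max(range(m + 1), key=lambda j: prev[j])


script = "once upon a time there was a little girl who lived in a village".split()
heard = "once upon the time there was little girl".split()  # one substitution, one deletion
position = follow_position(script, heard)
```

In a live setting, an event attached to a given script word would be triggered once the returned position passes that word. Because recognition errors only lower the alignment score rather than break the alignment, this kind of approach can tolerate the error rates of 30\% to 40\% reported above.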
