Extension du vocabulaire d’un système de transcription avec de nouveaux noms propres en utilisant un corpus diachronique

Proper names are often key to understanding the information contained in a document. Our work focuses on increasing the vocabulary size of a speech transcription system by automatically retrieving proper names from contemporary diachronic text documents. We propose methods that dynamically augment the vocabulary of the automatic speech recognition (ASR) system, using lexical and temporal features. We assume that the same proper names frequently appear in documents relating to the same time period. We study a method based on mutual information and propose a new method based on cosine similarity to retrieve new proper names. In this new method, the context of a proper name is represented by a vector space model (bag of words). We also study different metrics for proper name selection in order to limit the vocabulary augmentation and therefore its impact on ASR performance. Recognition results show a significant reduction of the word error rate when the vocabulary is augmented with the retrieved proper names.
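The cosine-similarity retrieval described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify term weighting, context window size, or how temporal features enter the score, so the sketch assumes raw term counts and a fixed `top_n` cutoff. The function names (`bag_of_words`, `cosine`, `rank_proper_names`) and the toy data are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(tokens):
    """Build a term-count vector (bag of words) from a list of tokens."""
    return Counter(tokens)


def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * v[term] for term, count in u.items() if term in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_proper_names(transcript_tokens, name_contexts, top_n):
    """Rank candidate proper names by the similarity of their contexts
    (taken from the diachronic corpus) to the transcript, keeping only
    the top_n names to limit the vocabulary augmentation.
    Assumption: raw counts, no stop-word filtering, no temporal weighting."""
    query = bag_of_words(transcript_tokens)
    scored = [
        (name, cosine(query, bag_of_words(context)))
        for name, context in name_contexts.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy data (hypothetical): words observed near each candidate name
# in text documents from the same time period as the audio.
transcript = "the president met the chancellor in berlin to discuss the budget".split()
contexts = {
    "Dupont": "chancellor berlin budget meeting president".split(),
    "Martin": "tennis match final tournament".split(),
    "Durand": "president paris budget reform".split(),
}
selected = rank_proper_names(transcript, contexts, top_n=2)
```

In this toy setting, the names whose contexts share the most vocabulary with the transcript are ranked first, and the cutoff plays the role of the selection metric that keeps the added vocabulary small.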

Georges Linarès | Irina Illina | Dominique Fohr