Indexation sonore : recherche de composantes primaires pour une structuration audiovisuelle (Audio indexing: search for primary components for audiovisual structuring)

The rapid growth of digital data and the explosion of multimedia access to information are held back by a lack of effective automatic tools. In this context, we propose several approaches to indexing and structuring the soundtrack of audiovisual documents. Their goal is to detect primary components such as speech, music and key sounds (jingles, characteristic sounds, keywords...). For speech/music classification, three unusual features are extracted: entropy modulation, segment duration (obtained from an automatic segmentation) and the number of these segments per second. The information from these three features is then fused with that from the 4 Hz energy modulation. Experiments on a radio broadcast corpus show the robustness of these features: our system achieves a correct classification rate above 90%. The system is then compared with, and subsequently fused with, a classical system based on Gaussian Mixture Models (GMM) and cepstral analysis. Another partitioning consists in detecting key sounds. Potential jingle candidates are selected by comparing the "signature" of each jingle with the data stream. This system is simple to implement, yet fast and very effective: on an audiovisual corpus of about ten hours (roughly 200 jingles), there are no false alarms and only two missed detections, both under extreme conditions. Characteristic sounds (applause and laughter) are modelled with GMMs in the spectral domain. A television corpus validates this first study with encouraging results. Keyword detection is performed in a classical manner: the aim here is not to improve existing systems but to serve the need for structuring. These keywords thus indicate the type of programme (news, weather, documentary...).
Thanks to the extraction of these primary components, audiovisual programmes can be annotated automatically. Through two studies, we examine how these components can be used to find a temporal structure in documents. The first study detects a recurring pattern in a collection of studio-based programmes ("emissions de plateau"), while the second structures a television news broadcast into topics. Some lines of thought on the contribution of video analysis are developed, and future needs are explored.
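
The comparison of a jingle "signature" with the data stream can be illustrated with a minimal sketch. The abstract does not specify the features, the distance or the decision rule, so everything below is an assumption made for illustration: the signature is taken to be a short sequence of feature vectors, the comparison is a mean Euclidean distance computed in a sliding window, and `threshold` is a hypothetical tuning parameter. The function names are likewise hypothetical.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def signature_distance(signature, stream, start):
    """Mean frame-wise distance between the signature and the stream window at `start`."""
    total = sum(frame_distance(s, stream[start + i]) for i, s in enumerate(signature))
    return total / len(signature)


def detect_jingle(signature, stream, threshold):
    """Return the start indices where the signature is judged present in the stream.

    Assumed decision rule (not taken from the source): a position is kept when
    its distance is below `threshold` and is a local minimum of the distance
    curve, so that one occurrence yields one detection.
    """
    n, m = len(stream), len(signature)
    if m == 0 or n < m:
        return []
    dists = [signature_distance(signature, stream, t) for t in range(n - m + 1)]
    hits = []
    for t, d in enumerate(dists):
        if d >= threshold:
            continue
        left = dists[t - 1] if t > 0 else math.inf
        right = dists[t + 1] if t + 1 < len(dists) else math.inf
        if d <= left and d < right:
            hits.append(t)
    return hits


# Toy example with one-dimensional "features": the jingle starts at frame 4.
jingle = [[1.0], [2.0], [3.0]]
stream = [[5.0]] * 4 + jingle + [[5.0]] * 3
print(detect_jingle(jingle, stream, threshold=0.5))  # prints [4]
```

In such a scheme the threshold trades false alarms against missed detections; a strict threshold would be consistent with the reported absence of false alarms, although the actual criterion used in the work is not stated here.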
