Paper information - MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval

MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval

List of Acronyms. List of Symbols.

1. Introduction. 1.1 Audio Content Description. 1.2 MPEG-7 Audio Content Description - An Overview. 1.2.1 MPEG-7 Low-Level Descriptors. 1.2.2 MPEG-7 Description Schemes. 1.2.3 MPEG-7 Description Definition Language (DDL). 1.2.4 BiM (Binary Format for MPEG-7). 1.3 Organization of the Book.

2. Low-Level Descriptors. 2.1 Introduction. 2.2 Basic Parameters and Notations. 2.2.1 Time Domain. 2.2.2 Frequency Domain. 2.3 Scalable Series. 2.3.1 Series of Scalars. 2.3.2 Series of Vectors. 2.3.3 Binary Series. 2.4 Basic Descriptors. 2.4.1 Audio Waveform. 2.4.2 Audio Power. 2.5 Basic Spectral Descriptors. 2.5.1 Audio Spectrum Envelope. 2.5.2 Audio Spectrum Centroid. 2.5.3 Audio Spectrum Spread. 2.5.4 Audio Spectrum Flatness. 2.6 Basic Signal Parameters. 2.6.1 Audio Harmonicity. 2.6.2 Audio Fundamental Frequency. 2.7 Timbral Descriptors. 2.7.1 Temporal Timbral: Requirements. 2.7.2 Log Attack Time. 2.7.3 Temporal Centroid. 2.7.4 Spectral Timbral: Requirements. 2.7.5 Harmonic Spectral Centroid. 2.7.6 Harmonic Spectral Deviation. 2.7.7 Harmonic Spectral Spread. 2.7.8 Harmonic Spectral Variation. 2.7.9 Spectral Centroid. 2.8 Spectral Basis Representations. 2.9 Silence Segment. 2.10 Beyond the Scope of MPEG-7. 2.10.1 Other Low-Level Descriptors. 2.10.2 Mel-Frequency Cepstrum Coefficients. References.

3. Sound Classification and Similarity. 3.1 Introduction. 3.2 Dimensionality Reduction. 3.2.1 Singular Value Decomposition (SVD). 3.2.2 Principal Component Analysis (PCA). 3.2.3 Independent Component Analysis (ICA). 3.2.4 Non-Negative Factorization (NMF). 3.3 Classification Methods. 3.3.1 Gaussian Mixture Model (GMM). 3.3.2 Hidden Markov Model (HMM). 3.3.3 Neural Network (NN). 3.3.4 Support Vector Machine (SVM). 3.4 MPEG-7 Sound Classification. 3.4.1 MPEG-7 Audio Spectrum Projection (ASP) Feature Extraction. 3.4.2 Training Hidden Markov Models (HMMs). 3.4.3 Classification of Sounds. 3.5 Comparison of MPEG-7 Audio Spectrum Projection vs. MFCC Features. 3.6 Indexing and Similarity. 3.6.1 Audio Retrieval Using Histogram Sum of Squared Differences. 3.7 Simulation Results and Discussion. 3.7.1 Plots of MPEG-7 Audio Descriptors. 3.7.2 Parameter Selection. 3.7.3 Results for Distinguishing Between Speech, Music and Environmental Sound. 3.7.4 Results of Sound Classification Using Three Audio Taxonomy Methods. 3.7.5 Results for Speaker Recognition. 3.7.6 Results of Musical Instrument Classification. 3.7.7 Audio Retrieval Results. 3.8 Conclusions. References.

4. Spoken Content. 4.1 Introduction. 4.2 Automatic Speech Recognition. 4.2.1 Basic Principles. 4.2.2 Types of Speech Recognition Systems. 4.2.3 Recognition Results. 4.3 MPEG-7 SpokenContent Description. 4.3.1 General Structure. 4.3.2 SpokenContentHeader. 4.3.3 SpokenContentLattice. 4.4 Application: Spoken Document Retrieval. 4.4.1 Basic Principles of IR and SDR. 4.4.2 Vector Space Models. 4.4.3 Word-Based SDR. 4.4.4 Sub-Word-Based Vector Space Models. 4.4.5 Sub-Word String Matching. 4.4.6 Combining Word and Sub-Word Indexing. 4.5 Conclusions. 4.5.1 MPEG-7 Interoperability. 4.5.2 MPEG-7 Flexibility. 4.5.3 Perspectives. References.

5. Music Description Tools. 5.1 Timbre. 5.1.1 Introduction. 5.1.2 InstrumentTimbre. 5.1.3 HarmonicInstrumentTimbre. 5.1.4 PercussiveInstrumentTimbre. 5.1.5 Distance Measures. 5.2 Melody. 5.2.1 Melody. 5.2.2 Meter. 5.2.3 Scale. 5.2.4 Key. 5.2.5 MelodyContour. 5.2.6 MelodySequence. 5.3 Tempo. 5.3.1 AudioTempo. 5.3.2 AudioBPM. 5.4 Application Example: Query-by-Humming. 5.4.1 Monophonic Melody Transcription. 5.4.2 Polyphonic Melody Transcription. 5.4.3 Comparison of Melody Contours. References.

6. Fingerprinting and Audio Signal Quality. 6.1 Introduction. 6.2 Audio Signature. 6.2.1 Generalities on Audio Fingerprinting. 6.2.2 Fingerprint Extraction. 6.2.3 Distance and Searching Methods. 6.2.4 MPEG-7-Standardized AudioSignature. 6.3 Audio Signal Quality. 6.3.1 AudioSignalQuality Description Scheme. 6.3.2 BroadcastReady. 6.3.3 IsOriginalMono. 6.3.4 BackgroundNoiseLevel. 6.3.5 CrossChannelCorrelation. 6.3.6 RelativeDelay. 6.3.7 Balance. 6.3.8 DcOffset. 6.3.9 Bandwidth. 6.3.10 TransmissionTechnology. 6.3.11 ErrorEvent and ErrorEventList. References.

7. Application. 7.1 Introduction. 7.2 Automatic Audio Segmentation. 7.2.1 Feature Extraction. 7.2.2 Segmentation. 7.2.3 Metric-Based Segmentation. 7.2.4 Model-Selection-Based Segmentation. 7.2.5 Hybrid Segmentation. 7.2.6 Hybrid Segmentation Using MPEG-7 ASP. 7.2.7 Segmentation Results. 7.3 Sound Indexing and Browsing of Home Video Using Spoken Annotations. 7.3.1 A Simple Experimental System. 7.3.2 Retrieval Results. 7.4 Highlights Extraction for Sport Programmes Using Audio Event Detection. 7.4.1 Goal Event Segment Selection. 7.4.2 System Results. 7.5 A Spoken Document Retrieval System for Digital Photo Albums. References.

Index.

[1] Hynek Hermansky,et al. RASTA processing of speech , 1994, IEEE Trans. Speech Audio Process..

[2] Constantin Papaodysseus,et al. A New Approach to the Automatic Recognition of Musical Recordings , 2001 .

[3] Biing-Hwang Juang,et al. Fundamentals of speech recognition , 1993, Prentice Hall signal processing series.

[4] John C. Platt,et al. Extracting noise-robust features from audio data , 2002, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[5] Ray Meddis,et al. Virtual pitch and phase sensitivity of a computer model of the auditory periphery , 1991 .

[6] P. Smaragdis,et al. Non-negative matrix factorization for polyphonic music transcription , 2003, 2003 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (IEEE Cat. No.03TH8684).

[7] David Anthony James,et al. The Application of Classical Information Retrieval Techniques to Spoken Documents , 1995 .

[8] Herbert Gish,et al. Segregation of speakers for speech recognition and speaker identification , 1991, [Proceedings] ICASSP 91: 1991 International Conference on Acoustics, Speech, and Signal Processing.

[9] Peng Yu,et al. An improved model-based speaker segmentation system , 2003, INTERSPEECH.

[10] E. Batlle,et al. Automatic Song Identification in Noisy Broadcast Audio , 2002 .

[11] David Malah,et al. Tracking speech-presence uncertainty to improve speech enhancement in non-stationary noise environments , 1999, 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258).

[12] Anssi Klapuri,et al. Signal Processing Methods for the Automatic Transcription of Music , 2004 .

[13] Larry P. Heck,et al. Speaker tracking and detection with multiple speakers , 1999, EUROSPEECH.

[14] Stanley Boykin,et al. Audio Hot Spotting and Retrieval using Multiple Features , 2004, HLT-NAACL 2004.

[15] Martha Larson,et al. Using syllable-based indexing features and language models to improve German spoken document retrieval , 2003, INTERSPEECH.

[16] Ian H. Witten,et al. Signal processing for melody transcription , 1995 .

[17] Thomas Sikora,et al. Speech enhancement based on smoothing of spectral noise floor , 2004, INTERSPEECH.

[18] Thomas Sikora,et al. Evaluation of distance measures for MPEG-7 melody contours , 2004, IEEE 6th Workshop on Multimedia Signal Processing, 2004..

[19] Thomas Sikora,et al. Automatic segmentation of speakers in broadcast audio material , 2003, IS&T/SPIE Electronic Imaging.

[20] Lie Lu,et al. A robust audio classification and segmentation method , 2001, MULTIMEDIA '01.

[21] Kunio Kashino,et al. Very quick audio searching: introducing global pruning to the Time-Series Active Search , 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).

[22] Philip N. Garner,et al. Representation and linking mechanisms for audio in MPEG-7 , 2000, Signal Process. Image Commun..

[23] Peter Schäuble,et al. A system for retrieving speech documents , 1992, SIGIR '92.

[24] Thomas Sikora,et al. Comparison of MPEG-7 audio spectrum projection features and MFCC applied to speaker recognition, sound classification and audio segmentation , 2004, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[25] Pedro Cano,et al. Audio Watermarking and Fingerprinting: For Which Applications? , 2003 .

[26] Chin-Hui Lee,et al. Automatic recognition of keywords in unconstrained speech using hidden Markov models , 1990, IEEE Trans. Acoust. Speech Signal Process..

[27] Karen Sparck Jones,et al. Spoken Document Retrieval for TREC-8 at Cambridge University , 1998, TREC.

[28] Frank Kurth,et al. Identification of Highly Distorted Audio Material for Querying Large Scale Data Bases , 2002 .

[29] Seungjin Choi,et al. Non-negative component parts of sound for classification , 2003, Proceedings of the 3rd IEEE International Symposium on Signal Processing and Information Technology (IEEE Cat. No.03EX795).

[30] L. Varga,et al. Short-term sound stream characterization for reliable, real-time occurrence monitoring of given sound-prints , 2000, 2000 10th Mediterranean Electrotechnical Conference. Information Technology and Electrotechnology for the Mediterranean Countries. Proceedings. MeleCon 2000 (Cat. No.00CH37099).

[31] Alexander G. Hauptmann,et al. Speech Recognition and Information Retrieval: Experiments in Retrieving Spoken Documents , 1997 .

[32] Pedro Cano,et al. A review of algorithms for audio fingerprinting , 2002, 2002 IEEE Workshop on Multimedia Signal Processing..

[33] Tsuhan Chen,et al. Audio Feature Extraction and Analysis for Scene Segmentation and Classification , 1998, J. VLSI Signal Process..

[34] Ian H. Witten,et al. Towards the digital music library: tune retrieval from acoustic input , 1996, DL '96.

[35] Marc Leman,et al. An Auditory Model Based Transcriber of Singing Sequences , 2002, ISMIR.

[36] Frederick Jelinek,et al. Statistical methods for speech recognition , 1997 .

[37] Martin Kaltenbrunner,et al. Statistical Significance in Song-Spotting in Audio , 2001 .

[38] Stephen W. Hainsworth,et al. Techniques for the Automated Analysis of Musical Audio , 2004 .

[39] Vijay Balasubramanian,et al. Speech-Based Retrieval Using Semantic Co-Occurrence Filtering , 1994, HLT.

[40] H. Sebastian Seung,et al. Algorithms for Non-negative Matrix Factorization , 2000, NIPS.

[41] Thomas Niesler,et al. Experiments in broadcast news transcription , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[42] Les E. Atlas,et al. Modulation frequency features for audio fingerprinting , 2002, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[43] Alexander H. Waibel,et al. Strategies for automatic segmentation of audio data , 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[44] Masataka Goto. A predominant-F0 estimation method for CD recordings: MAP estimation using EM algorithm for adaptive tone models , 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).

[45] Frédéric Bimbot,et al. Language modeling by variable length sequences: theoretical formulation and evaluation of multigrams , 1995, 1995 International Conference on Acoustics, Speech, and Signal Processing.

[46] Andrew K. Halberstadt. Heterogeneous acoustic measurements and multiple classifiers for speech recognition , 1999 .

[47] Francine Chen,et al. Segmentation of speech using speaker identification , 1994, Proceedings of ICASSP '94. IEEE International Conference on Acoustics, Speech and Signal Processing.

[48] Martin F. Porter,et al. An algorithm for suffix stripping , 1997, Program.

[49] J. G. Lourens. Detection and Logging Advertisements using its Sound , 1990, IEEE South African Symposium on Communications and Signal Processing.

[50] Lawrence R. Rabiner,et al. A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.

[51] Min Chen,et al. Detection of Soccer Goal Shots Using Joint Multimedia Features and Classification Rules , 2003 .

[52] Thomas Sikora,et al. Audio classification based on MPEG-7 spectral basis representations , 2004, IEEE Transactions on Circuits and Systems for Video Technology.

[53] Anssi Klapuri,et al. Means of Integrating Audio Content Analysis Algorithms , 2001 .

[54] Ton Kalker,et al. A Highly Robust Audio Fingerprinting System , 2002, ISMIR.

[55] Thomas Sikora,et al. BeatBank - An MPEG-7 Compliant Query by Tapping System , 2004 .

[56] Ricardo A. Baeza-Yates,et al. Fast and Practical Approximate String Matching , 1992, Inf. Process. Lett..

[57] Helmut Neuschmied,et al. Robust Sound Modeling for Song Detection in Broadcast Audio , 2002 .

[58] Thomas Sikora,et al. A Query by Humming System using MPEG-7 Descriptors , 2004 .

[59] David F. Percy. Cluster Analysis (3rd Edition) , 1994 .

[60] Thomas Sikora,et al. Phonetic confusion based document expansion for spoken document retrieval , 2004, INTERSPEECH.

[61] Erwin M. Bakker,et al. Semantic Video Retrieval Using Audio Analysis , 2002, CIVR.

[62] Peter Kabal,et al. The computation of line spectral frequencies using Chebyshev polynomials , 1986, IEEE Trans. Acoust. Speech Signal Process..

[63] Slava M. Katz,et al. Estimation of probabilities from sparse data for the language model component of a speech recognizer , 1987, IEEE Trans. Acoust. Speech Signal Process..

[64] Lutz Prechelt,et al. An interface for melody input , 2001, TCHI.

[65] Mark A. Clements,et al. Scoring Algorithms for Wordspotting Systems , 2004, HLT-NAACL 2004.

[66] B. S. Manjunath,et al. Introduction to MPEG-7 , 2002 .

[67] Ross Wilkinson,et al. Experiments in spoken document retrieval using phoneme n-grams , 2000, Speech Commun..

[68] Masataka Goto,et al. A robust predominant-F0 estimation method for real-time detection of melody and bass lines in CD recordings , 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[69] Peng Yu,et al. A hybrid word / phoneme-based approach for improved vocabulary-independent search in spontaneous speech , 2004, INTERSPEECH.

[70] Pedro Cano,et al. Mixed Watermarking-Fingerprinting Approach for Integrity Verification of Audio Recordings , 2002 .

[71] Jean-Luc Gauvain,et al. The LIMSI SDR System for TREC-8 , 1999, TREC.

[72] H. Sebastian Seung,et al. Learning the parts of objects by non-negative matrix factorization , 1999, Nature.

[73] S. Robertson. The probability ranking principle in IR , 1997 .

[74] H Hermansky,et al. Perceptual linear predictive (PLP) analysis of speech. , 1990, The Journal of the Acoustical Society of America.

[75] Dragutin Petkovic,et al. Phonetic confusion matrix based spoken document retrieval , 2000, SIGIR '00.

[76] R. C. Rose,et al. Keyword detection in conversational speech utterances using hidden Markov model based continuous speech recognition , 1995, Comput. Speech Lang..

[77] Douglas A. Reynolds,et al. Speaker identification and verification using Gaussian mixture speaker models , 1995, Speech Commun..

[78] Takehito Utsuro,et al. Keyword recognition and extraction by multiple-LVCSRs with 60,000 words in speech-driven WEB retrieval task , 2004, INTERSPEECH.

[79] Justin Zobel,et al. Music Ranking Techniques Evaluated , 2000, ISMIR.

[80] H. Gish,et al. Text-independent speaker identification , 1994, IEEE Signal Processing Magazine.

[81] Jean-Luc Gauvain,et al. Partitioning and transcription of broadcast news data , 1998, ICSLP.

[82] J. P. Campbell Jr.,et al. Speaker recognition: a tutorial , 1997, Proc. IEEE.

[83] S. R. Subramanya,et al. Transform-based indexing of audio data for multimedia databases , 1997, Proceedings of IEEE International Conference on Multimedia Computing and Systems.

[84] Kenney Ng,et al. Subword-based approaches for spoken document retrieval , 2000, Speech Commun..

[85] Eric D. Scheirer,et al. Tempo and beat analysis of acoustic musical signals. , 1998, The Journal of the Acoustical Society of America.

[86] Chng Eng Siong,et al. Sports highlight detection from keyword sequences using HMM , 2004, 2004 IEEE International Conference on Multimedia and Expo (ICME) (IEEE Cat. No.04TH8763).

[87] Ramarathnam Venkatesan,et al. A Perceptual Audio Hashing Algorithm: A Tool for Robust Audio Identification and Information Hiding , 2001, Information Hiding.

[88] Fabio Crestani,et al. Using semantic and phonetic term similarity for spoken document retrieval and spoken query processing , 2002 .

[89] Thomas Sikora,et al. Combination of phone N-grams for a MPEG-7-based spoken document retrieval system , 2004, 2004 12th European Signal Processing Conference.

[90] Dragutin Petkovic,et al. Towards robust features for classifying audio in the CueVideo system , 1999, MULTIMEDIA '99.

[91] Martin Wechsler,et al. Spoken document retrieval based on phoneme recognition , 1998 .

[92] Karen Spärck Jones,et al. Retrieving spoken documents by combining multiple index sources , 1996, SIGIR '96.

[93] M. Sugiyama,et al. Speech segmentation and clustering based on speaker features , 1993, 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[94] Eric Allamanche,et al. Content-based Identification of Audio Material Using MPEG-7 Low Level Description , 2001, ISMIR.

[95] Douglas B. Paul. An Efficient A* Stack Decoder Algorithm for Continuous Speech Recognition with a Stochastic Language Model , 1992, HLT.

[96] Holger H. Hoos,et al. GUIDO/MIR - an Experimental Musical Information Retrieval System based on GUIDO Music Notation , 2001, ISMIR.

[97] Stephen E. Robertson,et al. Okapi at TREC-6 Automatic ad hoc, VLC, routing, filtering and QSDR , 1997, TREC.

[98] John H. L. Hansen,et al. Unsupervised audio stream segmentation and clustering via the Bayesian information criterion , 2000, INTERSPEECH.

[99] Kenney Ng. Towards robust methods for spoken document retrieval , 1998, ICSLP.

[100] Aaron E. Rosenberg,et al. Unsupervised speaker segmentation of telephone conversations , 2002, INTERSPEECH.

[101] S. Furui,et al. Cepstral analysis technique for automatic speaker verification , 1981 .

[102] Douglas A. Reynolds,et al. Blind clustering of speech utterances based on speaker and language characteristics , 1998, ICSLP.

[103] Barry Vercoe,et al. Melody retrieval on the web , 2001, IS&T/SPIE Electronic Imaging.

[104] Emanuele Pollastri. An Audio Front End for Query-by-Humming Systems , 2001, ISMIR.

[105] Markus Cremer,et al. Scalable robust audio fingerprinting using MPEG-7 content description , 2002, 2002 IEEE Workshop on Multimedia Signal Processing..

[106] M. A. Siegler,et al. Automatic Segmentation, Classification and Clustering of Broadcast News Audio , 1997 .

[107] H. Gish,et al. An unsupervised, sequential learning algorithm for the segmentation of speech waveforms with multiple speakers , 1992, [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[108] Regunathan Radhakrishnan,et al. Audio events detection based highlights extraction from baseball, golf and soccer games in a unified framework , 2003, 2003 International Conference on Multimedia and Expo. ICME '03. Proceedings (Cat. No.03TH8698).

[109] Preeti Rao,et al. Pitch Detection of the Singing Voice in Musical Audio , 2003 .

[110] Lie Lu,et al. Speaker change detection and tracking in real-time news broadcasting analysis , 2002, MULTIMEDIA '02.

[111] Thomas Sikora,et al. How Efficient is MPEG-7 for General Sound Recognition? , 2004 .

[112] Gerard Salton,et al. Term-Weighting Approaches in Automatic Text Retrieval , 1988, Inf. Process. Manag..

[113] Beth Logan,et al. Word and sub-word indexing approaches for reducing the effects of OOV queries on spoken audio , 2002 .

[114] Youngmoo E. Kim,et al. Analysis of a Contour-based Representation for Melody , 2000, ISMIR.

[115] Robert M. Gray,et al. An Algorithm for Vector Quantizer Design , 1980, IEEE Trans. Commun..

[116] P. Boersma. Accurate Short-Term Analysis of the Fundamental Frequency and the Harmonics-to-Noise Ratio of a Sampled Sound , 1993 .

[117] J. Zobel,et al. Matching Techniques for Large Music Databases , 1999 .

[118] Vincent Kanade,et al. Clustering Algorithms , 2021, Wireless RF Energy Transfer in the Massive IoT Era.

[119] Ellen M. Voorhees,et al. Overview of the Seventh Text REtrieval Conference , 1998 .

[120] Vladimir I. Levenshtein,et al. Binary codes capable of correcting deletions, insertions, and reversals , 1965 .

[121] Nicolas Moreau,et al. Phone-based Spoken Document Retrieval in Conformance with the MPEG-7 Standard , 2004 .

[122] Hsin-Min Wang,et al. A sequential metric-based audio segmentation method via the Bayesian information criterion , 2003, INTERSPEECH.

[123] Ricardo A. Baeza-Yates,et al. Searching in metric spaces , 2001, CSUR.

[124] Ramesh A. Gopinath,et al. Improved speaker segmentation and segments clustering using the bayesian information criterion , 1999, EUROSPEECH.

[125] Lie Lu,et al. UBM-based real-time speaker segmentation for broadcasting news , 2003, 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03)..

[126] Anders Arpteg. Information Retrieval Techniques , 1999 .

[127] Christian Wellekens,et al. DISTBIC: A speaker-based segmentation for audio data indexing , 2000, Speech Commun..

[128] Timo Viitaniemi,et al. Probabilistic models for the transcription of single-voice melodies , 2003 .