Contributions to music semantic analysis and its acceleration techniques

Digital music production has exploded over the past decade. The resulting huge amount of data drives the development of effective and efficient methods for automatic music analysis and retrieval. This thesis focuses on the semantic analysis of music, in particular mood and genre classification, using low-level and mid-level features, since mood and genre are among the most natural semantic concepts expressed by music and perceivable by audiences. To extract semantics from low-level features, feature modeling techniques such as k-means- and GMM-based bag-of-words (BoW) and Gaussian super vectors must be applied. In this big-data era, time and accuracy efficiency become the main issues in low-level feature modeling. Our first contribution therefore focuses on accelerating the k-means, GMM, and UBM-MAP frameworks, both on a single machine and on a cluster of workstations. To achieve maximum speed on a single machine, we show that dictionary learning procedures can be elegantly rewritten in matrix form, which can be accelerated efficiently by high-performance parallel computing infrastructures such as multi-core CPUs and GPUs. In particular, with GPU support and careful tuning, we achieve a two-orders-of-magnitude speed-up compared with a single-threaded implementation. For data sets that cannot fit into the memory of a single computer, we show that the k-means and GMM training procedures can be reformulated in the map-reduce pattern and executed on Hadoop and Spark clusters. Our matrix-form version runs 5 to 10 times faster on Hadoop and Spark clusters than state-of-the-art libraries. Besides signal-level features, mid-level features such as the harmony of music, the most natural semantics given by the composer, are also important, since they carry a higher level of abstraction of meaning beyond physical oscillation. Our second contribution therefore focuses on recovering note information from the music signal using musical knowledge.
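To illustrate the kind of matrix reformulation discussed above, the k-means assignment step can be expressed as one dense distance computation instead of nested loops; this is a minimal NumPy sketch (function and variable names are illustrative, not taken from the thesis), and the same pattern maps naturally onto multi-core BLAS or GPU kernels:

```python
import numpy as np

def kmeans_assign(X, C):
    """Vectorized k-means assignment step.

    Squared Euclidean distances between all n points and all k centroids
    are computed at once via the expansion
        ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2,
    so the per-point inner loop becomes dense matrix algebra.
    """
    x_sq = np.sum(X ** 2, axis=1, keepdims=True)  # (n, 1)
    c_sq = np.sum(C ** 2, axis=1)                 # (k,)
    d2 = x_sq - 2.0 * (X @ C.T) + c_sq            # (n, k) distance matrix
    return np.argmin(d2, axis=1)                  # nearest centroid per point

def kmeans_update(X, labels, k):
    """Recompute each centroid as the mean of its assigned points."""
    return np.vstack([X[labels == j].mean(axis=0) for j in range(k)])
```

The update step is also a per-cluster reduction, which is why the same two functions translate directly into the map (assignment) and reduce (centroid averaging) phases of a Hadoop or Spark job.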
This contribution relies on two levels of musical knowledge: instrument note sounds and note co-occurrence/transition statistics. At the instrument note sound level, a note dictionary is first built from Logic Pro 9. With this musical dictionary in hand, we propose a positive-constraint matching pursuit (PCMP) algorithm to perform the decomposition. At the inter-note level, we propose a two-stage sparse decomposition approach integrated with note statistical information. In the frame-level decomposition stage, note co-occurrence probabilities are embedded to guide atom selection and to build a sparse multiple-candidate graph that provides backup choices for later selection. In the global optimal path searching stage, note transition probabilities are incorporated. Experiments on multiple data sets show that our proposed approaches outperform the state of the art in terms of accuracy and recall for note recovery and music mood/genre classification.
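The idea behind a positivity-constrained matching pursuit can be sketched as follows. This is a minimal toy implementation assuming unit-energy spectral atoms, not the exact PCMP algorithm of the thesis: at each greedy step, only atoms with a positive correlation to the residual are eligible, so every coefficient stays nonnegative, as befits note spectra whose energies can only add:

```python
import numpy as np

def positive_mp(signal, dictionary, n_atoms=3, tol=1e-8):
    """Greedy matching pursuit restricted to nonnegative coefficients.

    dictionary: (d, m) matrix whose columns are note-spectrum atoms.
    Returns the sparse coefficient vector and the final residual.
    """
    residual = signal.astype(float).copy()
    coeffs = np.zeros(dictionary.shape[1])
    for _ in range(n_atoms):
        corr = dictionary.T @ residual          # correlation with each atom
        j = int(np.argmax(corr))                # best positively correlated atom
        if corr[j] <= tol:                      # no positive match left: stop
            break
        step = corr[j] / (dictionary[:, j] @ dictionary[:, j])
        coeffs[j] += step                       # coefficient can only grow
        residual -= step * dictionary[:, j]     # subtract explained energy
    return coeffs, residual
```

In the full system, the candidates surviving each frame's decomposition would then feed the co-occurrence-weighted candidate graph, over which the transition-probability path search selects the final note sequence.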
