Speech and Music Classification and Separation: A Review

Abstract The classification and separation of speech and music signals have attracted the attention of many researchers. Classification is needed to build two separate libraries, a speech library and a music library, from a stream of sounds. Separation, in contrast, is needed in the cocktail-party problem, where speech must be separated from music and the undesired signal removed. This paper reviews and discusses the existing classification and separation algorithms. The classification algorithms are divided into three categories: time-domain, frequency-domain, and time-frequency domain approaches. The time-domain approaches used in the literature are the zero-crossing rate (ZCR), the short-time energy (STE), the ZCR and the STE with positive derivative, along with some of their modified versions, the variance of the roll-off, and neural networks. The frequency-domain approaches are mainly based on the spectral centroid, the variance of the spectral centroid, the spectral flux, the variance of the spectral flux, the roll-off of the spectrum, the cepstral residual, and the delta pitch. The time-frequency domain approaches have not yet been tested thoroughly in the literature, so the spectrogram and the evolutionary spectrum are introduced. Some new algorithms for music and speech separation and segregation are also presented.
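
As a rough illustration of three of the features named in the abstract (ZCR, STE, and the spectral centroid), the following Python sketch computes them frame by frame with NumPy. It uses common textbook definitions, which may differ in detail from the variants surveyed in the paper. The function names, frame length, hop size, and sampling rate are assumptions chosen for illustration, and no classification threshold is implied.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames (rows). Hypothetical helper."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n_frames)])

def zero_crossing_rate(frames):
    """Fraction of adjacent sample pairs in each frame whose signs differ."""
    signs = np.sign(frames)
    signs[signs == 0] = 1  # treat exact zeros as positive
    return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)

def short_time_energy(frames):
    """Mean squared amplitude of each frame."""
    return np.mean(frames ** 2, axis=1)

def spectral_centroid(frames, sr):
    """Magnitude-weighted mean frequency (Hz) of each frame's spectrum."""
    mag = np.abs(np.fft.rfft(frames, axis=1))
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / sr)
    return (mag @ freqs) / (mag.sum(axis=1) + 1e-12)  # small term avoids 0/0

# Usage on synthetic signals (assumed values, not data from the paper).
sr = 8000                                      # sampling rate in Hz
t = np.arange(sr) / sr                         # one second of samples
tone = np.sin(2 * np.pi * 1000 * t + 0.1)      # 1 kHz sinusoid
noise = np.random.default_rng(0).standard_normal(sr)

tone_frames = frame_signal(tone, frame_len=256, hop=128)
noise_frames = frame_signal(noise, frame_len=256, hop=128)

print("tone  ZCR:", zero_crossing_rate(tone_frames).mean())
print("noise ZCR:", zero_crossing_rate(noise_frames).mean())
print("tone  STE:", short_time_energy(tone_frames).mean())
print("tone  centroid (Hz):", spectral_centroid(tone_frames, sr).mean())
```

The variance-based features mentioned in the abstract could then be obtained by applying `np.var` to such per-frame sequences over a longer analysis window.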
