Deformable Spectrograms

Speech and other natural sounds show high temporal correlation and smooth spectral evolution punctuated by a few irregular, abrupt changes. In a conventional Hidden Markov Model (HMM), such structure is represented weakly and indirectly, through transitions between explicit states that represent 'steps' along these smooth changes. It would be more efficient and informative to model successive spectra as transformations of their immediate predecessors. We present a model that focuses on local deformations of adjacent bins in a time-frequency surface to explain an observed sound, using explicit representation only for those bins that cannot be predicted from their context. We further decompose the log-spectrum into two additive layers, which separately explain and model the evolution of the harmonic excitation and the formant filtering of speech and similar sounds. Smooth deformations are modeled with hidden transformation variables in both layers, using Markov random fields (MRFs) with overlapping subwindows as observations; inference is performed efficiently via loopy belief propagation. The model can fill in deleted time-frequency cells without any signal model, and an entire signal can be compactly represented with a few specific states along with the deformation maps for both layers. We discuss several possible applications for this new model, including source separation.
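
The idea of explaining each spectral frame as a local deformation of its predecessor can be illustrated with a minimal sketch. It is not the MRF model described here: it picks a frequency shift for each bin independently by greedy patch matching, with no smoothness prior, no second layer, and no belief propagation. The function names `best_shifts` and `predict`, the patch half-width `half`, and the shift range `max_shift` are all assumptions made for illustration.

```python
def _clamp(i, n):
    """Clamp an index to the valid range [0, n - 1] (edge replication)."""
    return min(max(i, 0), n - 1)


def best_shifts(prev, curr, half=1, max_shift=2):
    """Greedy per-bin deformation estimate (illustrative, not the MRF model).

    For each bin k of the current frame, choose the shift s that minimizes
    the squared error between the patch of `curr` centred on k and the
    patch of `prev` centred on k + s. Ties go to the smallest |s|.
    """
    n = len(curr)
    # Candidate shifts ordered by magnitude: 0, -1, 1, -2, 2, ...
    candidates = sorted(range(-max_shift, max_shift + 1), key=abs)
    shifts = []
    for k in range(n):
        best_s, best_err = 0, float("inf")
        for s in candidates:
            err = 0.0
            for d in range(-half, half + 1):
                i = _clamp(k + d, n)
                j = _clamp(k + d + s, n)
                err += (curr[i] - prev[j]) ** 2
            if err < best_err:
                best_s, best_err = s, err
        shifts.append(best_s)
    return shifts


def predict(prev, shifts):
    """Predict the current frame by reading `prev` through the shift map."""
    n = len(prev)
    return [prev[_clamp(k + s, n)] for k, s in enumerate(shifts)]


# Toy log-spectrum frames: a single peak that moves up by one bin.
prev_frame = [0, 0, 1, 5, 1, 0, 0, 0]
curr_frame = [0, 0, 0, 1, 5, 1, 0, 0]

shift_map = best_shifts(prev_frame, curr_frame)
predicted = predict(prev_frame, shift_map)
residual = [c - p for c, p in zip(curr_frame, predicted)]
```

In this toy case the shift map alone reproduces the current frame, so the residual is zero everywhere. Bins with a large residual would correspond to the cells that cannot be predicted from their context and therefore need explicit representation.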
