A Large Margin Algorithm for Speech-to-Phoneme and Music-to-Score Alignment

We describe and analyze a discriminative algorithm for learning to align an audio signal with a given sequence of events that tag the signal. We demonstrate the applicability of our method for the tasks of speech-to-phoneme alignment ("forced alignment") and music-to-score alignment. In the first alignment task, the events that tag the speech signal are phonemes, while in the music alignment task the events are musical notes. Our goal is to learn an alignment function whose input is an audio signal along with its accompanying event sequence and whose output is a timing sequence representing the actual start time of each event in the audio signal. Generalizing the notion of separation with a margin used in support vector machines for binary classification, we cast the learning task as the problem of finding a vector in an abstract inner-product space. To do so, we devise a mapping of the input signal and the event sequence along with any possible timing sequence into an abstract vector space. Each possible timing sequence therefore corresponds to an instance vector, and the predicted timing sequence is the one whose projection onto the learned prediction vector is maximal. We set the prediction vector to be the solution of a minimization problem with a large set of constraints. Each constraint enforces a gap between the projection of the correct target timing sequence and the projection of an alternative, incorrect timing sequence onto the vector. Though the number of constraints is very large, we describe a simple iterative algorithm for efficiently learning the vector and analyze the formal properties of the resulting learning algorithm. We report experimental results comparing the proposed algorithm to previous studies on speech-to-phoneme and music-to-score alignment, which use hidden Markov models. The results obtained in our experiments using the discriminative alignment algorithm are comparable to results of state-of-the-art systems.
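The prediction rule and the margin-based iterative update described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the feature map `phi`, the Hamming cost, and the toy signal are hypothetical stand-ins (the actual features are acoustic), and the update shown is a passive-aggressive-style step that enforces a cost-scaled margin between the correct timing sequence and the worst alternative, consistent with the constraints the abstract describes.

```python
import itertools
import numpy as np

def phi(signal, events, timing):
    """Toy joint feature map for (signal, events, timing).

    Stand-in for the paper's acoustic/duration features: one
    signal-event interaction term plus two structural features."""
    interaction = sum(signal[t] * e for e, t in zip(events, timing))
    return np.array([interaction, float(len(events)),
                     float(timing[-1] - timing[0])])

def predict(w, signal, events, candidates):
    """Predicted timing = the candidate whose feature-vector
    projection onto the learned vector w is maximal."""
    return max(candidates, key=lambda t: float(w @ phi(signal, events, t)))

def hamming(a, b):
    """Illustrative alignment cost: number of mismatched start times."""
    return sum(x != y for x, y in zip(a, b))

def pa_update(w, signal, events, y_true, candidates):
    """One passive-aggressive-style step: find the most violating
    alternative timing and push w to restore a cost-sized margin."""
    y_hat = max(candidates,
                key=lambda t: float(w @ phi(signal, events, t))
                + hamming(y_true, t))
    delta = phi(signal, events, y_true) - phi(signal, events, y_hat)
    loss = hamming(y_true, y_hat) - float(w @ delta)
    if loss > 0 and float(delta @ delta) > 0:
        w = w + (loss / float(delta @ delta)) * delta
    return w

# Toy example: two events should start where the signal peaks.
signal = np.array([0.0, 5.0, 0.0, 5.0, 0.0])
events = [1, 1]
candidates = list(itertools.combinations(range(len(signal)), 2))
w = np.zeros(3)
for _ in range(5):
    w = pa_update(w, signal, events, (1, 3), candidates)
print(predict(w, signal, events, candidates))  # prints (1, 3)
```

In practice the candidate set of timing sequences is exponentially large, so the argmax in both `predict` and the update is computed with dynamic programming rather than enumeration; the toy exhaustive search above only illustrates the projection-maximization idea.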
