Modeling Timing Structure in Multimedia Signals

Modeling and describing the temporal structure of multimedia signals captured simultaneously by multiple sensors is important for realizing human-machine interaction and motion generation. This paper proposes a method for modeling the temporal structure of multimedia signals based on the temporal intervals of primitive signal patterns. Using the differences between the beginning points and between the ending points of these intervals, we can explicitly express timing structure, that is, synchronization and mutual dependency among media signals. We applied the model to generating a video signal from an audio signal to verify its effectiveness.
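To make the timing-structure idea concrete, below is a minimal Python sketch. It assumes each media signal has already been segmented into labeled intervals; the segmentation step, the interval labels, and all function and variable names are illustrative assumptions, not the paper's implementation. For every pair of temporally overlapping intervals drawn from two signals, it computes the difference between their beginning points and the difference between their ending points, which are the quantities the abstract uses to describe synchronization and mutual dependency.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Interval:
    """A primitive signal pattern occupying [begin, end) within one media signal."""
    begin: float   # onset time in seconds
    end: float     # offset time in seconds
    label: str     # identifier of the primitive pattern (e.g., a mode or state)

def timing_differences(signal_a: List[Interval],
                       signal_b: List[Interval]) -> List[Tuple[str, str, float, float]]:
    """For each pair of temporally overlapping intervals (one from each signal),
    return (label_a, label_b, begin-point difference, end-point difference).

    Near-zero differences indicate tight synchronization between the two media;
    systematic offsets reveal how one signal leads or lags the other.
    """
    pairs = []
    for a in signal_a:
        for b in signal_b:
            if a.begin < b.end and b.begin < a.end:  # the intervals overlap in time
                pairs.append((a.label, b.label, b.begin - a.begin, b.end - a.end))
    return pairs

# Toy usage: lip-motion intervals vs. audio (phoneme-like) intervals.
lip = [Interval(0.00, 0.42, "open"), Interval(0.42, 0.80, "closed")]
audio = [Interval(0.05, 0.40, "/a/"), Interval(0.45, 0.78, "silence")]
for la, lb, d_begin, d_end in timing_differences(lip, audio):
    print(f"{la:>7s} vs {lb:>8s}: begin diff = {d_begin:+.2f}s, end diff = {d_end:+.2f}s")
```

In the paper's setting these begin/end differences would be modeled statistically across many interval pairs rather than simply tabulated; the sketch only shows which quantities the timing structure is built from.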
