Latent Mixture of Discriminative Experts

In this paper, we introduce a new model called Latent Mixture of Discriminative Experts (LMDE), which automatically learns the temporal relationship between different modalities. Because we train a separate expert for each modality, LMDE can improve prediction performance even with a limited amount of data. For model interpretation, we present a sparse feature ranking algorithm that exploits L1 regularization. We provide an empirical evaluation on the task of listener backchannel prediction (i.e., head nods). We also introduce a new evaluation metric, User-adaptive Prediction Accuracy, which takes into account differences in people's backchannel responses. Our results confirm the importance of combining five types of multimodal features: lexical, syntactic structure, part-of-speech, visual, and prosodic. The Latent Mixture of Discriminative Experts model outperforms previous approaches.
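To make the overall idea concrete, the sketch below illustrates the general recipe the abstract describes: one discriminative expert per modality, a combiner over the experts' outputs, and L1 regularization used to obtain a sparse ranking of which modalities contribute. This is a minimal illustration under stated assumptions, not the LMDE implementation itself: it uses per-frame logistic-regression experts and synthetic data, and it omits the latent-variable sequence modeling that LMDE relies on. All names and dimensions are hypothetical.

```python
# Minimal sketch (NOT the authors' implementation): per-modality discriminative
# experts combined by an L1-regularized gate, whose sparse weights give a
# rough feature/modality ranking. Synthetic data; dimensions are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_frames = 500

# Hypothetical per-modality feature streams (lexical, syntactic, POS, visual, prosody).
modalities = {
    "lexical": rng.normal(size=(n_frames, 20)),
    "syntax": rng.normal(size=(n_frames, 12)),
    "pos": rng.normal(size=(n_frames, 10)),
    "visual": rng.normal(size=(n_frames, 5)),
    "prosody": rng.normal(size=(n_frames, 8)),
}
y = rng.integers(0, 2, size=n_frames)  # 1 = listener backchannel (e.g., head nod)

# 1) Train one discriminative expert per modality.
experts = {name: LogisticRegression(max_iter=1000).fit(X, y)
           for name, X in modalities.items()}

# 2) Stack each expert's backchannel probability as input to the combiner.
Z = np.column_stack([experts[name].predict_proba(X)[:, 1]
                     for name, X in modalities.items()])

# 3) L1-regularized combiner: sparsity zeroes out uninformative experts.
combiner = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(Z, y)

# 4) Rank modalities by the magnitude of their combiner weights.
ranking = sorted(zip(modalities, np.abs(combiner.coef_[0])),
                 key=lambda kv: kv[1], reverse=True)
print(ranking)
```

In the actual model, the per-modality experts are sequence models and the combination is mediated by latent variables that capture temporal structure; the L1 penalty plays the same role as here, driving uninformative weights to zero so the surviving weights yield an interpretable feature ranking.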
