Probabilistic nod generation model based on speech and estimated utterance categories

ABSTRACT We proposed and evaluated a probabilistic model that generates nod motions based on utterance categories estimated from the speech input. The model comprises two main blocks. In the first block, dialog act-related categories are estimated from the input speech. Considering the correlations between dialog acts and head motions, the utterances are classified into three categories with distinct nod distributions. Linguistic information extracted from the input speech is fed to a cluster of classifiers, whose outputs are combined to estimate the utterance category. In the second block, nod motion parameters are generated based on the categories estimated by the classifiers. The nod motion parameters are represented as probability distribution functions (PDFs) inferred from human motion data. Using speech energy features, the parameters are sampled from the PDFs belonging to the estimated category. The effectiveness of the proposed model was evaluated in subjective experiments using an android robot. The experimental results indicated that the motions generated by the proposed approach were perceived as more natural than those of a previous model that used fixed nod shapes and hand-labeled utterance categories.
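The second block described above (sampling nod parameters from category-specific PDFs, modulated by speech energy) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the category names, the Gaussian form of the PDFs, the numeric parameters, the normalized energy in [0, 1], and the rule that energy scales the mean amplitude.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NodPdf:
    """Hypothetical per-category nod distribution (illustrative values only)."""
    nod_probability: float      # chance that a nod is generated at all
    amplitude_mean_deg: float   # mean nod amplitude in degrees
    amplitude_std_deg: float    # standard deviation of the amplitude


# Three utterance categories with distinct nod distributions.
# Names and numbers are placeholders, not values from the paper.
CATEGORY_PDFS = {
    "frequent_nods": NodPdf(0.8, 8.0, 2.0),
    "occasional_nods": NodPdf(0.4, 5.0, 1.5),
    "rare_nods": NodPdf(0.1, 3.0, 1.0),
}


def sample_nod(category: str, energy: float, rng: random.Random) -> Optional[float]:
    """Return a nod amplitude in degrees, or None when no nod is generated.

    `energy` is assumed to be a speech energy feature normalized to [0, 1].
    """
    if not 0.0 <= energy <= 1.0:
        raise ValueError("energy must be normalized to [0, 1]")
    pdf = CATEGORY_PDFS[category]
    # Decide whether this utterance produces a nod.
    if rng.random() >= pdf.nod_probability:
        return None
    # Assumption: louder speech shifts the mean amplitude upward.
    mean = pdf.amplitude_mean_deg * (0.5 + energy)
    # Clamp at zero so a sampled amplitude is never negative.
    return max(0.0, rng.gauss(mean, pdf.amplitude_std_deg))


if __name__ == "__main__":
    rng = random.Random(0)
    for name in CATEGORY_PDFS:
        print(name, sample_nod(name, energy=0.7, rng=rng))
```

In a full pipeline, the `category` argument would come from the first block (the combined classifiers), and the sampled amplitude would be one of several nod parameters passed to the robot's motion controller.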
