A Multimodal Approach to Improve the Performance Evaluation of Call Center Agents

This paper proposes three modeling techniques to improve the performance evaluation of call center agents. The first technique applies speech processing with an attention layer to the agent's recorded calls; 65 acoustic features are extracted with the openSMILE toolkit to determine the context of each call. The second technique replaces the Softmax function in the attention layer with a Max Weights Similarity (MWS) function to improve classification accuracy: MWS fine-tunes the output of the attention layer for text processing by measuring how close the attention layer's input weights are to the weights of the max vectors. The third technique combines the agent's recorded speech with the corresponding transcribed text for binary classification. Both the speech and text models are built from combinations of Convolutional Neural Networks (CNNs) and Bi-directional Long Short-Term Memory networks (BiLSTMs). The paper reports classification results for each unimodal model (text and speech) and compares them with the multimodal approach, which improved accuracy by 0.22% over the acoustic model and by 1.7% over the text model.
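The abstract does not name the exact openSMILE configuration; one plausible match for the "65 features" is the ComParE 2016 low-level-descriptor set, which the openSMILE Python wrapper exposes directly. A minimal sketch under that assumption, with agent_call.wav as a placeholder file name:

```python
import opensmile

# Frame-level low-level descriptors from the ComParE 2016 set
# (65 features per frame) -- an assumed match for the paper's setup.
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,
    feature_level=opensmile.FeatureLevel.LowLevelDescriptors,
)

# Returns a pandas DataFrame: one row per analysis frame, 65 columns.
features = smile.process_file("agent_call.wav")
print(features.shape)
```

Frame-level descriptors like these preserve the temporal structure of the call, which is what a downstream CNN/BiLSTM stack consumes.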
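The abstract's description of MWS is terse, so the following is only one plausible reading, not the paper's exact formula: score each attention position by its similarity (here, an assumed inverse-distance form) to the maximum score, then normalize so the weights sum to one, as Softmax output would.

```python
import numpy as np

def max_weights_similarity(scores: np.ndarray) -> np.ndarray:
    """Sketch of one interpretation of MWS attention weighting.

    scores: raw attention scores, shape (..., seq_len).
    Returns normalized weights of the same shape.
    """
    # The "max vector": the largest score along the attention axis.
    m = scores.max(axis=-1, keepdims=True)
    # Similarity decays with distance from the max score (assumed form).
    sim = 1.0 / (1.0 + np.abs(scores - m))
    # Normalize so weights sum to 1, matching Softmax's output range.
    return sim / sim.sum(axis=-1, keepdims=True)

print(max_weights_similarity(np.array([[2.0, 1.0, 1.8]])))
```

Compared with Softmax, a distance-to-max weighting of this kind is insensitive to the absolute scale of the scores and concentrates mass on positions near the strongest one, which is consistent with the abstract's stated goal of sharpening the attention output.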
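To make the multimodal setup concrete, below is a minimal Keras sketch of feature-level fusion: a CNN-plus-BiLSTM branch per modality, concatenated before a sigmoid output for the binary label. Input shapes (300 acoustic frames x 65 LLDs; 100 tokens x 300-dimensional GloVe-style embeddings), layer sizes, and the late-fusion choice are assumptions for illustration, not the paper's exact architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

speech_in = layers.Input(shape=(300, 65), name="speech")  # frames x acoustic features
text_in = layers.Input(shape=(100, 300), name="text")     # tokens x embedding dims

def cnn_bilstm(x):
    # Local pattern extraction, then bidirectional temporal modeling.
    x = layers.Conv1D(64, kernel_size=5, padding="same", activation="relu")(x)
    x = layers.MaxPooling1D(pool_size=2)(x)
    return layers.Bidirectional(layers.LSTM(64))(x)

# Late fusion: concatenate the per-modality representations.
fused = layers.concatenate([cnn_bilstm(speech_in), cnn_bilstm(text_in)])
out = layers.Dense(1, activation="sigmoid", name="productivity")(fused)

model = Model([speech_in, text_in], out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```

Training would then call model.fit with paired speech-feature and text-embedding tensors per call; the unimodal baselines in the abstract correspond to running either branch alone.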
