A LDA-Based Topic Classification Approach from Highly Imperfect Automatic Transcriptions

Although current transcription systems can achieve high recognition performance, they still struggle to transcribe speech recorded in very noisy environments. Transcription quality has a direct impact on classification tasks that use text features. In this paper, we propose to identify the themes of telephone service conversations with the classical Term Frequency-Inverse Document Frequency method using the Gini purity criterion (TF-IDF-Gini) and with a Latent Dirichlet Allocation (LDA) approach. Both approaches are coupled with a Support Vector Machine (SVM) classifier to solve the theme identification problem. Results show the effectiveness of the proposed LDA-based method compared to the classical TF-IDF-Gini approach in the context of highly imperfect automatic transcriptions. Finally, we discuss the impact of the discriminative and non-discriminative words extracted by both methods in terms of transcription accuracy.
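To make the TF-IDF-Gini idea concrete, here is a minimal sketch in plain Python. The abstract does not give the formula, so the definition used here is an assumption: each word is scored as the product of its relative term frequency, its inverse document frequency, and a Gini purity term computed as the sum over themes of P(theme | word) squared. The function name `tfidf_gini` and the toy data are hypothetical and serve only as illustration.

```python
import math
from collections import Counter, defaultdict


def tfidf_gini(docs, labels):
    """Score each word as tf * idf * gini (assumed definition, see lead-in).

    docs   : list of token lists, one per conversation
    labels : list of theme labels, aligned with docs
    """
    n_docs = len(docs)
    tf = Counter()                     # corpus-wide count of each word
    df = Counter()                     # number of documents containing each word
    per_theme = defaultdict(Counter)   # word -> theme -> count

    for tokens, theme in zip(docs, labels):
        tf.update(tokens)
        df.update(set(tokens))
        for word in tokens:
            per_theme[word][theme] += 1

    total_tokens = sum(tf.values())
    scores = {}
    for word, count in tf.items():
        idf = math.log(n_docs / df[word])
        # Gini purity: 1.0 when the word occurs in a single theme,
        # lower when its occurrences are spread across themes.
        gini = sum((c / count) ** 2 for c in per_theme[word].values())
        scores[word] = (count / total_tokens) * idf * gini
    return scores


# Toy usage with made-up conversations and themes.
docs = [
    ["hello", "card", "lost"],
    ["hello", "bus", "late"],
    ["hello", "card", "bus"],
]
labels = ["cards", "traffic", "cards"]
scores = tfidf_gini(docs, labels)
top_words = sorted(scores, key=scores.get, reverse=True)
```

Under this assumed definition, a word that appears in every conversation gets a score of zero through the IDF term, and a word confined to one theme outranks an equally frequent word shared by several themes. The highest-scoring words could then serve as the feature vocabulary for an SVM classifier. The LDA-based alternative would replace these word scores with per-conversation topic distributions.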
