Semantic data mining of short utterances

This paper introduces a methodology for speech data mining, along with the tools that the methodology requires. We show how the methodology and tools increase the productivity of an analyst who seeks relationships among the contents of multiple utterances and must ultimately turn newly discovered context into testable hypotheses about new information. In its simplest form, text data mining can be extended to speech data mining by applying text tools to the output of a speech recognizer, but we have found that this approach is not optimal. We show how data mining techniques that are typically applied to text should be modified to enable an analyst to do effective semantic data mining on a large collection of short speech utterances. For the purposes of this paper, we examine semantic data mining in the context of semantic parsing and analysis in a specific situation: solving a business problem that is known to the analyst. We do not attempt a generic semantic analysis of a collection of speech. Our tools and methods allow the analyst to mine the speech data to discover the semantics that best cover the desired solution. This coverage yields a set of Natural Language Understanding (NLU) classifiers that serve as testable hypotheses.
