Classification of heterogeneous text data for robust domain-specific language modeling

The robustness of n-gram language models depends on the quality of the text data on which they are trained. Text corpora collected from heterogeneous sources such as web pages or electronic documents cover many different topics. To build efficient and robust domain-specific language models, the domain-oriented segments must be separated from this large body of text; the remaining out-of-domain data are then used only to update the existing in-domain n-gram probability estimates. In this paper, we describe the classification of heterogeneous text data into two classes, in-domain and out-of-domain, primarily for language modeling in task-oriented speech recognition in the judicial domain. The proposed text classification algorithm first detects the theme of short text segments from their most frequent key phrases. In the next step, each text segment is represented in the vector space model as a feature vector with term weighting. The segments are then assigned to the in-domain or out-of-domain class using document similarity with automatic thresholding. Experimental results for modeling the Slovak language and adapting the models to the judicial domain show a significant improvement in model perplexity and improved performance of the Slovak transcription and dictation system.
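
The vector-space step and the similarity-based split described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes whitespace tokenization, a smoothed TF-IDF weighting, cosine similarity against a single hypothetical in-domain seed text, and a threshold set to the mean similarity as a simple stand-in for the automatic thresholding mentioned in the abstract. The key-phrase theme detection step is omitted, and all function names are illustrative.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, no morphological analysis.
    return text.lower().split()


def tfidf_vectors(texts):
    # Represent each text as a sparse {term: weight} feature vector.
    counts = [Counter(tokenize(t)) for t in texts]
    n = len(counts)
    df = Counter(term for c in counts for term in c)
    # Smoothed inverse document frequency (one common variant, assumed here).
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]


def cosine(u, v):
    # Cosine similarity between two sparse vectors.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify(seed, segments, threshold=None):
    # Weight the seed and the segments together so they share one IDF table.
    vectors = tfidf_vectors([seed] + segments)
    seed_vec, seg_vecs = vectors[0], vectors[1:]
    sims = [cosine(v, seed_vec) for v in seg_vecs]
    if threshold is None:
        # Placeholder rule (assumption): use the mean similarity as threshold.
        threshold = sum(sims) / len(sims)
    labels = ["in-domain" if s >= threshold else "out-of-domain" for s in sims]
    return labels, sims


if __name__ == "__main__":
    seed = "the court ruled that the defendant must pay damages"
    segments = [
        "the court sentenced the defendant",
        "football team won the match yesterday",
        "the judge and the court dismissed the appeal of the defendant",
    ]
    labels, sims = classify(seed, segments)
    for segment, label, sim in zip(segments, labels, sims):
        print(f"{label:14s} {sim:.3f}  {segment}")
```

In practice the seed would be a larger in-domain corpus or centroid rather than a single sentence, and the segments labeled in-domain would be used to train the domain-specific n-gram model.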
