Paper information - Categorization of Unorganized Text Corpora for better Domain-Specific Language Modeling

This paper describes how unorganized text data gathered from the Internet is categorized into in-domain and out-of-domain data for better domain-specific language modeling and speech recognition. An algorithm for text categorization and topic detection based on the most frequent key phrases is presented. In this scheme, each document entering the categorization process is represented in a vector space model, with term weights computed from term frequency and inverse document frequency (TF-IDF). Documents are then classified automatically as in-domain or out-of-domain by comparing them to the list of key phrases with one of the selected distance/similarity measures and applying a predefined threshold. Experimental results for language modeling and adaptation to the judicial domain show a significant improvement in model perplexity of about 19% and a relative decrease of about 5.54% in the word error rate of the Slovak transcription and dictation system.
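The categorization scheme summarized in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the whitespace tokenizer, the single-word key terms (the paper works with key phrases), the choice of cosine similarity as the measure, the threshold value 0.2, and all function names are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization stands in for real preprocessing.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one {term: tf * idf} dict per tokenized document."""
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def categorize(texts, key_terms, threshold=0.2):
    """Label each text by its similarity to the key-term list."""
    docs = [tokenize(t) for t in texts]
    # Assumption: the key-term list is modeled as a binary vector.
    key_vec = {t: 1.0 for t in key_terms}
    return [
        "in-domain" if cosine(vec, key_vec) >= threshold else "out-of-domain"
        for vec in tfidf_vectors(docs)
    ]
```

Calling `categorize(texts, key_terms)` with a handful of documents and a judicial key-term list such as `["court", "appeal", "verdict", "judge"]` returns one label per document. In practice the threshold and the similarity measure would need tuning on held-out data, since the abstract states only that a predefined threshold and one of several measures are used.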
