Paper Information - Automatic Topic Identification for Large Scale Language Modeling Data Filtering

The paper presents a topic identification module embedded in a larger system for acquiring and storing large volumes of text data from the Web. The module processes each acquired data item and assigns it keywords from a predefined topic hierarchy, which was developed for this purpose and is also described in the paper. The quality of the topic identification is evaluated in two ways: directly, using classic precision-recall measures, and indirectly, by measuring the ASR performance of topic-specific language models built from the automatically filtered data.
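
As a rough illustration of the direct evaluation mentioned in the abstract, the sketch below computes micro-averaged precision and recall for multi-label keyword assignment. The abstract does not say which averaging scheme the authors used, so the micro-averaging, the function name `micro_precision_recall`, and the toy article IDs and topic labels are all assumptions made for this example.

```python
from typing import Dict, Set, Tuple


def micro_precision_recall(
    gold: Dict[str, Set[str]],
    predicted: Dict[str, Set[str]],
) -> Tuple[float, float]:
    """Micro-averaged precision and recall over all (document, topic) pairs.

    Only documents present in `gold` are scored; predictions for documents
    without a gold annotation are ignored (an assumption of this sketch).
    """
    tp = fp = fn = 0
    for doc_id, gold_topics in gold.items():
        pred_topics = predicted.get(doc_id, set())
        tp += len(gold_topics & pred_topics)   # correctly assigned topics
        fp += len(pred_topics - gold_topics)   # assigned but not in gold
        fn += len(gold_topics - pred_topics)   # in gold but not assigned
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical toy data: two articles with manually assigned topics.
gold = {"a1": {"sport", "football"}, "a2": {"politics"}}
predicted = {"a1": {"sport", "hockey"}, "a2": {"politics", "economy"}}

precision, recall = micro_precision_recall(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Micro-averaging pools every topic assignment into one count, so frequent topics dominate the score; a per-topic (macro) average would weight rare topics equally and may be preferable when the hierarchy is unbalanced.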

Pavel Ircing | Ales Prazák | Lucie Skorkovská | Jan Lehecka

[1] Ludek Müller, et al. Robust Statistic Estimates for Adaptation in the Task of Speech Recognition, 2010, TSD.

[2] Jakub Kanis, et al. Comparison of Different Lemmatization Approaches through the Means of Information Retrieval Performance, 2010, TSD.

[3] Jan Vanek, et al. Gender-Dependent Acoustic Models Fusion Developed for Automatic Subtitling of Parliament Meetings Broadcasted by the Czech TV, 2010, TSD.

[4] Ludek Müller, et al. Four-phase re-speaker training system, 2011, Proceedings of the International Conference on Signal Processing and Multimedia Applications.

[5] William J. Byrne, et al. Large vocabulary ASR for spontaneous Czech in the MALACH project, 2003, INTERSPEECH.

[6] Céline Rouveirol, et al. Machine Learning: ECML-98, 1998, Lecture Notes in Computer Science.

[7] Hinrich Schütze, et al. Introduction to information retrieval, 2008.

[8] Andreas Stolcke, et al. SRILM - an extensible language modeling toolkit, 2002, INTERSPEECH.

[9] Daniel Soutner, et al. Web Text Data Mining for Building Large Scale Language Modelling Corpus, 2011, TSD.

[10] Thorsten Joachims, et al. Text Categorization with Support Vector Machines: Learning with Many Relevant Features, 1998, ECML.