Estimating document frequencies in a speech corpus

Inverse Document Frequency (IDF) is an important quantity in many applications, including Information Retrieval. IDF is defined in terms of document frequency, df(w), the number of documents that mention w at least once. This quantity is relatively easy to compute over textual documents, but spoken documents are more challenging because the words must first be recovered by automatic speech recognition (ASR), which makes errors. This paper considers two baselines: (1) an estimate based on the 1-best ASR output and (2) an estimate based on expected term frequencies computed from the lattice. We improve over these baselines by taking advantage of repetition: whatever the document is about is likely to be repeated, unlike ASR errors, which tend to be more random (Poisson). In addition, we find it helpful to consider an ensemble of language models. The ensemble has an opportunity to reduce noise, assuming that the errors across language models are relatively uncorrelated, and the opportunity for improvement is larger when the word error rate (WER) is high. This paper considers a pairing task that could benefit from improved estimates of df: the task takes conversational sides from the English Fisher corpus as input and outputs estimates of which sides came from the same conversation. Better estimates of df lead to better performance on this task.
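As a rough illustration of the quantities involved, the sketch below computes df(w) from 1-best transcripts and a soft df estimate from expected term frequencies. It is a minimal sketch under stated assumptions, not the paper's method: the function names are hypothetical, the IDF form -log2(df/N) is the common textbook definition, and the rule of letting each document contribute min(1, E[tf]) is an assumption made only for this example, since the abstract does not say how lattice counts are turned into df.

```python
import math
from collections import Counter


def df_from_one_best(transcripts):
    """df(w): number of transcripts that contain w at least once.

    `transcripts` is a list of token lists, one per document
    (for example, the 1-best ASR output of each conversation side).
    """
    df = Counter()
    for words in transcripts:
        # set() ensures a word is counted once per document.
        df.update(set(words))
    return df


def df_from_expected_tf(expected_tfs):
    """Soft df estimate from per-document expected term frequencies.

    ASSUMPTION (illustrative only): each document contributes
    min(1, E[tf]) to df(w). The abstract does not specify this rule.
    `expected_tfs` is a list of dicts mapping word -> expected count.
    """
    df = Counter()
    for etf in expected_tfs:
        for word, count in etf.items():
            df[word] += min(1.0, count)
    return df


def idf(df, n_docs):
    """Textbook IDF, -log2(df(w) / N), for words with df(w) > 0."""
    return {w: -math.log2(c / n_docs) for w, c in df.items() if c > 0}


if __name__ == "__main__":
    docs = [["the", "cat", "the"], ["the", "dog"], ["a", "cat"]]
    df = df_from_one_best(docs)
    print(df["the"], df["dog"])
    print(idf(df, len(docs))["dog"])
```

Words that occur in few documents receive a high IDF, so an ASR error that inserts a rare word into one transcript can distort the estimate noticeably; this is the noise that the repetition and ensemble ideas in the abstract aim to reduce.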
