Towards link characterization from content

In processing large volumes of speech and language data, we are often interested in the distribution of languages, speakers, topics, and similar classes. For large data sets, these distributions are typically estimated at a given point in time by counting the outputs of a pattern classifier. Such estimates can be highly biased, especially for rare classes. While these biases have been addressed in some applications, they have thus far been ignored in the speech and language literature, and this neglect causes significant error for low-frequency classes. Correcting the biased distribution involves exploiting uncertain knowledge of the classifier's error patterns. The Metropolis-Hastings algorithm allows us to construct a Bayes estimator for the true class proportions. We experimentally evaluate this algorithm on a speaker recognition task. In this experiment, the Bayes estimator reduces the maximum root-mean-square error (RMSE) by a factor of five. Performance is also more consistent, with the range of RMSE reduced by a factor of four.
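The correction described above can be illustrated with a minimal sketch. It is not the authors' implementation, and it makes simplifying assumptions that the abstract does not state: the classifier's confusion matrix is treated as known exactly (the abstract speaks of uncertain knowledge), the prior over class proportions is uniform on the simplex, and the Bayes estimator is taken to be the posterior mean, which is the Bayes estimator under squared-error loss. The names `log_likelihood` and `posterior_mean_proportions` are hypothetical.

```python
import math
import random


def log_likelihood(p, confusion, counts):
    """Multinomial log-likelihood of the observed classifier label counts,
    given true class proportions p and confusion[i][j] = P(label j | true i)."""
    total = 0.0
    for j, n_j in enumerate(counts):
        # Probability that the classifier emits label j under proportions p.
        q_j = sum(p[i] * confusion[i][j] for i in range(len(p)))
        if q_j <= 0.0:
            if n_j > 0:
                return -math.inf
            continue
        total += n_j * math.log(q_j)
    return total


def posterior_mean_proportions(counts, confusion, n_samples=20000,
                               burn_in=2000, step=0.02, seed=0):
    """Random-walk Metropolis sampler over the probability simplex.

    Assumes a uniform prior and an exactly known confusion matrix, so the
    acceptance ratio reduces to a likelihood ratio. Returns the posterior
    mean of the true class proportions.
    """
    rng = random.Random(seed)
    k = len(counts)
    p = [1.0 / k] * k
    ll = log_likelihood(p, confusion, counts)
    sums = [0.0] * k
    for t in range(n_samples + burn_in):
        # Symmetric proposal: shift a small mass from one class to another.
        i, j = rng.sample(range(k), 2)
        delta = rng.uniform(-step, step)
        cand = list(p)
        cand[i] += delta
        cand[j] -= delta
        # Proposals that leave the simplex are rejected outright.
        if cand[i] >= 0.0 and cand[j] >= 0.0:
            cand_ll = log_likelihood(cand, confusion, counts)
            if rng.random() < math.exp(min(0.0, cand_ll - ll)):
                p, ll = cand, cand_ll
        if t >= burn_in:
            for c in range(k):
                sums[c] += p[c]
    return [s / n_samples for s in sums]


if __name__ == "__main__":
    # Illustrative numbers only: a two-class problem where the classifier
    # errs 10% of the time in each direction.
    confusion = [[0.9, 0.1], [0.1, 0.9]]
    counts = [8840, 1160]
    naive = [n / sum(counts) for n in counts]
    print("naive proportions:    ", naive)
    print("posterior-mean estimate:", posterior_mean_proportions(counts, confusion))
```

In the illustrative run, the label counts are what a classifier with a 10% error rate in each direction would produce on average when the rare class makes up 2% of the data (0.02 × 0.9 + 0.98 × 0.1 = 0.116). Counting labels directly therefore overstates the rare class almost sixfold, while the sampler's estimate accounts for the error pattern. Handling an uncertain confusion matrix, as the abstract describes, would require sampling its entries as additional parameters.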
