Speech signals convey information from many sources, but not all of these sources are relevant to speaker identity. In fact, speech is affected by spurious events, artifacts (mouth breaths, lip clicks), and noise (channel and background). Such unwanted information sources are shared across speakers and do not help distinguish between them. Furthermore, in most cases training data are collected from different environments, and it is of great importance that such data convey relevant joint information. This paper discusses a method for removing unwanted information in order to build more robust speaker models. Two criteria are used to extract relevant information from the speech signal: the first, which we call the self-information criterion, extracts relevant information from data collected in a single environment; the second, called the joint information criterion, is used when the collected data come from different environments. Both criteria originate from information theory. Simulations on telephone speech demonstrate the high efficiency of the method.
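The abstract does not give the exact form of the self-information criterion, but the underlying information-theoretic idea (retaining "typical" frames, i.e. those whose self-information is close to the source entropy) can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: the function names, the assumption that frames have been vector-quantized to discrete symbols, and the tolerance parameter `tol` are all ours.

```python
import math
from collections import Counter

def self_information(frames):
    """Empirical self-information -log2 p(x) of each quantized frame,
    with p estimated from the frame sequence itself."""
    counts = Counter(frames)
    total = len(frames)
    return [-math.log2(counts[f] / total) for f in frames]

def select_typical(frames, tol=1.0):
    """Keep frames whose self-information lies within tol bits of the
    empirical entropy (the 'typical set' idea): frames far from the
    entropy rate are treated as spurious events or noise and dropped."""
    info = self_information(frames)
    entropy = sum(info) / len(info)  # average self-information = empirical entropy
    return [f for f, i in zip(frames, info) if abs(i - entropy) <= tol]
```

For example, with the symbol sequence `"aaaabbcd"` the empirical entropy is 1.75 bits per frame, so with `tol=1.0` the rare symbols `c` and `d` (3 bits each) are rejected while `a` and `b` are kept. A joint information criterion for multi-environment data would analogously score frames by their contribution to the mutual information between environments, but that requires density estimates the abstract does not specify.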