Paper information - Automatic Speaker Recognition and Characterization by means of Robust Vocal Source Features

Automatic Speaker Recognition is a wide research field that encompasses many topics: signal processing, human vocal and auditory physiology, statistical modelling, cognitive sciences, and so on. The study of these techniques started about thirty years ago and, since then, the improvement has been dramatic. Nonetheless, the field still poses open issues, and many active research centers around the world are working towards more reliable and better performing systems. This thesis documents a PhD project funded by the privately held company RT - Radio Trevisan Elettronica Industriale S.p.A. The title of the fellowship is "Automatic speaker recognition with applications to security and intelligence". Part of the work was carried out during a six-month visit to the Speech, Music and Hearing Department of the KTH Royal Institute of Technology, Stockholm.

Speaker Recognition research develops techniques to automatically associate a given human voice with a previously recorded version of it. Speaker Recognition is usually further divided into Speaker Identification and Speaker Verification: in the former, the identity of a voice has to be found among a (possibly large) number of speaker voices, while in the latter the system is provided with both a voice and a claimed identity, and the association has to be verified as a true/false statement. The recognition system also provides a confidence score for its results.

The first Part of the thesis reviews the state of the art of Speaker Recognition research. The main components of a recognition system are described: audio feature extraction, statistical modelling, and performance assessment. Over the years the research community has developed a number of Audio Features, used to describe the information carried by the vocal signal in a compact and deterministic way.
In every automatic recognition application, including speech and language recognition, feature extraction is the first step; its task is to reduce the size of the input data substantially without losing any important information. The choice of the best-suited features for a specific application, and their tuning, are crucial to obtaining satisfactory recognition results. Moreover, the definition of innovative features is a lively research direction, because it is generally recognized that existing features are still far from exploiting all the information carried by the vocal signal. Some audio features have proved over the years to perform better than others; some of them are described in Part I: Mel-Frequency Cepstral Coefficients and Linear Prediction Coefficients. More refined and experimental features are also introduced, and are explained in Part III. Statistical modelling is introduced, in particular by discussing the structure of Gaussian Mixture Models and their training through the EM algorithm; specific modelling techniques for recognition, such as the Universal Background Model, are described. Scoring is the last phase of a Speaker Recognition process and involves a number of normalizations, which compensate for different recording conditions or model issues. Part I goes on to present a number of audio databases that are commonly used in the literature as benchmarks to compare results or recognition systems, in particular TIMIT and the NIST Speaker Recognition Evaluation (SRE) 2004.

A recognition prototype system has been built during the PhD project, and it is detailed in Part II. The first Chapter describes the proposed application, which concerns intelligence and security. The application fulfils specific requirements of the Authorities when investigations involve phone wiretapping or environmental interceptions.
In these cases Authorities have to listen to a large number of recordings, most of which are not related to the investigations. The idea of the application is to detect and label speakers automatically, making it possible to search for a specific speaker throughout the recording collection. This saves time, with a resulting economic advantage. Many difficulties arise from the phone lines, which are known to degrade the speech signal and reduce recognition performance; the main issues are the narrow audio bandwidth, additive noise and convolutional noise, the last resulting in phase distortion. The second Chapter in Part II describes the developed Speaker Recognition system in detail; a number of design choices are discussed. During development, the research purpose of the system was a key consideration: considerable effort went into obtaining a system that performs well yet remains easy to modify in depth. The assessment of results on different databases posed further challenges, which have been solved with a unified interface to the databases. The fundamental components of a speaker recognition system have been developed, along with some speed-up improvements. Lastly, the whole software can run on a cluster computer without any reconfiguration, a crucial characteristic for assessing performance on large databases in a reasonable time.

During the three-year project, some work was carried out that is related to Speaker Recognition, although not directly involved with it. These developments are described in Part II as extensions of the prototype. First, a Voice Activity Detector (VAD) suitable for noisy recordings is explained. The first step of feature extraction is to find and select, from a given recording, only the segments containing voice; this is not a trivial task when the recording is noisy and a simple "energy threshold" approach fails.
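To make the "energy threshold" baseline concrete, the following is a minimal sketch of such a detector, not the VAD developed in the thesis. The function names, the 160-sample frame length, the 8 kHz toy signal and the -30 dB threshold are all illustrative assumptions. The second half of the example shows the failure mode mentioned above: once additive noise raises the floor above the fixed threshold, every frame is flagged as voice.

```python
import math
import random

def frame_energies_db(samples, frame_len=160):
    """Split the signal into non-overlapping frames and return each frame's mean power in dB."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        energies.append(10.0 * math.log10(power + 1e-12))  # small offset avoids log(0)
    return energies

def energy_vad(samples, frame_len=160, threshold_db=-30.0):
    """One boolean per frame: True when the frame power exceeds a fixed threshold (assumed value)."""
    return [e > threshold_db for e in frame_energies_db(samples, frame_len)]

# Toy signal (hypothetical): two near-silent frames followed by two frames of a 200 Hz tone at 8 kHz.
silence = [0.001 * math.sin(0.3 * n) for n in range(320)]
tone = [0.5 * math.sin(2 * math.pi * 200 * n / 8000) for n in range(320)]
clean = silence + tone
decisions = energy_vad(clean)

# Same signal with additive Gaussian noise whose power (about -20 dB) exceeds the threshold.
rng = random.Random(0)
noisy = [x + rng.gauss(0.0, 0.1) for x in clean]
noisy_decisions = energy_vad(noisy)

print(decisions)        # the clean case separates silence from tone
print(noisy_decisions)  # the noisy case marks every frame as voice
```

On the clean signal the fixed threshold is enough; on the noisy one it cannot distinguish noise from voice, which is the motivation for more robust features and an adaptive threshold.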
The developed VAD is based on advanced features, computed from Wavelet Transforms, which are further processed using an adaptive threshold. A second application developed is Speaker Diarization, which automatically segments an audio recording that contains different speakers. The outputs of the diarization are a segmentation and a speaker label for each segment, resulting in a "who speaks when" answer. The third and last collateral work is a Noise Reduction system for voice applications, developed on a hardware DSP. The noise reduction algorithm adaptively detects the noise and reduces it, keeping only the voice; it works in real time using only a small fraction of the DSP computing power.

Lastly, Part III discusses innovative audio features, which are the main novel contribution of this thesis. The features are obtained from the glottal flow; therefore the first Chapter in this Part describes the anatomy of the vocal folds and of the vocal tract. The working principle of the phonation apparatus is described and the importance of the physics of the vocal folds is pointed out. The glottal flow is an input air flow for the vocal tract, which acts as a filter; an open-source toolkit for the inversion of the vocal tract filter is introduced, which makes it possible to estimate the glottal flow from speech recordings. Some methods for characterizing the glottal flow numerically are described. In the subsequent Chapter, a definition of the novel glottal features is presented. The glottal flow estimates are not always reliable, so a first step detects and deletes unlikely flows. A numerical procedure then groups and sorts the flow estimates, preparing them for statistical modelling. Performance measures are then discussed, comparing the novel features against the standard ones on the reference databases TIMIT and SRE 2004. A further Chapter is dedicated to a different line of research, related to glottal flow characterization.
A physical model of the vocal folds is presented, with a number of control rules, able to describe the dynamics of the vocal folds. The rules translate a specific pharyngeal muscular set-up into mechanical parameters of the model, which results in a specific glottal flow (obtained after a computer simulation of the model). The so-called Inverse Problem is defined as follows: given a glottal flow, find the muscular set-up which, when used to drive a model simulation, reproduces the given flow. The inverse problem presents a number of difficulties, such as the non-uniqueness of the inversion and the sensitivity to slight variations in the input flow. An optimization control technique has been developed and is explained.

The final Chapter summarizes the achievements of the thesis. Along with this discussion, a roadmap for future improvements to the features is sketched. Finally, a summary of the articles published in and submitted to conferences and journals is presented.
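The GMM-based modelling and verification scoring summarized in the abstract can be illustrated with a toy sketch. This is not the thesis implementation: the features are one-dimensional synthetic numbers, all function names and parameters are hypothetical, and the target model is trained independently with EM for brevity (GMM-UBM systems commonly derive it from the background model by MAP adaptation instead). The decision threshold of 0 is likewise an assumption, and no score normalization is applied.

```python
import math
import random

def component_log_probs(x, weights, means, variances):
    """log(w_j * N(x | mean_j, var_j)) for every mixture component j."""
    return [
        math.log(w) - 0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
        for w, m, v in zip(weights, means, variances)
    ]

def log_sum_exp(values):
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

def avg_loglik(data, gmm):
    """Average per-sample log-likelihood of the data under a 1-D GMM."""
    return sum(log_sum_exp(component_log_probs(x, *gmm)) for x in data) / len(data)

def em_step(data, gmm, var_floor=1e-3):
    """One EM iteration: responsibilities (E-step), then re-estimated parameters (M-step)."""
    weights, means, variances = gmm
    k = len(weights)
    n = [1e-10] * k   # soft counts; the tiny offset avoids division by zero
    sx = [0.0] * k    # responsibility-weighted sums of x
    sxx = [0.0] * k   # responsibility-weighted sums of x^2
    for x in data:
        logs = component_log_probs(x, weights, means, variances)
        norm = log_sum_exp(logs)
        for j in range(k):
            r = math.exp(logs[j] - norm)
            n[j] += r
            sx[j] += r * x
            sxx[j] += r * x * x
    total = sum(n)
    new_w = [nj / total for nj in n]
    new_m = [sx[j] / n[j] for j in range(k)]
    new_v = [max(sxx[j] / n[j] - new_m[j] ** 2, var_floor) for j in range(k)]
    return new_w, new_m, new_v

def train_gmm(data, init_means, n_iter=25):
    k = len(init_means)
    gmm = ([1.0 / k] * k, list(init_means), [1.0] * k)
    for _ in range(n_iter):
        gmm = em_step(data, gmm)
    return gmm

def llr_score(test, target_gmm, ubm):
    """Verification score: log-likelihood ratio between target model and background model."""
    return avg_loglik(test, target_gmm) - avg_loglik(test, ubm)

# Synthetic 1-D "features" (assumed distributions, for illustration only).
rng = random.Random(0)
background = [rng.gauss(0.0, 2.0) for _ in range(600)]    # pooled population data
target_train = [rng.gauss(2.0, 0.5) for _ in range(200)]  # enrolment data of the claimed speaker
genuine_test = [rng.gauss(2.0, 0.5) for _ in range(100)]
impostor_test = [rng.gauss(-2.0, 0.5) for _ in range(100)]

ubm = train_gmm(background, [-1.0, 1.0])
target = train_gmm(target_train, [1.5, 2.5])

llr_genuine = llr_score(genuine_test, target, ubm)
llr_impostor = llr_score(impostor_test, target, ubm)
threshold = 0.0  # assumed operating point
print(llr_genuine > threshold, llr_impostor > threshold)
```

Identification follows the same pattern: the test features are scored against every enrolled speaker model and the highest-scoring model is selected, with the score serving as a confidence measure.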

Enrico Marchetto
