Voice modeling methods for automatic speaker recognition

Building a voice model means capturing the characteristics of a speaker’s voice in a data structure. This data structure is then used by a computer for further processing, such as comparison with other voices. Voice modeling is a vital step in automatic speaker recognition, which itself is the foundation of several applied technologies: (a) biometric authentication, (b) speech recognition and (c) multimedia indexing. Several challenges arise in the context of automatic speaker recognition. First, there is the problem of data shortage, i.e., the unavailability of sufficiently long utterances for speaker recognition. It stems from the fact that the speech signal conveys different aspects of the sound in a single, one-dimensional time series: linguistic (what is said?), prosodic (how is it said?), individual (who said it?), locational (where is the speaker?) and emotional features of the speech sound itself (to name a few) are contained in the speech signal, as well as acoustic background information. To analyze one specific aspect of the sound independently of the others, analysis methods have to be applied at the time scale (length) of the signal at which this aspect stands out from the rest. For example, linguistic information (i.e., which phone or syllable has been uttered?) is found in very short time spans of only milliseconds in length. In contrast, speaker-specific information emerges more clearly the longer the analyzed sound is. Long utterances, however, are not always available for analysis. Second, the speech signal is easily corrupted by background sound sources (noise, such as music or sound effects). If present, their characteristics tend to dominate a voice model, so that the outcome of a model comparison may be driven mainly by background features instead of speaker characteristics.
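To make the notion of a voice model as a comparable data structure concrete, the following is a minimal sketch under stated assumptions and not the method developed in this thesis. It assumes that each utterance has already been converted into a list of feature vectors (frames), summarizes them with a single diagonal Gaussian, and compares two such models with a symmetric Kullback-Leibler divergence. The names `VoiceModel`, `build_voice_model`, `model_distance` and `synthetic_frames` are hypothetical, and the frames are random numbers standing in for real acoustic features.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class VoiceModel:
    """Hypothetical voice model: one diagonal Gaussian over feature vectors."""
    mean: List[float]
    var: List[float]


def build_voice_model(frames: Sequence[Sequence[float]],
                      var_floor: float = 1e-6) -> VoiceModel:
    """Estimate per-dimension mean and variance from the frames of one utterance."""
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    # Floor the variance so that very short utterances cannot yield zero variance.
    var = [max(sum((f[d] - mean[d]) ** 2 for f in frames) / n, var_floor)
           for d in range(dim)]
    return VoiceModel(mean, var)


def model_distance(a: VoiceModel, b: VoiceModel) -> float:
    """Symmetric KL divergence KL(a||b) + KL(b||a) between two diagonal Gaussians."""
    total = 0.0
    for ma, va, mb, vb in zip(a.mean, a.var, b.mean, b.var):
        diff2 = (ma - mb) ** 2
        total += 0.5 * (va / vb + vb / va - 2.0)
        total += 0.5 * diff2 * (1.0 / va + 1.0 / vb)
    return total


def synthetic_frames(center: Sequence[float], n: int,
                     rng: random.Random) -> List[List[float]]:
    """Stand-in for real feature extraction: unit-variance noise around a center."""
    return [[rng.gauss(c, 1.0) for c in center] for _ in range(n)]


rng = random.Random(0)
speaker_a = build_voice_model(synthetic_frames([0.0, 0.0], 2000, rng))
speaker_a_again = build_voice_model(synthetic_frames([0.0, 0.0], 2000, rng))
speaker_b = build_voice_model(synthetic_frames([3.0, -2.0], 2000, rng))

same_speaker_distance = model_distance(speaker_a, speaker_a_again)
different_speaker_distance = model_distance(speaker_a, speaker_b)
```

In this toy setting, two models built from the same synthetic source end up much closer to each other than models built from different sources. With only a handful of frames per utterance the estimates become far less reliable, which is the data shortage problem described above in miniature.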
Current automatic speaker recognition works well under relatively constrained circumstances, such as studio recordings, or when prior knowledge of the number and identity of the occurring speakers is available. Under more adverse conditions, such as in feature films or amateur material on the web, the achieved speaker recognition scores drop below a rate that is acceptable for an end user or for further processing. For example, the typical speaker turn duration of only one second and the sound effect background in cinematic movies render most current automatic analysis techniques useless. In this thesis, methods for voice modeling that are robust with respect to short utterances and background noise are presented. The aim is to facilitate movie


[232]  Andrew C. Morris,et al.  PAPER Special Section/Issue on Corpus-Based Speech Technologies GMM based clustering and speaker separability in the Timit speech database , 2005 .

[233]  M. Palaniswami,et al.  Classification of multidimensional trajectories for acoustic modeling using support vector machines , 2004, International Conference on Intelligent Sensing and Information Processing, 2004. Proceedings of.

[234]  Biing-Hwang Juang,et al.  Auditory perception and cognition , 2008, IEEE Signal Processing Magazine.

[235]  Guoli Wang,et al.  LS-NMF: A modified non-negative matrix factorization algorithm utilizing uncertainty estimates , 2006, BMC Bioinformatics.

[236]  Rajesh M. Hegde,et al.  Application of the modified group delay function to speaker identification and discrimination , 2004, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[237]  M. Demirekler,et al.  Comparison of parametric and non-parametric representations of speech for recognition , 1994, Proceedings of MELECON '94. Mediterranean Electrotechnical Conference.

[238]  Paul A. Viola,et al.  Robust Real-Time Face Detection , 2001, International Journal of Computer Vision.

[239]  Constantine Kotropoulos,et al.  Computationally Efficient and Robust BIC-Based Speaker Segmentation , 2008, IEEE Transactions on Audio, Speech, and Language Processing.

[240]  François Pachet,et al.  Exploring Billions of Audio Features , 2007, 2007 International Workshop on Content-Based Multimedia Indexing.

[241]  Luc Van Gool,et al.  SURF: Speeded Up Robust Features , 2006, ECCV.

[242]  Jürgen Schmidhuber,et al.  Driven by Compression Progress: A Simple Principle Explains Essential Aspects of Subjective Beauty, Novelty, Surprise, Interestingness, Attention, Curiosity, Creativity, Art, Science, Music, Jokes , 2008, ABiALS.

[243]  Remco C. Veltkamp,et al.  Using transportation distances for measuring melodic similarity , 2003, ISMIR.

[244]  David R. Hill,et al.  Speaker Classification Concepts: Past, Present and Future , 2007, Speaker Classification.

[245]  Kishore Prahallad,et al.  Source and system features for speaker recognition using AANN models , 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).

[246]  Steven B. Smith,et al.  Digital Signal Processing: A Practical Guide for Engineers and Scientists , 2002 .

[247]  Hirotaka Nakasone,et al.  Forensic automatic speaker recognition , 2001, Odyssey.

[248]  Leonidas J. Guibas,et al.  The Earth Mover's Distance as a Metric for Image Retrieval , 2000, International Journal of Computer Vision.

[249]  Xavier Anguera Miró,et al.  Robust Speaker Segmentation for Meetings: The ICSI-SRI Spring 2005 Diarization System , 2005, MLMI.

[250]  Herbert Gish,et al.  Clustering speakers by their voices , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[251]  H. Nyquist,et al.  Certain Topics in Telegraph Transmission Theory , 1928, Transactions of the American Institute of Electrical Engineers.

[252]  Mark Huckvale,et al.  How Is Individuality Expressed in Voice? An Introduction to Speech Production and Description for Speaker Classification , 2007, Speaker Classification.

[253]  Xu Shao,et al.  Speech reconstruction from mel-frequency cepstral coefficients using a source-filter model , 2002, INTERSPEECH.

[254]  Bernd Freisleben,et al.  Unfolding speaker clustering potential: a biomimetic approach , 2009, ACM Multimedia.

[255]  Beth Logan,et al.  A music similarity function based on signal analysis , 2001, IEEE International Conference on Multimedia and Expo, 2001. ICME 2001..

[256]  Yoram Singer,et al.  Improved Boosting Algorithms Using Confidence-rated Predictions , 1998, COLT' 98.

[257]  Martin Ester,et al.  Knowledge Discovery in Databases - Techniken und Anwendungen , 2000 .

[258]  Holger Kantz,et al.  Practical implementation of nonlinear time series methods: The TISEAN package. , 1998, Chaos.

[259]  K. Mathiak,et al.  Does Playing Violent Video Games Induce Aggression? Empirical Evidence of a Functional Magnetic Resonance Imaging Study , 2006 .

[260]  M.G. Bellanger,et al.  Digital processing of speech signals , 1980, Proceedings of the IEEE.

[261]  Bernd Freisleben,et al.  Eine service-orientierte Grid-Infrastruktur zur Unterstützung medienwissenschaftlicher Filmanalyse , 2009, GeNeMe.

[262]  Hai Huang,et al.  Speech pitch determination based on Hilbert-Huang transform , 2006, Signal Process..

[263]  Alfred Ultsch,et al.  Pareto Density Estimation: A Density Estimation for Knowledge Discovery , 2005 .

[264]  Allen Y. Yang,et al.  Feature Selection in Face Recognition: A Sparse Representation Perspective , 2007 .

[265]  Douglas E. Sturim,et al.  The 2004 MIT Lincoln Laboratory speaker recognition system , 2005, Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005..

[266]  Bernd Freisleben,et al.  University of Marburg at TRECVID 2007: Shot Boundary Detection and High Level Feature Extraction , 2007, TRECVID.

[267]  Daniel A. Keim,et al.  Information Visualization and Visual Data Mining , 2002, IEEE Trans. Vis. Comput. Graph..

[268]  Rubo Zhang,et al.  Speech Detection Based on Hilbert-Huang Transform , 2006, First International Multi-Symposiums on Computer and Computational Sciences (IMSCCS'06).

[269]  S. W. Beet,et al.  Visual representations of speech signals , 1993 .

[270]  Jakub Dabkowski,et al.  On Some Method of Analysing Time Series , 1998 .

[271]  Benoit B. Mandelbrot,et al.  Fractal Geometry of Nature , 1984 .

[272]  Ian Foster,et al.  The Grid 2 - Blueprint for a New Computing Infrastructure, Second Edition , 1998, The Grid 2, 2nd Edition.

[273]  Douglas A. Reynolds,et al.  Integrated models of signal and background with application to speaker identification in noise , 1994, IEEE Trans. Speech Audio Process..

[274]  Sancho Salcedo-Sanz,et al.  Offline speaker segmentation using genetic algorithms and mutual information , 2006, IEEE Transactions on Evolutionary Computation.

[275]  Shingo Kuroiwa,et al.  Nonparametric Speaker Recognition Method Using Earth Mover's Distance , 2006, IEICE Trans. Inf. Syst..

[276]  Belkacem Fergani,et al.  Speaker diarization using one-class support vector machines , 2008, Speech Commun..

[277]  R.W. Schafer,et al.  From frequency to quefrency: a history of the cepstrum , 2004, IEEE Signal Processing Magazine.

[278]  Werner Verhelst Overlap-add methods for time-scaling of speech , 2000, Speech Commun..

[279]  Bayya Yegnanarayana,et al.  Speaker change detection in casual conversations using excitation source features , 2008, Speech Commun..

[280]  Mauro Cettolo,et al.  Evaluation of BIC-based algorithms for audio segmentation , 2005, Comput. Speech Lang..

[281]  Treebank Penn,et al.  Linguistic Data Consortium , 1999 .

[282]  Horst Stöcker,et al.  Taschenbuch mathematischer Formeln und moderner Verfahren (3. Aufl.) , 1995 .

[283]  Jonathan Foote,et al.  Visualizing music and audio using self-similarity , 1999, MULTIMEDIA '99.

[284]  N. L. Johnson,et al.  Multivariate Analysis , 1958, Nature.

[285]  William J. Fitzgerald,et al.  A Class of Kernels For Sets of Vectors , 2005, ESANN.

[286]  Nuria Oliver,et al.  Understanding near-duplicate videos: a user-centric approach , 2009, ACM Multimedia.

[287]  Bernd Freisleben,et al.  Fast and Robust Speaker Clustering Using the Earth Mover'S Distance and Mixmax Models , 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.

[288]  David Burshtein,et al.  Noise adaptation of HMM speech recognition systems using tied-mixtures in the spectral domain , 1997, IEEE Trans. Speech Audio Process..

[289]  Yang Wang,et al.  Cost-sensitive boosting for classification of imbalanced data , 2007, Pattern Recognit..

[290]  Y. Ephraim,et al.  A Brief Survey of Speech Enhancement , 2003 .

[291]  Alfred Ultsch Proof of Pareto’s 80/20 Law and Precise Limits for ABC-Analysis , 2002 .

[292]  Kirk L. Kroeker,et al.  Face recognition breakthrough , 2009, Commun. ACM.

[293]  Steven Skiena,et al.  The Algorithm Design Manual , 2020, Texts in Computer Science.

[294]  Trevor Darrell,et al.  Fast contour matching using approximate earth mover's distance , 2004, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004..

[295]  Bernd Freisleben,et al.  Dimension-Decoupled Gaussian Mixture Model for Short Utterance Speaker Recognition , 2010, 2010 20th International Conference on Pattern Recognition.

[296]  José Manuel Benítez,et al.  Consistency measures for feature selection , 2008, Journal of Intelligent Information Systems.

[297]  Donald E. Knuth,et al.  The art of computer programming. Vol.2: Seminumerical algorithms , 1981 .

[298]  Masafumi Nishida,et al.  Speaker indexing for news articles, debates and drama in broadcasted TV programs , 1999, Proceedings IEEE International Conference on Multimedia Computing and Systems.