Unsupervised pattern discovery in speech: applications to word acquisition and speaker segmentation

We present a novel approach to speech processing based on the principle of pattern discovery. Our work represents a departure from traditional models of speech recognition, where the end goal is to classify speech into categories defined by a pre-specified inventory of lexical units (such as phones or words). Instead, we attempt to discover such an inventory in an unsupervised manner by exploiting the structure of repeating patterns within the speech signal.

We show how pattern discovery can be used to automatically acquire lexical entities directly from an untranscribed audio stream. Our approach to unsupervised word acquisition uses a segmental variant of a widely used dynamic programming technique, which allows us to find matching acoustic patterns between spoken utterances. By aggregating information about these matching patterns across audio streams, we demonstrate how to group similar acoustic sequences together to form clusters corresponding to lexical entities such as words and short multi-word phrases. On a corpus of academic lecture material, we demonstrate that clusters found using this technique exhibit high purity and that many of the corresponding lexical identities are relevant to the underlying audio stream.

We demonstrate two applications of our pattern discovery procedure. First, we propose and evaluate two methods for automatically identifying sound clusters generated through pattern discovery. Our results show that high identification accuracy can be achieved for single-word clusters using a constrained isolated word recognizer. Second, we apply acoustic pattern matching to the problem of speaker segmentation by attempting to find word-level speech patterns that are repeated by the same speaker. When used to segment a ten-hour corpus of multi-speaker lectures, our approach generates segmentations that correlate well with independently generated human segmentations.

(Copies available exclusively from MIT Libraries, Rm. 14-0551, Cambridge, MA 02139-4307. Ph. 617-253-5668; Fax 617-253-1690.)
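
As a rough illustration of the kind of dynamic programming alignment referred to above, the following Python sketch computes an alignment cost between two sequences of acoustic feature vectors. It assumes the technique in question is dynamic time warping (DTW); the segmental variant used in this work is not reproduced here. The function names, the Euclidean frame distance, and the `band` parameter are illustrative assumptions and are not taken from the thesis.

```python
import math


def frame_distance(x, y):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def dtw_distance(a, b, band=None):
    """Total cost of the best monotonic alignment between sequences a and b.

    a, b: lists of feature vectors (each a list of floats).
    band: hypothetical parameter; if given, only cells with |i - j| <= band
          are considered, which restricts the path to a diagonal band.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] holds the best cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if band is not None and abs(i - j) > band:
                continue  # cell lies outside the allowed band
            d = frame_distance(a[i - 1], b[j - 1])
            # Extend the cheapest of the three predecessor alignments.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

Two sequences that differ only in how long a frame is held align at zero cost, for example `dtw_distance([[0.0], [1.0], [2.0]], [[0.0], [1.0], [1.0], [2.0]])`. A band constraint of this general kind limits how far the alignment may drift from the diagonal; finding matching sub-sequences within longer utterances, as the abstract describes, would need additional machinery beyond this sketch.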
