Search Problems for Speech and Audio Sequences

The modern proliferation of very large audio and video databases has created a need for effective methods of indexing and searching highly variable or uncertain data. Classical search and indexing algorithms deal with clean input sequences. However, an index created from speech or music transcriptions is marked with errors and uncertainties stemming from the use of imperfect statistical models in the transcription process. This thesis presents novel algorithms, analyses, and general techniques and tools for effective indexing and search that not only tolerate but exploit this uncertainty.

We have devised a new music identification technique in which each song is represented by a distinct sequence of music sounds, called "music phonemes." We learn the set of music phonemes, as well as a unique sequence of music phonemes characterizing each song, using an unsupervised algorithm. We also create a compact mapping of music phoneme sequences to songs. Using these techniques, we construct an efficient and robust large-scale music identification system.

We have further designed new algorithms for compact indexing of uncertain inputs based on suffix and factor automata and given novel theoretical guarantees for their space requirements. We show that the suffix automaton or factor automaton of a set of strings U has at most 2Q - 2 states, where Q is the number of nodes of a prefix-tree representing the strings in U. We also describe matching new linear-time algorithms for constructing the suffix automaton S or factor automaton F of U in time O(|S|).

We have also defined a new quality measure for topic segmentation systems and designed a discriminative topic segmentation algorithm for speech inputs. The new quality measure improves on previously used criteria and is correlated with human judgment of topic-coherence. Our segmentation algorithm uses a novel general topical similarity score based on word co-occurrences. This new algorithm outperforms previous methods in experiments over speech and text streams. We further demonstrate that the performance of segmentation algorithms can be improved by using a lattice of competing hypotheses over the speech stream rather than just the one-best hypothesis as input.
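
The 2Q - 2 state bound stated above can be checked by brute force on small inputs. The sketch below is not the linear-time construction described in the thesis. It assumes that "suffix automaton" means the minimal deterministic automaton accepting every suffix of every string in U, and it counts that automaton's states as the number of distinct residual languages (the Myhill-Nerode characterization). The function names and the sample sets are hypothetical, chosen only for illustration.

```python
def prefix_tree_nodes(strings):
    """Q: number of nodes in the prefix tree (trie) of the strings, root included."""
    # Each trie node corresponds to exactly one distinct prefix.
    return len({s[:i] for s in strings for i in range(len(s) + 1)})


def suffix_automaton_states(strings):
    """State count of the minimal deterministic automaton accepting all suffixes.

    Brute force: two factors lead to the same state exactly when they have
    the same set of continuations that complete a suffix.
    """
    suffixes = {s[i:] for s in strings for i in range(len(s) + 1)}
    factors = {s[i:j] for s in strings
               for i in range(len(s) + 1)
               for j in range(i, len(s) + 1)}
    residuals = {
        frozenset(x[len(w):] for x in suffixes if x.startswith(w))
        for w in factors
    }
    return len(residuals)


# Illustrative sets of strings (assumed examples, not taken from the thesis).
for U in (["ab"], ["ab", "ac"], ["abab", "abba"]):
    Q = prefix_tree_nodes(U)
    n = suffix_automaton_states(U)
    print(U, "Q =", Q, "states =", n, "bound 2Q - 2 =", 2 * Q - 2)
```

This enumeration is quadratic or worse in the input size, so it is useful only as a sanity check on tiny sets, not as a substitute for the O(|S|) construction.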
