Data Driven Approaches to Speech and Language Processing

Speech and language processing systems can be categorised according to whether they rely on predefined linguistic information and rules or are data driven, exploiting machine learning techniques to automatically extract and process relevant units of information, which are then indexed and retrieved as appropriate. As an example, most state-of-the-art automatic speech processing systems rely on a representation based on predefined phonetic symbols. The use of language-dependent representations, whilst linguistically intuitive, has several drawbacks, notably limited portability across languages and long development time. Therefore, in this article, we review and present our recent experiments exploiting the idea inherent in the ALISP (Automatic Language Independent Speech Processing) approach, with particular attention to speech processing, where the intermediate representation between the acoustic and linguistic levels is automatically inferred from speech data. We then present prospective directions in which the ALISP principles could be exploited in other domains such as audio, speech, text, image and video processing.
