Object-based modelling for representing and processing speech corpora

This thesis deals with modelling data in large speech corpora using an object-oriented paradigm that captures important linguistic structures. Information from corpora is transformed into objects that are assigned properties regarding their behaviour. These objects, called speech units, are placed onto a multi-dimensional framework, and their relationships to other units are explicitly defined through links. Frameworks that model temporal utterances, or atemporal information such as speaker characteristics and recording conditions, can be searched efficiently for contextual matches. Speech units that match desired contexts are the result of successful linguistically motivated queries and can be used in further speech processing tasks in the same computational environment. This allows empirical studies of speech and its relation to linguistic structures to be carried out, and supports the training and testing of applications such as speech recognition and synthesis. Information residing in typical speech corpora is discussed first, followed by an overview of object-orientation, which sets the tone for this thesis. The representation framework is then introduced; it is generated by a compiler and linker that rely on a set of domain-specific resources to transform corpus data into speech units. Operations on this framework are then presented, along with a comparison between a relational and an object-oriented model of identical speech data. The models described in this work are directly applicable to existing large speech corpora, and the methods developed here are tested against relational database methods. The object-oriented methods outperform the relational methods for typical linguistically relevant queries by about three orders of magnitude, as measured by database search times.
This improvement in simplicity of representation and search speed is crucial for the utilisation of large multi-lingual corpora in basic research on the detailed properties of speech, especially in relation to contextual variation.
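
The idea of speech units joined by explicit links and searched by context can be illustrated with a minimal sketch. It is not the implementation described in the thesis: the class `SpeechUnit`, its fields, the helpers `link_children` and `find_in_context`, and the example word with its timings are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List, Optional


@dataclass(eq=False)
class SpeechUnit:
    """Hypothetical speech unit: a labelled time span with explicit links."""
    label: str
    level: str            # e.g. "word" or "phoneme" (assumed level names)
    start: float          # seconds
    end: float            # seconds
    # Links are excluded from repr to avoid infinite recursion through cycles.
    parent: Optional["SpeechUnit"] = field(default=None, repr=False)
    prev: Optional["SpeechUnit"] = field(default=None, repr=False)
    next: Optional["SpeechUnit"] = field(default=None, repr=False)
    children: List["SpeechUnit"] = field(default_factory=list, repr=False)

    @property
    def duration(self) -> float:
        return self.end - self.start


def link_children(parent: SpeechUnit, children: List[SpeechUnit]) -> None:
    """Attach children to a parent and chain them left to right."""
    parent.children = list(children)
    for child in children:
        child.parent = parent
    for left, right in zip(children, children[1:]):
        left.next = right
        right.prev = left


def find_in_context(units: Iterable[SpeechUnit], label: str,
                    left: Optional[str] = None,
                    right: Optional[str] = None) -> Iterator[SpeechUnit]:
    """Yield units with the given label whose neighbours match the context."""
    for unit in units:
        if unit.label != label:
            continue
        if left is not None and (unit.prev is None or unit.prev.label != left):
            continue
        if right is not None and (unit.next is None or unit.next.label != right):
            continue
        yield unit


# Invented example data: one word and its phoneme-level units.
word = SpeechUnit("kissa", "word", 0.00, 0.50)
phonemes = [
    SpeechUnit("k", "phoneme", 0.00, 0.08),
    SpeechUnit("i", "phoneme", 0.08, 0.15),
    SpeechUnit("s", "phoneme", 0.15, 0.25),
    SpeechUnit("s", "phoneme", 0.25, 0.35),
    SpeechUnit("a", "phoneme", 0.35, 0.50),
]
link_children(word, phonemes)

# Contextual query: every "s" that directly follows an "i".
matches = list(find_in_context(phonemes, "s", left="i"))
```

Because each match is itself an object, a query result can be passed straight to further processing, for example by following `matches[0].parent` to the enclosing word or reading `matches[0].duration`, without leaving the same environment.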
