Toward Constructing A Multilingual Speech Corpus for Taiwanese (Min-nan), Hakka, and Mandarin Chinese

The Formosa speech database (ForSDat) is a multilingual speech corpus collected at Chang Gung University and sponsored by the National Science Council of Taiwan. The corpus is intended to cover the three most frequently used languages in Taiwan: Taiwanese (Min-nan), Hakka, and Mandarin. The goal of this 3-year project is to collect a phonetically rich speech corpus of more than 1,800 speakers and hundreds of hours of speech. The first version of the corpus, containing speech from 600 speakers of Taiwanese and Mandarin, was recently completed and is ready for release. It contains about 49 hours of speech in 247,000 utterances.
