A prototype voice-response questionnaire for the U.S. Census

Ronald Cole, David G. Novick, Mark Fanty, Pieter Vermeulen, Stephen Sutton, Dan Burnett and Johan Schalkwyk

Center for Spoken Language Understanding
Oregon Graduate Institute of Science and Technology
20000 N.W. Walker Road, P.O. Box 91000, Portland, OR 97291-1000, USA

ABSTRACT

This paper describes a study conducted to determine the feasibility of using a spoken questionnaire to collect information for the Year 2000 Census in the USA. To refine the dialogue and to train recognizers, we collected complete protocols from over 4000 callers. For the responses labeled (about half), over 99 percent of the answers contain the desired information. The recognizers trained so far range in performance from 75 percent correct on year of birth to over 99 percent for marital status. We developed a prototype system that engages the callers in a dialogue to obtain the desired information, reviews the recognized information at the end of the call, and asks the caller to identify the response categories that are incorrect.

1. INTRODUCTION

We have conducted a study to determine the feasibility of using an automated spoken questionnaire to collect information for the Year 2000 Census in the United States of America. The goal of the study was to develop and evaluate a telephone questionnaire that automatically captures and recognizes the following information: (1) full name, (2) sex, (3) birth date, (4) marital status (now married, widowed, divorced, separated, never married; choose one), (5) Hispanic origin (yes or no); if Hispanic: Mexican, Mexican-American, Chicano, Puerto Rican, Cuban or other (specify), (6) race: White, Black or Negro, American Indian (specify tribe), Eskimo, Aleut, Chinese, Japanese, Filipino, Asian Indian, Hawaiian, Samoan, Korean, Guamanian, Vietnamese or other (specify).

After preliminary rounds of data collection to refine the selection and wording of the system prompts, a large, regionally diverse data collection effort resulted in approximately 4000 calls. This paper describes the effectiveness of the protocol in eliciting the desired information, and it describes the spoken language system that resulted.

2. SYSTEM

2.1. Recognition

Signal Processing. The caller's response is transmitted over the digital phone line as an 8 kHz mu-law encoded digital signal. A seventh-order Perceptual Linear Predictive (PLP) analysis [1] is performed every 6 msec using a 10 msec window.

Phonetic Classification. Each 6 msec frame of the signal is classified phonetically by a three-layer neural network. To achieve maximum performance, a separate vocabulary-dependent network is trained for each response category, using a phoneme set particular to the expected pronunciations of words in that response category. This set consists of the subset of standard phonemes which occur in the vocabulary, plus any additional context-dependent phonemes which were deemed necessary (e.g. [tw] for the [t] in "twenty" and "twelve"). The background noise and silence are modeled by a special phoneme [.pau].

For each frame of speech, the neural network is provided with 70 inputs. These consist of eight PLP coefficients and two voicing outputs, taken from the frame to be classified and also averaged over each of the following regions before and after the frame to be classified: 6 to 18 msec, 36 to 48 msec and 72 to 84 msec (ten values for each of seven regions).

The two inputs that estimate voicing for each frame are provided by a separate three-layer neural network trained on voiced and voiceless speech frames from ten different languages. Although the voicing classifier is trained with the same PLP features described above, experiments have shown that including these features improves classification performance.

The outputs of the network fall in the range (0,1) because of the sigmoid transfer function, and, ideally, approximate the a posteriori probability of that phoneme given the input [2].
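
The 70-element input described above (ten values from the frame itself plus ten averaged values from each of three regions on either side) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the names `build_input`, `region_offsets` and `mean_vector` are hypothetical, region endpoints are treated as inclusive and converted to frame offsets using the 6 msec frame step, and frames beyond the utterance edges are handled by clamping to the first or last frame.

```python
FRAME_STEP_MS = 6                            # one feature frame every 6 msec
REGIONS_MS = [(6, 18), (36, 48), (72, 84)]   # context regions on each side


def region_offsets(start_ms, end_ms, step_ms=FRAME_STEP_MS):
    """Frame offsets covered by a region (inclusive endpoints: an assumption)."""
    return range(start_ms // step_ms, end_ms // step_ms + 1)


def mean_vector(frames, indices):
    """Element-wise mean of the per-frame vectors at the given indices."""
    n = len(indices)
    width = len(frames[0])
    return [sum(frames[i][k] for i in indices) / n for k in range(width)]


def build_input(frames, t):
    """Assemble the classifier input for frame t.

    frames: list of per-frame vectors (8 PLP coefficients + 2 voicing values).
    Returns the frame's own values followed by the region averages
    before the frame, then the region averages after it.
    """
    last = len(frames) - 1
    vec = list(frames[t])
    for sign in (-1, +1):                    # regions before, then after
        for start_ms, end_ms in REGIONS_MS:
            idx = [max(0, min(last, t + sign * k))   # clamp at edges (assumption)
                   for k in region_offsets(start_ms, end_ms)]
            vec.extend(mean_vector(frames, idx))
    return vec


# Toy data: 40 frames whose 10 values all equal the frame index.
frames = [[float(i)] * 10 for i in range(40)]
print(len(build_input(frames, 20)))          # 70
```

With a 6 msec step, the three regions correspond to frame offsets 1 to 3, 6 to 8 and 12 to 14, so each averaged block summarizes three frames.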
The network outputs are divided by the prior probability of the phoneme in the training set [3].

Training the Classifiers. Training the neural network required phonetically segmented data. We used a semi-automatic procedure that involved hand transcription at the word level of about a quarter of the corpus and automatic generation of "forced" phonetic alignments of these transcriptions using a classifier trained on a different task. A new classifier was then trained on the automatically aligned census data and used to realign it. The process was repeated a couple of times until performance asymptoted.

An equal number of training samples (approximately 1000) was used for each phoneme class. As a consequence, rare phonemes were sampled more finely than common phonemes. Training examples for background noise and silence were chosen such that at least half occur close to phoneme boundaries. This balancing was needed to train for proper discrimination between the background class and unvoiced closures.

The neural network was trained using backpropagation

Proceedings of ICSLP-94, Sept. 1994. IEEE 1994.
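
The class-balanced selection of training frames described under Training the Classifiers (about 1000 samples per phoneme class) might look like the following sketch. It is a minimal illustration, not the procedure actually used: the function name `balanced_sample`, the fixed random seed, and the choice to keep every frame of a class that has fewer than the target count are all assumptions, and the special selection of background examples near phoneme boundaries is not modeled.

```python
import random
from collections import defaultdict


def balanced_sample(labeled_frames, per_class=1000, seed=0):
    """Draw up to per_class frames for every phoneme label.

    labeled_frames: iterable of (phoneme_label, feature_vector) pairs.
    Returns a dict mapping each label to its sampled feature vectors.
    """
    by_class = defaultdict(list)
    for label, features in labeled_frames:
        by_class[label].append(features)

    rng = random.Random(seed)                # fixed seed for repeatability
    sample = {}
    for label, items in by_class.items():
        k = min(per_class, len(items))       # small classes: keep all (assumption)
        sample[label] = rng.sample(items, k) # sampling without replacement
    return sample


# Toy data: a common phoneme with 5000 frames and a rare one with 1200.
data = [("ah", [i]) for i in range(5000)] + [("zh", [i]) for i in range(1200)]
picked = balanced_sample(data)
print({label: len(items) for label, items in picked.items()})
```

In this toy case both classes contribute 1000 frames, which is one fifth of the common phoneme's frames but five sixths of the rare one's. That is the sense in which rare phonemes end up sampled more finely.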

[1] H. Hermansky, et al. Perceptual linear predictive (PLP) analysis of speech, 1990, The Journal of the Acoustical Society of America.

[2] Hervé Bourlard, et al. Continuous speech recognition using multilayer perceptrons with hidden Markov models, 1990, International Conference on Acoustics, Speech, and Signal Processing.

[3] H. Bourlard, et al. Links Between Markov Models and Multilayer Perceptrons, 1990, IEEE Trans. Pattern Anal. Mach. Intell.

[4] R. Schwartz, et al. A comparison of several approximate algorithms for finding multiple (N-best) sentence hypotheses, 1991, ICASSP 91: 1991 International Conference on Acoustics, Speech, and Signal Processing.

[5] Ronald A. Cole, et al. English alphabet recognition with telephone speech, 1991, EUROSPEECH.

[6] Horacio Franco, et al. Context-Dependent Multiple Distribution Phonetic Modeling with MLPs, 1992, NIPS.

[7] Hervé Bourlard, et al. A new approach towards keyword spotting, 1993, EUROSPEECH.