Applying multiple regression models for predicting word duration in a corpus of spontaneous speech

Using word duration as a proxy for pronunciation variation, the objective of this research was to delineate a set of variables known to affect word duration and to determine the total amount of variation in duration that they account for in a multiple linear regression model. More importantly, computing the amount of variation each variable contributes independently of the others is crucial in assessing its predictive power. Authors such as [1] claim that probabilistic measures such as unigram probability greatly affect whether a word is likely to be reduced in its pronunciation (i.e. the more likely a word is to appear, the greater the chance of it being reduced). However, when a regression analysis was performed on word durations from the Variation in Conversation (ViC) corpus of spontaneous speech and partial correlation coefficients were computed for each factor, the results showed that probabilistic measures such as unigram and bigram probability account for less than 1% of the variation in word duration. This finding suggests that the predictive power of certain variables depends on the type of corpus being examined: in the spontaneous speech studies in [1], the examined corpus consisted of phone conversations, while the ViC corpus contains monologues.
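
The analysis described above (a multiple linear regression followed by a partial correlation coefficient for each factor) can be illustrated with a minimal sketch. It is not the implementation used in this study. The data are synthetic and generated only for illustration, and the predictor names (n_phones, speech_rate, log_unigram) and their effect sizes are assumptions, not values from the ViC corpus. The partial correlation for a factor is computed here as the correlation between two sets of residuals: duration regressed on the remaining factors, and the factor itself regressed on the remaining factors.

```python
import numpy as np


def _design(X, n):
    """Prepend an intercept column to the predictor matrix."""
    return np.column_stack([np.ones(n), X])


def residuals(X, y):
    """Residuals of an ordinary least-squares fit of y on X (with intercept)."""
    A = _design(X, len(y))
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return y - A @ coef


def r_squared(X, y):
    """Proportion of variance in y accounted for by X."""
    res = residuals(X, y)
    return 1.0 - np.sum(res ** 2) / np.sum((y - y.mean()) ** 2)


def partial_correlations(X, y):
    """Partial correlation of each column of X with y, controlling for the rest."""
    out = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        res_y = residuals(others, y)         # y with the other factors removed
        res_x = residuals(others, X[:, j])   # factor j with the other factors removed
        out.append(np.corrcoef(res_x, res_y)[0, 1])
    return np.array(out)


# Synthetic data, invented for illustration only (not corpus measurements).
rng = np.random.default_rng(0)
n = 2000
n_phones = rng.integers(1, 10, size=n).astype(float)  # hypothetical predictor
speech_rate = rng.normal(0.0, 1.0, size=n)            # hypothetical predictor
log_unigram = rng.normal(0.0, 1.0, size=n)            # hypothetical predictor
duration = (0.06 * n_phones - 0.02 * speech_rate
            + 0.002 * log_unigram + rng.normal(0.0, 0.05, size=n))

names = ["n_phones", "speech_rate", "log_unigram"]
X = np.column_stack([n_phones, speech_rate, log_unigram])
y = duration

r2_full = r_squared(X, y)
pc = partial_correlations(X, y)

print(f"R^2 of full model: {r2_full:.3f}")
for name, r in zip(names, pc):
    print(f"{name}: partial r = {r:+.3f}, squared = {r * r:.4f}")
```

The squared partial correlation gives the share of the variance left unexplained by the other factors that the factor in question accounts for; a squared semipartial correlation (the drop in R^2 when the factor is removed) would be an alternative measure of unique contribution.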