Data sampling for improved speech recognizer training

Proper data selection for training a speech recognizer can be important for reducing the cost of developing systems for new tasks and of exploratory experiments, but it is also useful for efficiently leveraging the increasingly large speech resources available for training large vocabulary systems. In this work, we investigate various sampling methods, comparing the likelihood criterion to new acoustic measures motivated by work in child language acquisition. The acoustic criteria can be used with or without pre-existing transcriptions or models. When applied to the problem of selecting a small training set, the best results are obtained using modulation spectrum features and a discriminant function trained on child- vs. adult-directed speech. For large corpora, none of the methods outperforms random sampling, but reduced training costs are obtained by using multistage training and initializing with the small corpus.
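As a rough illustration of the likelihood criterion mentioned above, the sketch below ranks utterances by per-frame log-likelihood under a seed model and selects them up to a frame budget. It is a minimal sketch, not the paper's implementation: the `Utterance` fields, the per-frame normalization, and the greedy budget-filling strategy are all assumptions introduced for illustration.

```python
# Illustrative sketch of likelihood-based training-data selection.
# Assumes each utterance already has a total acoustic log-likelihood
# computed under some existing (seed) recognizer; this is not the
# paper's actual selection code.
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    utt_id: str
    log_likelihood: float  # total acoustic log-likelihood under the seed model
    num_frames: int        # utterance length in frames


def select_by_likelihood(utts: List[Utterance], budget_frames: int) -> List[Utterance]:
    """Greedily pick utterances with the highest per-frame log-likelihood
    until the frame budget (a proxy for training-set size) is exhausted."""
    ranked = sorted(utts, key=lambda u: u.log_likelihood / u.num_frames, reverse=True)
    selected, used = [], 0
    for u in ranked:
        if used + u.num_frames > budget_frames:
            continue
        selected.append(u)
        used += u.num_frames
    return selected


if __name__ == "__main__":
    # Toy pool with made-up scores, only to show the ranking step.
    pool = [
        Utterance("utt001", -4200.0, 600),
        Utterance("utt002", -9100.0, 1100),
        Utterance("utt003", -2500.0, 400),
    ]
    subset = select_by_likelihood(pool, budget_frames=1000)
    print([u.utt_id for u in subset])
```

The acoustic criteria described in the paper would replace the per-frame log-likelihood score with a measure that does not require transcriptions or a pre-existing model, such as a score from a discriminant function over modulation spectrum features; the selection step itself stays the same.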