The development of file formats for very large speech corpora: SPHERE and SHORTEN

The performance of large vocabulary speech recognition systems is currently thought to be limited by the size of the corpus used to train them. Hence several very large speech corpora have been created recently and many more are planned. A significant problem in generating these corpora is defining a format that minimizes distribution costs and maximizes ease of use. This paper describes the development of a "standard" lossless compressed waveform file format which minimizes the media required for corpus distribution while maximizing accessibility. The paper makes two primary contributions: 1) the use of a "standard" file format for speech corpora which supports embedded compression, together with a software interface toolkit which supports automatic waveform compression/decompression; 2) the use of lossless data compression for speech corpora. Corpus compression differs from mainstream speech coding in that it must be both fast and lossless. Fast approximations to the standard techniques of linear prediction and residual coding have been developed and are employed.
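To make the idea of fast lossless prediction and residual coding concrete, the following is a minimal sketch, not the implementation described in this paper. It assumes two common simplifications: fixed polynomial predictors of order 0 to 3 (repeated differencing, which needs no multiplications) in place of a fitted linear predictor, and Rice codes for the residual. All function names (`residual`, `reconstruct`, `rice_encode`, `rice_decode`, `compress_block`, `decompress_block`) are hypothetical and chosen for illustration only; the bitstream is kept as a string of '0'/'1' characters for readability rather than packed into bytes.

```python
from typing import List, Tuple


def residual(samples: List[int], order: int) -> List[int]:
    """Difference the signal `order` times (fixed polynomial predictor).

    History before the first sample is assumed to be zero.
    """
    out = list(samples)
    for _ in range(order):
        prev = 0
        diffed = []
        for v in out:
            diffed.append(v - prev)
            prev = v
        out = diffed
    return out


def reconstruct(res: List[int], order: int) -> List[int]:
    """Invert `residual` by taking a running sum `order` times."""
    out = list(res)
    for _ in range(order):
        acc = 0
        summed = []
        for v in out:
            acc += v
            summed.append(acc)
        out = summed
    return out


def rice_encode(values: List[int], k: int) -> str:
    """Rice-code signed integers with parameter k."""
    pieces = []
    for v in values:
        u = 2 * v if v >= 0 else -2 * v - 1      # map signed to unsigned
        pieces.append("1" * (u >> k) + "0")      # quotient in unary
        if k:
            pieces.append(format(u & ((1 << k) - 1), f"0{k}b"))  # k low bits
    return "".join(pieces)


def rice_decode(bits: str, k: int, count: int) -> List[int]:
    """Decode `count` Rice-coded signed integers from a bit string."""
    values = []
    pos = 0
    for _ in range(count):
        q = 0
        while bits[pos] == "1":
            q += 1
            pos += 1
        pos += 1                                  # skip the terminating '0'
        r = int(bits[pos:pos + k], 2) if k else 0
        pos += k
        u = (q << k) | r
        values.append(u >> 1 if u % 2 == 0 else -((u + 1) >> 1))
    return values


def compress_block(samples: List[int], max_order: int = 3,
                   max_k: int = 15) -> Tuple[int, int, str]:
    """Pick the order with the smallest residual magnitude, then the k
    giving the shortest code (brute force, for clarity rather than speed)."""
    order = min(range(max_order + 1),
                key=lambda p: sum(abs(e) for e in residual(samples, p)))
    res = residual(samples, order)
    k = min(range(max_k + 1), key=lambda kk: len(rice_encode(res, kk)))
    return order, k, rice_encode(res, k)


def decompress_block(order: int, k: int, bits: str, count: int) -> List[int]:
    """Recover the original samples exactly."""
    return reconstruct(rice_decode(bits, k, count), order)
```

Calling `decompress_block(*compress_block(block), len(block))` on a list of integer samples returns the original list unchanged, which is the lossless property the format requires. In a real file the chosen order, the Rice parameter, and the block length would have to be stored alongside the coded bits; how that is laid out is not specified here.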