Rapid and accurate spoken term detection

We present a state-of-the-art system for spoken term detection on continuous telephone speech in multiple languages. The system compiles a search index from deep word lattices generated by a large-vocabulary HMM speech recognizer. It estimates word posteriors from the lattices and uses them to compute a detection threshold that minimizes the expected value of a user-specified cost function. The system accommodates search terms outside the vocabulary of the speech-to-text engine by using approximate string matching on induced phonetic transcripts. Its search index occupies less than 1 Mb per hour of processed speech and supports sub-second search times for a corpus of hundreds of hours of audio. This system had the highest reported accuracy on the telephone speech portion of the 2006 NIST Spoken Term Detection evaluation, achieving 83% of the maximum possible accuracy score in English.
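
As an illustration of how a posterior-based threshold can minimize expected cost, the following minimal sketch applies the standard Bayes decision rule to a list of candidate detections. It is not the system's actual implementation: the `Candidate` record, the function names, and the simple two-parameter cost model (a fixed cost per miss and per false alarm) are assumptions made for this example, and the cost function used in the evaluation may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A hypothesised occurrence of a search term (hypothetical record layout)."""
    term: str
    start_sec: float
    posterior: float  # estimated probability that the term was really spoken here


def decision_threshold(miss_cost: float, false_alarm_cost: float) -> float:
    """Posterior above which reporting a candidate is cheaper than discarding it.

    Assumed cost model: reporting a candidate with posterior p has expected
    cost (1 - p) * false_alarm_cost, and discarding it has expected cost
    p * miss_cost. Reporting is cheaper when
    p > false_alarm_cost / (miss_cost + false_alarm_cost).
    """
    if miss_cost <= 0 or false_alarm_cost <= 0:
        raise ValueError("costs must be positive")
    return false_alarm_cost / (miss_cost + false_alarm_cost)


def detect(candidates, miss_cost, false_alarm_cost):
    """Keep only the candidates whose posterior exceeds the cost-derived threshold."""
    threshold = decision_threshold(miss_cost, false_alarm_cost)
    return [c for c in candidates if c.posterior > threshold]


if __name__ == "__main__":
    hits = [
        Candidate("nist", 12.4, 0.30),
        Candidate("nist", 87.1, 0.20),
    ]
    # With misses three times as costly as false alarms, the threshold is 0.25,
    # so only the first candidate is reported.
    for c in detect(hits, miss_cost=3.0, false_alarm_cost=1.0):
        print(c.term, c.start_sec, c.posterior)
```

Under this assumed model, raising the cost of a miss relative to a false alarm lowers the threshold, so more low-confidence candidates are reported.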