Abstract

In this paper, we propose an improved approach for spoken term detection using pseudo-relevance feedback. To remedy the problem of acoustic models that are poorly matched to spoken utterances produced under different acoustic conditions, which may give relatively poor recognition output, we integrate the relevance scores derived from the lattices with the DTW distances derived from the feature space of MFCC parameters or phonetic posteriorgrams. These DTW distances are evaluated for a carefully selected set of pseudo-relevant utterances, which are obtained from the first-pass returned list given by the search engine. The utterances on the first-pass returned list are then reranked accordingly and finally shown to the user. Very encouraging performance improvements were obtained in the preliminary experiments, especially when the acoustic models are poorly matched to the spoken utterances.

Index Terms: spoken term detection, pseudo-relevance feedback

1. Introduction

Spoken term detection aims to return a list of spoken utterances containing the term requested by the user. In many approaches to spoken term detection, the spoken utterances are first recognized and transformed into transcriptions or lattices by speech recognition technologies, and then the search engine looks through all the transcriptions or lattices in a manner very similar to text-based information retrieval. In this process, much of the information in the acoustic signals may be lost in the speech recognition stage, especially when the acoustic models used are not well matched to the characteristics of the acoustic signals, which naturally results in degraded recognition accuracy and poor detection performance.
This is very common in the scenario of spoken term detection, because the huge quantities of spoken utterances available over the Internet are naturally produced by many different people under many different acoustic conditions; it is thus very difficult to train a set of acoustic models well matched to so many different acoustic conditions. As a result, when relevance scores such as the posterior probabilities of the query term derived from transcriptions or lattices are used to rank the retrieved utterances, it is hard to judge whether a word hypothesis of the query in the transcriptions or lattices is a positive target or a false alarm when the recognition output is unreliable. Although many efficient approaches [1, 2, 3] have been proposed to enhance the detection performance despite the relatively poor recognition output, compensating information taken directly from the feature space is still necessary.

In text-based information retrieval, even if the texts to be retrieved include all precise words, it is still difficult to retrieve all documents relevant to the query term because many of them do not include the very short query term entered by the user. However, because many related terms may co-occur in many related documents, a document containing some words that appear in documents identified as relevant by the search engine may have a high probability of being relevant, even if it does not include the query term. For example, a document including the words "George Bush", "US", "Middle East" may be relevant to a query term of "White House", even if it does not include the query term "White House". In other words, it is possible to retrieve the relevant documents without the query term since they are "similar" to some retrieved relevant documents in some way. Pseudo-relevance feedback, also known as blind relevance feedback, is one way to realize the above idea.
In this approach, it is assumed that the set of documents appearing at the top of the retrieved document list are relevant (or "pseudo-relevant"), so documents somehow similar to those "pseudo-relevant" documents can be retrieved, for example, by expanding the query with keywords from those "pseudo-relevant" documents [4]. A similar idea of pseudo-relevance feedback has been applied to spoken term detection [5].

In this paper, we try to perform similar pseudo-relevance feedback for spoken term detection as shown in Figure 1. The upper half of Figure 1 is the conventional spoken term detection. MFCC features are obtained from all spoken utterances in the archive, speech recognition produces lattices for the utterances, and the retrieval engine selects the utterances based on the relevance scores evaluated from the lattices with respect to the query Q entered by the user. The approach proposed in this paper is shown in the lower half of Figure 1. The first-pass returned list is not shown to the user, but instead a "pseudo-relevant utterance set X
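The reranking idea outlined in the abstract and introduction can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the authors' implementation: the text does not give the selection rule for the pseudo-relevant set or the formula that integrates relevance scores with DTW distances, so the top-N selection (`top_n`), the distance-to-similarity mapping `1 / (1 + d)`, and the linear interpolation weight (`weight`) are all hypothetical choices. The feature sequences are assumed to be the frames (MFCC vectors or phonetic posteriorgrams) of the region hypothesized to contain the query term.

```python
import math
from typing import Dict, List, Sequence, Tuple

Frame = Sequence[float]  # one feature vector, e.g. an MFCC frame or a posteriorgram


def frame_distance(a: Frame, b: Frame) -> float:
    """Euclidean distance between two feature frames."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(s: Sequence[Frame], t: Sequence[Frame]) -> float:
    """Standard DTW alignment cost, normalized by len(s) + len(t)."""
    n, m = len(s), len(t)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(s[i - 1], t[j - 1])
            # Allowed steps: insertion, deletion, match.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] / (n + m)


def rerank(
    first_pass: List[Tuple[str, float]],
    features: Dict[str, Sequence[Frame]],
    top_n: int = 2,
    weight: float = 0.5,
) -> List[Tuple[str, float]]:
    """Rerank a first-pass list (sorted by descending relevance score).

    ASSUMPTIONS (not from the paper): the top_n utterances form the
    pseudo-relevant set, similarity is 1 / (1 + mean DTW distance), and the
    final score is a linear interpolation of relevance and similarity.
    """
    pseudo_relevant = [utt_id for utt_id, _ in first_pass[:top_n]]
    rescored = []
    for utt_id, relevance in first_pass:
        # Do not compare an utterance with itself.
        refs = [p for p in pseudo_relevant if p != utt_id]
        if refs:
            mean_dist = sum(
                dtw_distance(features[utt_id], features[p]) for p in refs
            ) / len(refs)
        else:
            mean_dist = 0.0
        similarity = 1.0 / (1.0 + mean_dist)
        rescored.append((utt_id, (1.0 - weight) * relevance + weight * similarity))
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)
```

With such a scheme, an utterance that received a low relevance score from a poorly matched recognizer can still move up the list if its features align closely with those of the pseudo-relevant utterances.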
[1] Bhuvana Ramabhadran et al., "Balancing false alarms and hits in Spoken Term Detection," 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, 2010.
[2] Simon King et al., "Term-dependent confidence for out-of-vocabulary term detection," INTERSPEECH, 2009.
[3] Jithendra Vepa et al., "Using posterior-based features in template matching for speech recognition," INTERSPEECH, 2006.
[4] Timothy J. Hazen et al., "Query-by-example spoken term detection using phonetic posteriorgram templates," 2009 IEEE Workshop on Automatic Speech Recognition & Understanding, 2009.
[5] Carmel Domshlak et al., "Better than the real thing?: iterative pseudo-query processing using cluster-based language models," SIGIR '05, 2005.
[6] Bhuvana Ramabhadran et al., "Query-by-example Spoken Term Detection For OOV terms," 2009 IEEE Workshop on Automatic Speech Recognition & Understanding, 2009.
[7] Sridha Sridharan et al., "Optimising Figure of Merit for phonetic spoken term detection," 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, 2010.
[8] Peng Yu et al., "Towards Spoken-Document Retrieval for the Internet: Lattice Indexing For Large-Scale Web-Search Architectures," NAACL, 2006.
[9] Lin-Shan Lee et al., "Integrating recognition and retrieval with user feedback: A new framework for spoken term detection," 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, 2010.
[10] Wei-Ying Ma et al., "Improving pseudo-relevance feedback in web information retrieval using web page segmentation," WWW '03, 2003.