Content-Based Search in Multilingual Audiovisual Documents Using the International Phonetic Alphabet

This paper presents an approach that uses the International Phonetic Alphabet (IPA) for content-based indexing and retrieval of multilingual audiovisual documents. The approach works even when the languages spoken in a document are unknown. It was validated in the "Star Challenge" search engine competition organized by the Agency for Science, Technology and Research (A*STAR) of Singapore. The approach comprises an IPA-based multilingual acoustic model and a dynamic-programming method that searches document segments by "IPA string spotting". Dynamic programming makes it possible to retrieve the query string within the document string even when the phone-level transcription has a significant error rate. Our methods ranked first and third on the monolingual (English) search task, fifth on the multilingual search task, and first on the multimodal (audio and image) search task.
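To illustrate how dynamic programming can find a query inside an error-prone phone transcription, here is a minimal sketch of approximate substring matching over IPA symbols. It is not the system described above: the function name `spot_ipa_string`, the unit costs for substitution, insertion, and deletion, and the `max_errors` threshold are all assumptions chosen for illustration, and a real system would likely weight costs by phone confusability.

```python
def spot_ipa_string(query, document, max_errors):
    """Find approximate occurrences of `query` inside `document`.

    Both arguments are lists of IPA phone symbols. Returns a list of
    (end_index, cost) pairs, where end_index is the exclusive end position
    in `document` of a substring matching `query` with edit cost at most
    `max_errors`. Unit costs are an illustrative assumption.
    """
    m = len(query)
    # prev[i]: minimum edit cost of aligning query[:i] with some document
    # substring ending at the previous document position.
    prev = list(range(m + 1))
    hits = []
    for j, doc_phone in enumerate(document, start=1):
        # curr[0] stays 0 so that a match may start at any document position.
        curr = [0] * (m + 1)
        for i in range(1, m + 1):
            substitution = prev[i - 1] + (query[i - 1] != doc_phone)
            extra_doc_phone = prev[i] + 1       # spurious phone in the document
            missing_doc_phone = curr[i - 1] + 1  # query phone absent from the document
            curr[i] = min(substitution, extra_doc_phone, missing_doc_phone)
        if curr[m] <= max_errors:
            hits.append((j, curr[m]))
        prev = curr
    return hits


# Hypothetical example: the document transcription has "ɛ" where the query has "ə".
query = ["h", "ə", "l", "oʊ"]
document = ["ð", "ə", "h", "ɛ", "l", "oʊ", "w", "ɜ", "l", "d"]
print(spot_ipa_string(query, document, max_errors=1))
```

Setting the first row of the table to zero is what turns ordinary edit distance into spotting: the query is free to begin anywhere in the document, so one pass over the document yields every end position whose best alignment stays within the error budget.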
