Developing an SDR Test Collection from Japanese Lecture Audio Data

Lectures are among the most valuable genres of audiovisual data, but spoken lectures are difficult to reuse because browsing and searching within them efficiently is hard. To promote research on spoken lecture retrieval, this paper presents a test collection for its evaluation. The test collection consists of a target document set of about 2,700 spoken lectures (604 hours) taken from the Corpus of Spontaneous Japanese (CSJ), 39 retrieval queries, the relevant passages in the target documents for each query, and automatic transcriptions of the target speech data. We report retrieval performance on the constructed test collection obtained by applying a standard spoken document retrieval (SDR) method, which serves as a baseline for forthcoming SDR studies using the collection. We also introduce several studies conducted by users of the test collection.
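As a rough illustration of what such a baseline can look like (not necessarily the authors' exact method), the sketch below ranks automatic lecture transcripts against a text query with TF-IDF weighting and cosine similarity, a common starting point for SDR over ASR output. The document IDs and transcript strings are placeholders invented for the example; real CSJ transcripts are Japanese and would first need morphological segmentation.

```python
import math
from collections import Counter

# Placeholder "transcripts": in the real collection these would be ASR
# outputs of CSJ lectures (Japanese text, segmented into morphemes).
transcripts = {
    "lec001": "speech recognition of spontaneous japanese lectures",
    "lec002": "retrieval of relevant passages from spoken documents",
    "lec003": "browsing systems for classroom lecture video",
}

def tf_idf_index(docs):
    """Build TF-IDF vectors over whitespace tokens (bag of words)."""
    tokenized = {d: text.split() for d, text in docs.items()}
    df = Counter()                      # document frequency per term
    for toks in tokenized.values():
        df.update(set(toks))
    n = len(docs)
    index = {}
    for d, toks in tokenized.items():
        tf = Counter(toks)
        index[d] = {t: (1 + math.log(c)) * math.log(n / df[t])
                    for t, c in tf.items()}
    return index

def cosine(q_vec, d_vec):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * d_vec.get(t, 0.0) for t, w in q_vec.items())
    nq = math.sqrt(sum(w * w for w in q_vec.values()))
    nd = math.sqrt(sum(w * w for w in d_vec.values()))
    return dot / (nq * nd) if nq and nd else 0.0

index = tf_idf_index(transcripts)
# Query terms weighted by raw frequency only, a common simplification.
q_vec = dict(Counter("spoken document retrieval".split()))
ranking = sorted(index, key=lambda d: cosine(q_vec, index[d]), reverse=True)
print(ranking)  # lecture IDs ordered by decreasing similarity
```

Retrieving whole lectures this way ignores that relevance judgments in the collection are passage-level; a practical baseline would typically segment each transcript into overlapping passages and rank those instead, and a document-length normalization scheme would matter for lectures of very uneven duration.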
