Test Collections for Spoken Document Retrieval from Lecture Audio Data

The Spoken Document Processing Working Group, part of the special interest group on spoken language processing of the Information Processing Society of Japan, is developing a test collection for evaluating spoken document retrieval systems. A prototype of the test collection consists of a set of textual queries, lists of relevant segments, and transcriptions produced by an automatic speech recognition system; together these support retrieval from the Corpus of Spontaneous Japanese (CSJ). From about 100 initial queries, 39 were retained by requiring that each query have more than five relevant segments, where a segment is about one minute of speech. An ad hoc retrieval experiment was also conducted on the test collection, applying a standard spoken document retrieval method to assess baseline retrieval performance.
