Paper Information - Statistical Sentence Extraction for Information Distillation

Statistical Sentence Extraction for Information Distillation

Information distillation aims to extract the most useful pieces of information related to a given query from massive, possibly multilingual, audio and textual document sources. One critical component in a distillation engine is detecting sentences to be extracted from each relevant document. In this paper, we present a statistical sentence extraction approach for distillation. We frame this task as a classification problem, where each candidate sentence in the documents is classified as relevant to the query or not. These documents may be in textual or audio format and in a number of languages. For audio documents, we use both manual and automatic transcriptions; for non-English documents, we use automatic translations. In this work, we use AdaBoost, a discriminative classification method, with both lexical and semantic features. The results indicate 11%-13% relative improvement over a baseline keyword-spotting-based approach. We also show the robustness of our method on the audio subset of the document sources using manual and automatic transcriptions.
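The classification framing in the abstract can be illustrated with a minimal sketch. It is not the authors' system: it assumes discrete AdaBoost over one-feature decision stumps, uses only word-presence features plus a hypothetical `q_overlap` flag for query overlap, and omits the semantic features entirely. The function names (`featurize`, `train_adaboost`, `score`) and the toy data are invented for illustration.

```python
import math

def featurize(sentence, query):
    """Lexical features only (an assumption): one feature per word in the
    sentence, plus a hypothetical flag set when it shares a word with the query."""
    words = set(sentence.lower().split())
    feats = {"w=" + w for w in words}
    if words & set(query.lower().split()):
        feats.add("q_overlap")
    return feats

def stump(feature, polarity, feats):
    """Weak learner: predict `polarity` if the feature is present, else the opposite."""
    return polarity if feature in feats else -polarity

def train_adaboost(examples, labels, rounds=5):
    """Discrete AdaBoost. examples: list of feature sets; labels: +1 or -1."""
    n = len(examples)
    weights = [1.0 / n] * n
    vocab = sorted(set().union(*examples))
    model = []  # list of (feature, polarity, alpha)
    for _ in range(rounds):
        # Pick the stump with the lowest weighted training error.
        best = None
        for f in vocab:
            for polarity in (1, -1):
                err = sum(w for w, x, y in zip(weights, examples, labels)
                          if stump(f, polarity, x) != y)
                if best is None or err < best[0]:
                    best = (err, f, polarity)
        err, f, polarity = best
        if err >= 0.5:  # no stump beats chance; stop early
            break
        err = max(err, 1e-10)  # avoid division by zero on a perfect stump
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((f, polarity, alpha))
        # Up-weight misclassified examples, then renormalize.
        weights = [w * math.exp(-alpha * y * stump(f, polarity, x))
                   for w, x, y in zip(weights, examples, labels)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model

def score(model, feats):
    """Weighted vote of the stumps; a positive score means 'relevant to the query'."""
    return sum(alpha * stump(f, polarity, feats) for f, polarity, alpha in model)

# Toy data (invented): candidate sentences labeled against a single query.
query = "earthquake damage"
sentences = [
    ("the earthquake caused severe damage downtown", 1),
    ("officials reported earthquake casualties", 1),
    ("the stock market closed higher today", -1),
    ("a new restaurant opened downtown", -1),
]
X = [featurize(s, query) for s, _ in sentences]
y = [label for _, label in sentences]
model = train_adaboost(X, y)

relevant = score(model, featurize("earthquake hits coastal town", query))
irrelevant = score(model, featurize("sunny weather expected tomorrow", query))
```

In this sketch a candidate sentence is extracted when its score is positive. A keyword-spotting baseline would correspond roughly to using the query-overlap flag alone, whereas boosting can combine it with other learned cues.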

Gökhan Tür | Dilek Z. Hakkani-Tür

[1] Gökhan Tür, et al. The AT&T spoken language understanding system, 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[2] Julia Hirschberg, et al. Comparing lexical, acoustic/prosodic, structural and discourse features for speech summarization, 2005, INTERSPEECH.

[3] W. Bruce Croft, et al. Indri: A language-model based search engine for complex queries (extended version), 2005.

[4] Donna K. Harman, et al. The Text REtrieval Conference (TREC), 1999, NTCIR.

[5] Francine Chen, et al. A trainable document summarizer, 1995, SIGIR '95.

[6] Hoa Trang Dang, et al. Overview of DUC 2005, 2005.

[7] Dilek Z. Hakkani-Tür, et al. The ICSI+ multilingual sentence segmentation system, 2006, INTERSPEECH.

[8] Ralph Grishman, et al. NYU's English ACE 2005 System Description, 2005.

[9] Jade Goldstein-Stewart, et al. Summarizing text documents: sentence selection and evaluation metrics, 1999, SIGIR '99.

[10] Yoram Singer, et al. BoosTexter: A Boosting-based System for Text Categorization, 2000, Machine Learning.