A Comparative Study of Probabilistic Ranking Models for Chinese Spoken Document Summarization

Extractive document summarization automatically selects a number of indicative sentences, passages, or paragraphs from an original document according to a target summarization ratio and sequences them to form a concise summary. In this article, we present a comparative study of various probabilistic ranking models for spoken document summarization, including supervised classification-based summarizers and unsupervised probabilistic generative summarizers. We also investigate the use of unsupervised summarizers to improve the performance of supervised summarizers when manual labels are not available for training the latter. To this end, we explore a novel training data selection approach that leverages the relevance information of spoken sentences to select reliable document-summary pairs, derived by the probabilistic generative summarizers, for training the classification-based summarizers. Initial results on Mandarin Chinese broadcast news data are encouraging.
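As a rough illustration of the extractive setting described above, the following sketch ranks sentences with a simple unsupervised generative score and keeps the top fraction in their original order. It is not the models studied in this article: the scoring rule (the log-likelihood of the document under a smoothed unigram model of each sentence), the interpolation weight `lam`, the sentence-level interpretation of the summarization ratio, and the function names `sentence_score` and `summarize` are all assumptions made for the example.

```python
import math
from collections import Counter


def sentence_score(sentence, document, background, lam=0.5):
    """Log-likelihood of the document under a smoothed unigram model of the sentence.

    `sentence` and `document` are lists of tokens; `background` maps each
    document token to its relative frequency (an assumed smoothing source).
    """
    counts = Counter(sentence)
    length = len(sentence)
    score = 0.0
    for word in document:
        p_sentence = counts[word] / length if length else 0.0
        # Interpolate with the background model so unseen words keep nonzero mass.
        score += math.log(lam * p_sentence + (1.0 - lam) * background[word])
    return score


def summarize(sentences, ratio=0.3, lam=0.5):
    """Select the top-scoring sentences up to `ratio` and return them in document order."""
    document = [word for sentence in sentences for word in sentence]
    if not document:
        return []
    total = len(document)
    background = {w: c / total for w, c in Counter(document).items()}
    scores = [sentence_score(s, document, background, lam) for s in sentences]
    # Assumption: the ratio is applied to the number of sentences, not words.
    budget = max(1, round(ratio * len(sentences)))
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:budget])  # restore original ordering
    return [sentences[i] for i in chosen]


if __name__ == "__main__":
    doc = [
        ["the", "weather", "is", "nice"],
        ["stocks", "fell", "sharply", "today"],
        ["stocks", "and", "bonds", "fell"],
        ["hello"],
    ]
    for sentence in summarize(doc, ratio=0.5):
        print(" ".join(sentence))
```

A supervised classification-based summarizer would replace `sentence_score` with the output of a trained classifier over sentence features; the selection and re-sequencing step would stay the same.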
