Enhancing sentence-level clustering with ranking-based clustering framework for theme-based summarization

Sentence clustering plays a pivotal role in theme-based summarization, which discovers topic themes, defined as clusters of highly related sentences, in order to avoid redundancy and cover more diverse information. Because sentences are short and carry limited content, the bag-of-words cosine similarity traditionally used for document clustering is no longer suitable, and sentence similarity requires special treatment. In this paper, we propose a ranking-based clustering framework that uses the ranking distributions of documents and terms to help generate high-quality sentence clusters. The effectiveness of the proposed framework is demonstrated by both a cluster quality analysis and a summarization evaluation conducted on the DUC 2004 and DUC 2007 datasets.
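To illustrate why bag-of-words cosine similarity is a poor fit for sentences, the following minimal sketch computes it over raw term counts. It is not the proposed framework; the function name `bow_cosine`, the whitespace tokenization, and the example sentences are assumptions made for illustration only.

```python
import math
from collections import Counter


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between two texts using raw term counts.

    Hypothetical helper for illustration: tokens are lowercased and
    split on whitespace, with no stemming, stop-word removal, or weighting.
    """
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    # The dot product only involves terms that appear in both texts.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


# Two sentences on the same theme that share no surface terms.
s1 = "the storm flooded coastal towns"
s2 = "hurricane waters inundated seaside villages"
# A near-duplicate of s1 that differs by one word.
s3 = "the storm flooded coastal towns overnight"

print(bow_cosine(s1, s2))  # no shared terms, so the score is 0.0
print(bow_cosine(s1, s3))  # high score driven purely by word overlap
```

With so few terms per sentence, the score depends almost entirely on exact word overlap, so related sentences that use different vocabulary look unrelated. This is the gap that additional evidence, such as the ranking distributions of documents and terms, is meant to address.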
