Enhancing sentence‐level clustering with integrated and interactive frameworks for theme‐based summarization

Sentence clustering plays a pivotal role in theme-based summarization, which discovers topic themes, defined as clusters of highly related sentences, in order to avoid redundancy and cover more diverse information. Because sentences are short and carry limited content, the bag-of-words cosine similarity traditionally used for document clustering is no longer suitable, and measuring sentence similarity requires special treatment. In this article, we study the sentence-level clustering problem. After exploiting concept- and context-enriched sentence vector representations, we develop two co-clustering frameworks to enhance sentence-level clustering for theme-based summarization: integrated clustering and interactive clustering. Both allow words and documents to play an explicit role in sentence clustering as independent text objects, rather than treating words or concepts as features of a sentence in a document set. In each framework, we experiment with two-level co-clustering (i.e., sentence-word or sentence-document co-clustering) and three-level co-clustering (i.e., document-sentence-word co-clustering). Compared with concept- and context-oriented sentence-representation reformation, co-clustering shows a clear advantage in both intrinsic clustering-quality evaluation and extrinsic summarization evaluation conducted on the Document Understanding Conferences (DUC) datasets. (Xiaoyan Cai is now at the College of Information Engineering, Northwest A&F University.)
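
The abstract's point about bag-of-words cosine similarity on short sentences can be illustrated with a minimal sketch. This is not the article's method. The function name `bow_cosine`, the whitespace tokenization, the raw term-frequency weighting, and the three example sentences are all assumptions made for illustration.

```python
import math
from collections import Counter


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between raw term-frequency vectors of two texts.

    Hypothetical helper: tokenizes on whitespace after lowercasing,
    with no stemming, stop-word removal, or concept enrichment.
    """
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    # Only words present in both texts contribute to the dot product.
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


# Illustrative sentences (invented for this sketch).
s1 = "the storm flooded coastal towns"
s2 = "heavy rain inundated seaside villages"  # same theme, no shared words
s3 = "the storm flooded several towns"        # same theme, four shared words

print(bow_cosine(s1, s2))  # exactly zero: the two sentences share no words
print(bow_cosine(s1, s3))  # high: four of five words overlap
```

With so few words per sentence, two sentences on the same theme can easily share no vocabulary at all, so their bag-of-words similarity is zero even though they belong in the same cluster. This is the gap that enriched representations and co-clustering with words and documents aim to close.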
