Exploiting neighborhood knowledge for single document summarization and keyphrase extraction

Document summarization and keyphrase extraction are two related tasks in information retrieval (IR) and natural language processing (NLP); both aim to extract a condensed representation from a single text document. Existing methods for single document summarization and keyphrase extraction usually use only the information contained in the specified document. This article proposes using a small number of nearest neighbor documents to improve both tasks for the specified document, on the assumption that the neighbor documents provide additional knowledge and more clues. The specified document is expanded to a small document set by adding a few neighbor documents close to it, and a graph-based ranking algorithm is then applied to the expanded set so that it draws on both the local information in the specified document and the global information in the neighbor documents. Experimental results on the Document Understanding Conference (DUC) benchmark datasets demonstrate the effectiveness and robustness of our proposed approaches. The cross-document sentence relationships in the expanded document set are shown to benefit single document summarization, and the word co-occurrence relationships in the neighbor documents are shown to be very helpful for single document keyphrase extraction.
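
The expand-then-rank idea described above can be illustrated with a minimal sketch for the summarization side only. It is not the article's implementation: the tokenizer, the raw term-frequency cosine similarity, the damping factor, and every function name (`nearest_neighbors`, `rank_sentences`, `summarize`) are assumptions chosen for illustration, and the ranking step is a generic PageRank-style power iteration over a sentence similarity graph.

```python
import math
import re
from collections import Counter


def tf_vector(text):
    """Term-frequency vector over lowercased alphanumeric tokens (assumed tokenizer)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def split_sentences(text):
    """Naive sentence splitter on periods (assumption; a real system would do better)."""
    return [s.strip() for s in text.split(".") if s.strip()]


def nearest_neighbors(doc, corpus, k):
    """Return the k corpus documents most similar to doc."""
    target = tf_vector(doc)
    return sorted(corpus, key=lambda d: cosine(target, tf_vector(d)), reverse=True)[:k]


def rank_sentences(sentences, damping=0.85, iterations=50):
    """PageRank-style power iteration over a weighted sentence similarity graph."""
    n = len(sentences)
    if n == 0:
        return []
    vecs = [tf_vector(s) for s in sentences]
    # Edge weight: cosine similarity between distinct sentences, no self-loops.
    w = [[cosine(vecs[i], vecs[j]) if i != j else 0.0 for j in range(n)] for i in range(n)]
    out = [sum(row) for row in w]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(w[j][i] / out[j] * scores[j] for j in range(n) if out[j] > 0)
            for i in range(n)
        ]
    return scores


def summarize(doc, corpus, k=2, length=1):
    """Rank sentences on the expanded set, then keep only sentences from doc."""
    own = split_sentences(doc)
    neighbor_sentences = [s for d in nearest_neighbors(doc, corpus, k) for s in split_sentences(d)]
    scores = rank_sentences(own + neighbor_sentences)
    best = sorted(range(len(own)), key=lambda i: scores[i], reverse=True)[:length]
    return [own[i] for i in sorted(best)]  # restore original sentence order
```

In this sketch the neighbor sentences take part in the ranking but are never selected, so a sentence in the specified document gains score when neighbor documents contain similar sentences. Keyphrase extraction could follow the same pattern with a word co-occurrence graph in place of the sentence graph.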
