Graph-Based Text Representation for Novelty Detection

We discuss several feature sets for sentence-level novelty detection, using the data and procedure established in task 2 of the TREC 2004 novelty track. In particular, we investigate feature sets derived from graph representations of sentences and of sets of sentences. We show that a highly connected graph, built from sentence-level term distances and pointwise mutual information, can serve as a source of features for novelty detection, and we compare several feature sets based on this graph representation. These feature sets increase the accuracy of an initial novelty classifier that is based on a bag-of-words representation and KL divergence. The final result ties with the best system at TREC 2004.
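
As a rough illustration of the two ingredients named above, the following minimal sketch computes a KL-divergence novelty score over bag-of-words counts and a set of term-graph edges weighted by pointwise mutual information. It is not the procedure used in the paper: the function names `kl_novelty` and `pmi_edges`, the interpolation weight `lam`, the smoothing scheme, and the restriction of PMI to co-occurrence within a single sentence (rather than across sentence-level distances) are all assumptions made for the example.

```python
import math
from collections import Counter
from itertools import combinations


def kl_novelty(sentence_tokens, history_tokens, lam=0.5):
    """KL(sentence || history) over bag-of-words counts.

    Both distributions are interpolated with a shared background model
    (sentence + history counts) so that no probability is zero.
    The smoothing scheme and `lam` are assumptions of this sketch.
    """
    if not sentence_tokens or not history_tokens:
        raise ValueError("both token lists must be non-empty")
    s = Counter(sentence_tokens)
    h = Counter(history_tokens)
    background = s + h
    s_total = sum(s.values())
    h_total = sum(h.values())
    b_total = sum(background.values())
    score = 0.0
    for term, b_count in background.items():
        p_bg = b_count / b_total
        p_s = lam * s[term] / s_total + (1 - lam) * p_bg
        p_h = lam * h[term] / h_total + (1 - lam) * p_bg
        score += p_s * math.log(p_s / p_h)
    return score


def pmi_edges(sentences):
    """Return {(u, v): pmi} for term pairs with positive PMI.

    Probabilities are estimated from sentence-level occurrence:
    PMI(u, v) = log( n * c(u, v) / (c(u) * c(v)) ),
    where c(.) counts sentences and n is the number of sentences.
    Only co-occurrence within one sentence is counted here (assumption).
    """
    n = len(sentences)
    term_sets = [set(tokens) for tokens in sentences]
    term_count = Counter(t for ts in term_sets for t in ts)
    pair_count = Counter(
        pair for ts in term_sets for pair in combinations(sorted(ts), 2)
    )
    edges = {}
    for (u, v), c in pair_count.items():
        pmi = math.log(c * n / (term_count[u] * term_count[v]))
        if pmi > 0:
            edges[(u, v)] = pmi
    return edges


# Usage: score a candidate sentence against previously seen text,
# and build weighted edges from a small tokenized collection.
history = "the cat sat on the mat".split()
candidate = "stocks fell sharply".split()
novelty = kl_novelty(candidate, history)
graph = pmi_edges([history, candidate])
```

A higher `kl_novelty` value means the candidate's word distribution diverges more from the text seen so far. Quantities derived from the weighted edges could then be added as further features for a classifier, which is the general direction the abstract describes.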
