Finding "Similar but Different" Documents Based on Coordinate Relationship

Traditional search technologies are based on similarity relationships: given a document, they return documents with similar content. However, similarity-based search does not always produce good results. For example, highly similar documents add little new information, which makes it difficult to increase information gain. In this paper, we propose a method to find similar but different documents for a user-given document by distinguishing the coordinate relationship from the similarity relationship between documents. Simply put, a similar but different document shares the topic of the given document but describes different events or concepts. For example, given a news article reporting the Oregon school shooting as input, articles reporting other school shootings, such as the Virginia Tech shooting, are detected and returned to the user. Experiments conducted on the New York Times Annotated Corpus verify the effectiveness of our method and illustrate the importance of incorporating the coordinate relationship when finding similar but different documents.
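
The notion of "same topic, different event" can be illustrated with a small toy sketch. This is not the method proposed in the paper, whose details are not given in the abstract. It assumes, hypothetically, that each document has already been split by hand into `topic_terms` and `event_terms`, and it uses an illustrative score (topic cosine similarity multiplied by one minus the Jaccard overlap of event terms). All function names, field names, and example documents are invented for illustration.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words term counts."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard(a: set, b: set) -> float:
    """Jaccard overlap between two term sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_but_different_score(query: dict, cand: dict) -> float:
    """Illustrative score (an assumption, not the paper's formula):
    high when topics match and event-specific terms do not."""
    topic_sim = cosine(Counter(query["topic_terms"]), Counter(cand["topic_terms"]))
    event_overlap = jaccard(set(query["event_terms"]), set(cand["event_terms"]))
    return topic_sim * (1.0 - event_overlap)


def rank(query: dict, candidates: list) -> list:
    """Sort candidates by the illustrative score, best first."""
    return sorted(
        candidates,
        key=lambda c: similar_but_different_score(query, c),
        reverse=True,
    )


# Hand-labeled toy documents (hypothetical data).
query = {
    "id": "oregon",
    "topic_terms": ["school", "shooting", "gunman", "campus"],
    "event_terms": ["oregon"],
}
candidates = [
    # Same topic, same event: plain similarity search would rank this first.
    {"id": "oregon_followup",
     "topic_terms": ["school", "shooting", "gunman", "campus"],
     "event_terms": ["oregon"]},
    # Same topic, different event: the kind of document the abstract targets.
    {"id": "virginia_tech",
     "topic_terms": ["school", "shooting", "campus", "students"],
     "event_terms": ["virginia", "tech"]},
    # Different topic altogether.
    {"id": "election",
     "topic_terms": ["election", "senate"],
     "event_terms": ["ohio"]},
]

ranked = rank(query, candidates)
print([doc["id"] for doc in ranked])
```

In this toy setup the follow-up article about the same event scores zero because its event terms fully overlap with the query, while the Virginia Tech article keeps its topic similarity and ranks first. Separating topic terms from event terms automatically is the hard part, and it is left out of this sketch.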
