Document similarity search finds documents in a text corpus that are similar to a query document and returns them to the user as a ranked list; it is widely used in recommender systems for libraries and web applications. The usual approach computes the similarity between the query document and each document in the corpus and then ranks the documents by that similarity. In this paper, we investigate the use of document summarization techniques to improve the effectiveness of document similarity search. In the proposed summary-based approach, the query document is first summarized, and the similarity search is then performed with the produced summary as the new query instead of the original document. Different retrieval models and different summarization methods are investigated in the experiments. Experimental results demonstrate that the summary-based similarity search achieves higher effectiveness than search with the original document.
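The summary-based pipeline can be illustrated with a minimal sketch. The abstract does not say which summarizer or retrieval model is used, so everything below is an assumption made for illustration: the retrieval model is TF-IDF with cosine similarity, the summarizer is a simple extractive one that keeps the sentences with the highest mean IDF weight, and all function names (`summarize`, `similarity_search`, and the helpers) are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphabetic runs (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def idf_table(corpus_tokens):
    """Smoothed inverse document frequency computed over the corpus."""
    n = len(corpus_tokens)
    df = Counter(term for tokens in corpus_tokens for term in set(tokens))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf(tokens, idf):
    """Sparse TF-IDF vector; terms that never occur in the corpus are ignored."""
    tf = Counter(tokens)
    return {term: count * idf[term] for term, count in tf.items() if term in idf}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def summarize(document, idf, num_sentences=2):
    """Assumed extractive summarizer: keep the sentences with the highest
    mean IDF weight, returned in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]

    def score(sentence):
        tokens = tokenize(sentence)
        return sum(idf.get(t, 0.0) for t in tokens) / len(tokens) if tokens else 0.0

    top = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    return " ".join(sentences[i] for i in sorted(top[:num_sentences]))


def similarity_search(query_document, corpus, use_summary=True, num_sentences=2):
    """Rank corpus documents against the query document (or its summary).

    Returns a list of (score, corpus_index) pairs, best match first.
    """
    corpus_tokens = [tokenize(doc) for doc in corpus]
    idf = idf_table(corpus_tokens)
    query_text = summarize(query_document, idf, num_sentences) if use_summary else query_document
    query_vector = tfidf(tokenize(query_text), idf)
    scores = [(cosine(query_vector, tfidf(tokens, idf)), i)
              for i, tokens in enumerate(corpus_tokens)]
    return sorted(scores, key=lambda pair: (-pair[0], pair[1]))


corpus = [
    "Neural networks learn image features for object recognition.",
    "Stock markets fell as interest rates rose sharply.",
    "Bread recipes need flour, water, yeast and salt.",
]
query = ("Convolutional neural networks improve object recognition in images. "
         "The weather was pleasant that day. "
         "Researchers train networks on large image collections.")

ranked = similarity_search(query, corpus, use_summary=True)
```

Setting `use_summary=False` runs the baseline with the full query document, so the two rankings can be compared on the same corpus. In this toy example the summarizer drops the off-topic sentence about the weather, which is the kind of noise the summary-based query is intended to remove; whether that helps on a real collection depends on the summarizer and retrieval model chosen.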