Document Similarity Search Based on Generic Summaries

Document similarity search finds the documents in a text corpus that are similar to a query document and returns them to the user as a ranked list; it is widely used in recommender systems for library and web applications. The popular approach computes the similarity between the query document and each document in the corpus and then ranks the documents by that similarity. In this paper, we investigate the use of document summarization techniques to improve the effectiveness of document similarity search. In the proposed summary-based approach, the query document is first summarized, and the similarity search is then performed with the produced summary as the new query instead of the original document. Different retrieval models and different summarization methods are investigated in the experiments. Experimental results demonstrate that the summary-based similarity search is more effective than searching with the full query document.
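
The summary-based pipeline described above can be illustrated with a minimal sketch. Everything concrete in it is an assumption made for illustration and is not taken from the paper: the frequency-based extractive summarizer `summarize`, the TF-IDF cosine retrieval model, and the names `tokenize`, `tfidf_vector`, `cosine`, and `rank_by_summary` are all hypothetical stand-ins for whichever summarization method and retrieval model are actually evaluated.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def summarize(document, max_sentences=2):
    """Hypothetical extractive summarizer (an assumption, not the paper's method).

    Scores each sentence by the average in-document frequency of its tokens
    and keeps the top sentences in their original order.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    freq = Counter(tokenize(document))

    def score(sentence):
        tokens = tokenize(sentence)
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in keep)


def tfidf_vector(tokens, idf):
    """Build a sparse TF-IDF vector; terms unseen in the corpus get weight 0."""
    tf = Counter(tokens)
    return {t: count * idf.get(t, 0.0) for t, count in tf.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_summary(query_document, corpus, max_sentences=2):
    """Rank corpus indices by similarity to the summary of the query document."""
    tokenized = [tokenize(doc) for doc in corpus]
    n = len(corpus)
    # Document frequency of each term, then a smoothed IDF (an assumed variant).
    df = Counter(t for tokens in tokenized for t in set(tokens))
    idf = {t: math.log((n + 1) / (c + 1)) + 1.0 for t, c in df.items()}

    # The summary, not the full document, is used as the query.
    summary = summarize(query_document, max_sentences)
    query_vec = tfidf_vector(tokenize(summary), idf)

    scores = [(cosine(query_vec, tfidf_vector(tokens, idf)), i)
              for i, tokens in enumerate(tokenized)]
    # Highest similarity first; ties are broken by corpus order.
    return [i for _, i in sorted(scores, key=lambda p: (-p[0], p[1]))]
```

Calling `rank_by_summary(query_document, corpus)` returns corpus indices ordered from most to least similar. A baseline that uses the full document as the query can be obtained by replacing the `summarize` call with the unmodified `query_document`, which is the comparison the experiments are concerned with.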