Extract Salient Words with WordRank for Effective Similarity Search in Text Data

We propose WordRank, a method that extracts a few salient words from a target document and then uses these words to retrieve similar documents with popular retrieval functions. The set of extracted words is a concise, topic-oriented representation of the target document that reduces ambiguous and noisy information, which improves retrieval performance. Experimental results demonstrate the effectiveness of the proposed approach.
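The two-stage idea (extract salient words, then query with them) can be illustrated with a minimal Python sketch. The abstract does not say how WordRank scores words, so the sketch swaps in a TextRank-style PageRank over a word co-occurrence graph as an assumption, and a length-normalised TF-IDF match score stands in for the "popular retrieval functions". The names `tokenize`, `salient_words`, `rank_documents`, the stopword list, and every parameter value are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter, defaultdict

# Tiny illustrative stopword list (an assumption, not from the paper).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "with", "from"}

def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def salient_words(text, k=3, window=2, damping=0.85, iters=50):
    """Return the k top-scoring words under PageRank on a co-occurrence graph.

    Assumed stand-in for WordRank's scoring: words are linked when they
    appear within `window` positions of each other after stopword removal.
    """
    words = tokenize(text)
    neighbors = defaultdict(set)
    for i, w in enumerate(words):
        for v in words[i + 1 : i + 1 + window]:
            if v != w:
                neighbors[w].add(v)
                neighbors[v].add(w)
    score = {w: 1.0 for w in neighbors}
    for _ in range(iters):
        # Synchronous update: every new score is computed from the old scores.
        score = {
            w: (1 - damping)
            + damping * sum(score[v] / len(neighbors[v]) for v in neighbors[w])
            for w in neighbors
        }
    return sorted(score, key=lambda w: (-score[w], w))[:k]

def rank_documents(query_words, corpus):
    """Return corpus indices sorted by a length-normalised TF-IDF match score."""
    docs = [Counter(tokenize(d)) for d in corpus]
    n = len(docs)

    def idf(w):
        df = sum(1 for d in docs if w in d)
        return math.log((n + 1) / (df + 1)) + 1  # smoothed IDF

    def score(d):
        length = sum(d.values()) or 1
        return sum(d[w] * idf(w) for w in query_words) / length

    return sorted(range(n), key=lambda i: -score(docs[i]))

target = ("Graph ranking scores words in a word graph. "
          "Ranking words on the graph finds salient words.")
corpus = [
    "Cooking pasta with tomato sauce and basil.",
    "Ranking words on a graph to find salient words in documents.",
    "Football results from the weekend matches.",
]

keywords = salient_words(target, k=3)     # short query built from the target
ranked = rank_documents(keywords, corpus) # most similar document first
print(keywords, ranked)
```

The point of the sketch is the design choice the abstract describes: the full target document is never used as the query. Only the few extracted words reach the retrieval function, so terms that are off-topic in the target cannot contribute to the match score.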
