Paper information - Vector model improvement using suffix trees

There are many ways to search for documents in document collections. These methods rely on Boolean, vector, probabilistic, and other models to represent documents and queries, together with rules and procedures that determine the correspondence between user requests and documents. Each of these models has several restrictions that prevent a user from finding all relevant documents: the system returns many irrelevant documents, and some relevant documents are missed entirely. This article proposes a new method that uses suffix trees to improve the vector query. The method treats a document as a set of phrases (sentences), not just as a set of words. A sentence has a specific semantic meaning because its words are ordered. This is an advantage over treating the document as a bag of words.
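The contrast between a bag of words and an ordered phrase view can be illustrated with a minimal Python sketch. It is not the authors' implementation: it enumerates word-level suffixes naively instead of building a real suffix tree, and the helper names (`bag_of_words`, `word_suffixes`, `shared_phrases`) and the two sample sentences are assumptions made for illustration only.

```python
from collections import Counter


def bag_of_words(text):
    """Unordered word counts: the plain vector-model view of a document."""
    return Counter(text.lower().split())


def word_suffixes(sentence):
    """All word-level suffixes of a sentence.

    These are the strings a word-based suffix tree would index; here they
    are simply listed (hypothetical helper, not taken from the paper).
    """
    words = sentence.lower().split()
    return [tuple(words[i:]) for i in range(len(words))]


def shared_phrases(sent_a, sent_b, min_len=2):
    """Contiguous word runs of at least min_len words found in both sentences.

    Every phrase is a prefix of some suffix, so enumerating prefixes of all
    suffixes yields all phrases. This is quadratic and only a stand-in for
    the linear-time suffix tree construction.
    """
    def phrases(sentence):
        result = set()
        for suffix in word_suffixes(sentence):
            for end in range(min_len, len(suffix) + 1):
                result.add(suffix[:end])
        return result

    return phrases(sent_a) & phrases(sent_b)


# Two sentences with identical words but different order (illustrative data).
doc_a = "the dog bit the man"
doc_b = "the man bit the dog"

same_bag = bag_of_words(doc_a) == bag_of_words(doc_b)
common = shared_phrases(doc_a, doc_b)

print("identical bag of words:", same_bag)
print("shared phrases:", sorted(common))
```

The two sentences are indistinguishable as bags of words, yet they share only a few short phrases, which is the kind of ordering information a phrase-based representation can add to a vector query.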

Václav Snásel | Jan Martinovic | Tomás Novosad
