Vector model improvement using suffix trees

There are many ways to search for documents in document collections. These methods rely on Boolean, vector, probabilistic, and other models to represent documents and queries, together with rules and procedures that determine the correspondence between user requests and documents. Each of these models has several restrictions, which prevent a user from finding all relevant documents: the system returns many irrelevant documents, and some relevant documents are missing altogether. This article proposes a new method that uses suffix trees to improve vector queries. The method treats a document as a set of phrases (sentences), not just as a set of words. A sentence has a specific semantic meaning because its words are ordered. This is an advantage over treating a document as a bag of words.
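The difference between a bag of words and an ordered phrase representation can be illustrated with a small sketch. It is not the article's algorithm: it assumes a naive word-level suffix trie built by inserting every suffix of a sentence (quadratic in sentence length), whereas a true suffix tree is compressed and can be built in linear time. The names `bag_of_words`, `WordSuffixTrie`, `add_sentence`, and `contains_phrase` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def bag_of_words(sentence):
    """Order-insensitive representation: word counts only."""
    return Counter(sentence.lower().split())


class WordSuffixTrie:
    """Naive word-level suffix trie (an illustrative stand-in for a suffix tree).

    Each node is a dict mapping a word to its child node.
    """

    def __init__(self):
        self.root = {}

    def add_sentence(self, sentence):
        words = sentence.lower().split()
        # Insert every suffix of the sentence, one word per edge.
        for start in range(len(words)):
            node = self.root
            for word in words[start:]:
                node = node.setdefault(word, {})

    def contains_phrase(self, phrase):
        """True if the words of `phrase` occur contiguously and in order
        in some indexed sentence."""
        node = self.root
        for word in phrase.lower().split():
            if word not in node:
                return False
            node = node[word]
        return True


doc_a = "dog bites man"
doc_b = "man bites dog"

# The two sentences are indistinguishable as bags of words.
same_bag = bag_of_words(doc_a) == bag_of_words(doc_b)

# The phrase index built from doc_a keeps word order.
trie = WordSuffixTrie()
trie.add_sentence(doc_a)
has_ordered = trie.contains_phrase("dog bites")
has_reversed = trie.contains_phrase("bites dog")

print(same_bag, has_ordered, has_reversed)
```

Here the two sentences have identical word counts, yet the phrase "bites dog" from the second sentence is not found in the index built from the first, which is the kind of ordering information a bag of words discards.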
