Vector Model Extended by Association Rules

This work presents an extension to the vector model that accounts for correlation among query terms by using association rules, a popular data mining technique. In Information Retrieval, the vector model retrieves a set of documents for a term-based query by representing both the query and the documents as vectors in a common vector space. Although the vector model has been used successfully for decades, until now there has been no practical and efficient mechanism that accounts for correlations among query terms within each document of the collection. The novelty of this work is a method for computing these correlations. The changes to the original vector model are minimal, and experimental results show that the extended vector model improves the precision of the results for all the collections evaluated.
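For readers unfamiliar with the two ingredients named above, the following is a minimal sketch of each in isolation. It is not the extended model proposed in this work, whose details are not given here. It assumes a standard tf-idf weighting with cosine ranking for the classic vector model, and the usual support and confidence measures for an association rule over term sets. All function names (`tf_idf_vectors`, `cosine`, `rank`, `support_confidence`) are hypothetical and chosen for illustration only.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Build sparse tf-idf vectors (term -> weight) for tokenized documents.

    Assumption: plain tf * log(N / df) weighting; other schemes exist.
    """
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log(n / df[t]) for t in df}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Classic vector model: rank documents by cosine similarity to the query."""
    vectors, idf = tf_idf_vectors(docs)
    # Query terms unseen in the collection get weight 0.
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query).items()}
    scores = [(cosine(q, v), i) for i, v in enumerate(vectors)]
    return sorted(scores, reverse=True)


def support_confidence(docs, antecedent, consequent):
    """Support and confidence of the rule antecedent -> consequent,
    treating each document as a transaction of terms."""
    a = set(antecedent)
    both = a | set(consequent)
    n_a = sum(1 for d in docs if a <= set(d))
    n_both = sum(1 for d in docs if both <= set(d))
    support = n_both / len(docs)
    confidence = n_both / n_a if n_a else 0.0
    return support, confidence
```

Calling `rank(["data", "mining"], docs)` on a list of tokenized documents returns `(score, document_index)` pairs in descending order of similarity, while `support_confidence(docs, ["data"], ["mining"])` measures how often the two terms co-occur. In the plain vector model the first computation ignores that co-occurrence entirely, which is the gap the extension described above aims to address.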
