Integration of Collocation Statistics into the Probabilistic Retrieval Model

This paper presents a method for combining corpus information on word collocations with the probabilistic model of information retrieval. Term dependencies observed in the corpus are used to modify probabilistic retrieval, which otherwise rests on the term independence assumption. Collocates are derived from windows around term occurrences in the corpus. Two statistical measures, mutual information and Z-score, are applied to select significantly associated collocates, which are then used in query expansion. The paper also discusses a lexico-semantic analysis of the significant collocates and compares them with engineered term networks and thesauri.
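The window-based selection step can be illustrated with a minimal Python sketch. The abstract does not give the exact formulas, so the code assumes the common textbook forms: mutual information as the log ratio of observed to expected co-occurrences, and a Z-score that treats window positions as independent draws. The function names, the `window` parameter, and the thresholds are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def collocation_scores(tokens, node, window=2):
    """Score words found within +/- `window` tokens of each `node` occurrence.

    Returns {word: (mutual_information, z_score)}. Both formulas are assumed
    textbook variants, not necessarily the ones used in the paper.
    """
    n = len(tokens)
    freq = Counter(tokens)   # corpus frequency of every word
    joint = Counter()        # co-occurrence counts inside the windows

    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        lo, hi = max(0, i - window), min(n, i + window + 1)
        for j in range(lo, hi):
            if j != i:
                joint[tokens[j]] += 1

    span = sum(joint.values())  # total number of window positions observed
    scores = {}
    for word, observed in joint.items():
        p = freq[word] / n      # chance of seeing `word` at a random position
        if p >= 1.0:
            continue            # degenerate corpus with a single word type
        expected = p * span     # co-occurrences expected under independence
        mi = math.log2(observed / expected)
        z = (observed - expected) / math.sqrt(expected * (1.0 - p))
        scores[word] = (mi, z)
    return scores


def select_collocates(scores, min_mi=1.0, min_z=1.65):
    """Keep words passing both thresholds (threshold values are illustrative)."""
    return sorted(w for w, (mi, z) in scores.items() if mi >= min_mi and z >= min_z)
```

Words returned by `select_collocates` would be the candidates for query expansion. How the expansion terms are weighted within the probabilistic model is not described in the abstract and is left out of the sketch.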
