Pure High-Order Word Dependence Mining via Information Geometry

Classical bag-of-words models fail to capture contextual associations between words. We propose to investigate the "high-order pure dependence" among a group of words forming a semantic entity, i.e., high-order dependence that cannot be reduced to the random coincidence of lower-order dependences. We believe that identifying these high-order pure dependence patterns will lead to a better representation of documents. We first present two formal definitions of pure dependence: Unconditional Pure Dependence (UPD) and Conditional Pure Dependence (CPD). Deciding UPD or CPD exactly, however, is an NP-hard problem. We therefore prove a series of sufficient criteria that entail UPD and CPD within the well-principled Information Geometry (IG) framework, leading to a more feasible UPD/CPD identification procedure. We further develop novel methods to extract word patterns with high-order pure dependence, which can then be used to extend the original unigram document models. Our methods are evaluated in the context of query expansion. Compared with the original unigram model and its extensions with term associations derived from fixed-length n-grams and Apriori association rule mining, our IG-based methods are mathematically more rigorous and empirically more effective.
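To make the pure-dependence idea concrete: in the information-geometric (log-linear) coordinates of Amari's hierarchy of distributions, the joint distribution of three binary word-occurrence variables x1, x2, x3 decomposes as

    log p(x1, x2, x3) = θ1 x1 + θ2 x2 + θ3 x3 + θ12 x1 x2 + θ13 x1 x3 + θ23 x2 x3 + θ123 x1 x2 x3 − ψ(θ),

and a nonzero third-order coordinate θ123 signals dependence among the three words that cannot be explained by the pairwise terms alone. The sketch below estimates θ123 from parallel 0/1 term-occurrence vectors; it is a minimal illustration of this standard IG quantity under stated assumptions, not the paper's actual UPD/CPD identification procedure. The function name theta_123, the smoothing constant alpha, and the example data are our own hypothetical choices.

    import math
    from itertools import product

    def theta_123(x1, x2, x3, alpha=0.5):
        """Closed-form third-order log-linear coordinate for three binary
        variables, estimated from parallel 0/1 occurrence vectors.
        alpha is an additive smoothing count to avoid log(0)
        (an assumption of this sketch, not from the paper).
        """
        assert len(x1) == len(x2) == len(x3)
        # Count the 8 joint occurrence cells (0,0,0) ... (1,1,1).
        counts = {cell: alpha for cell in product((0, 1), repeat=3)}
        for triple in zip(x1, x2, x3):
            counts[triple] += 1
        n = sum(counts.values())
        logp = {cell: math.log(counts[cell] / n) for cell in counts}
        # For the saturated binary log-linear model,
        #   theta_123 = log[ p111 p100 p010 p001 / (p110 p101 p011 p000) ],
        # i.e. the alternating sum of log-probabilities with sign
        # (-1)^(number of zeros in the cell).
        return sum((-1) ** (3 - sum(cell)) * logp[cell]
                   for cell in counts)

    # Hypothetical usage: occurrence of three words across six documents.
    occ = {"new":   [1, 1, 0, 1, 0, 1],
           "york":  [1, 1, 0, 1, 0, 0],
           "times": [1, 0, 0, 1, 0, 0]}
    print(theta_123(occ["new"], occ["york"], occ["times"]))

In practice one would also test whether the estimated θ123 differs significantly from zero, e.g., via a likelihood-ratio test against the submodel with θ123 = 0 fitted by iterative proportional fitting; the paper's sufficient IG criteria serve as tractable surrogates for the exact NP-hard UPD/CPD decision.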
