N-gram IDF: A Global Term Weighting Scheme Based on Information Distance

This paper first reveals the relationship between Inverse Document Frequency (IDF), a global term weighting scheme, and information distance, a universal metric defined in terms of Kolmogorov complexity. Specifically, we show theoretically that the IDF of a term equals the distance between the term and the empty string in the information-distance space, where Kolmogorov complexity is approximated using Web documents and Shannon-Fano coding. Based on this finding, we propose N-gram IDF, a theoretical extension of IDF that handles words and phrases of any length. Because it makes weights comparable across N-grams of any N, N-gram IDF lets us determine the dominant N-grams among overlapping ones and extract key terms of any length from texts without using any NLP techniques. To compute the weights of all possible N-grams efficiently, we adopt two string processing techniques: maximal substring extraction using an enhanced suffix array, and document listing using a wavelet tree. In experiments on key term extraction and Web search query segmentation, N-gram IDF was competitive with state-of-the-art methods that were designed specifically for each application and that rely on additional resources and effort. These results illustrate the potential of N-gram IDF.
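
The link between IDF and Shannon-Fano coding stated above can be illustrated with a minimal sketch. It assumes that the probability of a term is estimated as its document frequency divided by the corpus size, so that the ideal Shannon-Fano code length of the term, -log2(p), coincides with log2(|D| / df). The toy corpus and the function names `document_frequency`, `idf`, and `shannon_fano_length` are hypothetical and not taken from the paper, and the sketch covers only single-word IDF, not the N-gram IDF weighting itself.

```python
import math


def document_frequency(term, docs):
    """Count the documents whose token list contains `term`."""
    return sum(1 for doc in docs if term in doc)


def idf(term, docs):
    """Classic IDF in bits: log2(|D| / df(term))."""
    return math.log2(len(docs) / document_frequency(term, docs))


def shannon_fano_length(probability):
    """Ideal Shannon-Fano code length in bits for an event of this probability."""
    return -math.log2(probability)


# Hypothetical toy corpus standing in for Web documents (illustration only).
docs = [
    ["information", "distance", "metric"],
    ["inverse", "document", "frequency"],
    ["information", "retrieval"],
    ["kolmogorov", "complexity"],
]

# Assumption: P(term) is estimated as df(term) / |D|.
p = document_frequency("information", docs) / len(docs)
idf_bits = idf("information", docs)
code_bits = shannon_fano_length(p)

print(idf_bits, code_bits)
```

Under this assumption the two quantities are the same number by construction, which is the sense in which IDF can be read as an approximate code length of a term, and hence as its distance from the empty string.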
