Effect of word density on measuring words association

Mining associated words is not a new problem, but because of its wide range of applications it remains an important issue in Information Retrieval. Existing estimators, such as joint probability and the word association norm, do not consider the density of the words present in each window. In this paper, we incorporate word density and propose a density-based estimator to measure the association between words. Experimental results based on human judgments and on precision collected from search engines show that incorporating word density can improve the precision of the estimators. Across all window sizes considered, our estimator outperforms all the other estimators. We also observe that all of these estimators, both the existing ones and the proposed one, perform relatively better when the windows contain around five sentences. Using the Spearman rank-order correlation coefficient, we also show that our estimator returns a better-quality ranking of the associated terms.
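As an illustration of the baseline setting only, the following minimal sketch shows a window-based joint probability estimate and the Spearman rank-order correlation for untied ranks. It does not reproduce the density-based estimator, which is not defined in this abstract. The function names, the non-overlapping window scheme, and the whitespace tokenization are assumptions made for the example.

```python
from typing import List, Sequence, Set


def make_windows(sentences: Sequence[str], size: int) -> List[Set[str]]:
    """Group consecutive sentences into non-overlapping windows of `size` sentences.

    Assumption: windows do not overlap and words are split on whitespace.
    """
    if size < 1:
        raise ValueError("window size must be at least 1")
    windows: List[Set[str]] = []
    for start in range(0, len(sentences), size):
        words: Set[str] = set()
        for sentence in sentences[start:start + size]:
            words.update(sentence.lower().split())
        windows.append(words)
    return windows


def joint_probability(windows: Sequence[Set[str]], a: str, b: str) -> float:
    """Fraction of windows in which both words occur (ignores word density)."""
    if not windows:
        return 0.0
    both = sum(1 for w in windows if a in w and b in w)
    return both / len(windows)


def spearman_rho(ranks_x: Sequence[int], ranks_y: Sequence[int]) -> float:
    """Spearman rank-order correlation for two rankings without ties."""
    n = len(ranks_x)
    if n != len(ranks_y) or n < 2:
        raise ValueError("rankings must have equal length of at least 2")
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks_x, ranks_y))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical toy corpus, two sentences per window.
sentences = [
    "search engine query",
    "query expansion",
    "web crawling",
    "search engine ranking",
]
windows = make_windows(sentences, size=2)
p_search_engine = joint_probability(windows, "search", "engine")
p_query_expansion = joint_probability(windows, "query", "expansion")
```

A ranking of associated terms produced by such an estimator could then be compared with a human ranking by passing the two rank lists to `spearman_rho`; values closer to 1 indicate closer agreement.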
