Managing Word Mismatch Problems in Information Retrieval: A Topic-Based Query Expansion Approach

Word mismatch represents a fundamental information retrieval challenge that has become increasingly important as electronic document repositories (e.g., Web resources, digital libraries) grow in number and sheer volume. In general, word mismatch refers to the phenomenon in which a concept is described by different terms in user queries and in source documents. Query expansion represents a promising avenue to address such problems. Previous research predominantly approaches query expansion on the basis of global or local analysis. However, these approaches emphasize a global perspective rather than taking a topic-specific view of term associations. As a consequence, their effectiveness can be severely constrained when the document corpus spans a diverse set of topics. In this study, we propose a topic-based approach for query expansion and develop and empirically evaluate two novel methods—namely, nonfuzzy and fuzzy topic-based query expansion—to address word mismatch problems. According to our evaluation results, the proposed topic-based approach is more effective than a benchmark global analysis method, particularly when user queries consist of multiple query terms.
