Adapting pivoted document-length normalization for query size: Experiments in Chinese and English

The vector space model (VSM) is one of the most widely used information retrieval (IR) models in both academia and industry. In the Chinese ad hoc retrieval tasks, it was less effective than other retrieval models at the NTCIR-3 evaluation workshop, but comparable to them at the NTCIR-4 and NTCIR-5 workshops. It was unclear whether the weaker NTCIR-3 performance was due to inherent deficiencies of the VSM or to a less effective normalization of document length. To find out, we evaluated the VSM with various pivoted normalizations of document length on the NTCIR-3 collection. We found that, with pivoted normalization, both the retrieval effectiveness and the retrieval speed of the VSM were comparable to those of other competitive retrieval models (for example, 2-Poisson). We propose a novel adaptive scheme that automatically estimates the (near) best parameters for pivoted document-length normalization based on query size; we call the new normalization adaptive pivoted document-length normalization. This scheme achieved good retrieval effectiveness, sometimes for short (title) queries and sometimes for long queries, without manual adjustment of parameter values. We found that unique, adaptive pivoted normalization can enhance fixed pivoted normalizations on different test collections (TREC-5 and TREC-6). We also evaluated the VSM with adaptive pivoted normalization under pseudo-relevance feedback (PRF) and found that it performs similarly to competitive retrieval models (for example, 2-Poisson) with PRF. Hence, we conclude that the VSM with unique (adaptive) pivoted document-length normalization is effective for Chinese IR, and that its retrieval effectiveness is comparable to that of other competitive retrieval models, with or without PRF, on the reference test collections used in this evaluation.
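
To make the idea concrete, the following is a minimal sketch, not the scheme evaluated in the paper. It uses the commonly cited form of pivoted normalization with the pivot set to the average document length, so the normalization factor is (1 - slope) + slope * (document length / average document length). The abstract does not say how the parameters are estimated from query size, so the function `adaptive_slope`, its linear interpolation, and all of its default values are hypothetical placeholders chosen only for illustration.

```python
def pivoted_norm(doc_len, avg_doc_len, slope):
    """Pivoted length-normalization factor, with the pivot at the average document length."""
    return (1.0 - slope) + slope * (doc_len / avg_doc_len)


def adaptive_slope(query_size, short_slope=0.1, long_slope=0.3, long_query_size=20):
    """HYPOTHETICAL mapping from query size (number of query terms) to a slope.

    Linear interpolation between two placeholder slopes, clamped to
    [0, long_query_size]. None of these values comes from the paper.
    """
    clamped = min(max(query_size, 0), long_query_size)
    frac = clamped / long_query_size
    return short_slope + frac * (long_slope - short_slope)


def score(query_weights, doc_weights, doc_len, avg_doc_len):
    """Dot product of query and document term weights, divided by the pivoted factor."""
    slope = adaptive_slope(len(query_weights))
    norm = pivoted_norm(doc_len, avg_doc_len, slope)
    dot = sum(qw * doc_weights.get(term, 0.0) for term, qw in query_weights.items())
    return dot / norm


if __name__ == "__main__":
    query = {"pivoted": 1.0, "normalization": 1.0}
    short_doc = {"pivoted": 2.0, "normalization": 1.0}
    long_doc = {"pivoted": 2.0, "normalization": 1.0}
    # Same term weights, different lengths: the longer document is penalized more.
    print(score(query, short_doc, doc_len=50, avg_doc_len=100))
    print(score(query, long_doc, doc_len=400, avg_doc_len=100))
```

With a fixed slope, the same penalty curve applies to every query; the point of an adaptive variant is only that the slope passed to `pivoted_norm` is chosen per query instead of being tuned by hand.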
