Term frequency with average term occurrences for textual information retrieval

In information retrieval (IR) from text documents, the term-weighting scheme (TWS) is a key component of the matching mechanism in the vector space model. In this paper, we propose a new TWS that is based on computing the average term occurrences in documents and that uses a discriminative approach, based on the document centroid vector, to remove less significant weights from the documents. We call our approach Term Frequency With Average Term Occurrence (TF-ATO). An analysis of commonly used document collections shows that test collections are not fully judged, because achieving that is expensive and may be infeasible for large collections. A document collection is fully judged when every document in the collection acts as a relevant document to a specific query or a group of queries. The discriminative approach is a heuristic method for improving IR effectiveness and performance, and it has the advantage of not requiring prior knowledge of relevance judgements. We compare the proposed TF-ATO with the well-known TF-IDF approach and show that TF-ATO gives better effectiveness in both static and dynamic document collections. In addition, this paper investigates the impact of stop-word removal and of our discriminative approach on TF-IDF and TF-ATO. The results show that both stop-word removal and the discriminative approach have a positive effect on both term-weighting schemes. More importantly, the proposed discriminative approach improves IR effectiveness and performance without any information on the relevance judgements for the collection.
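The abstract does not state the weighting formula, so the following is only a minimal sketch of how such a scheme could look, under two labeled assumptions: (1) a document's average term occurrence is its total number of term occurrences divided by its number of distinct terms, and each term weight is the raw term frequency divided by that average; (2) the discriminative step keeps a weight only when it exceeds the corresponding component of the centroid (the mean weight of that term over all documents). The function names `tf_ato`, `centroid`, and `discriminate` are hypothetical and chosen for illustration.

```python
from collections import Counter


def tf_ato(docs):
    """Weight each term by its count divided by the document's average
    term occurrence (ASSUMED: total occurrences / distinct terms)."""
    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        if not counts:
            weights.append({})  # empty document has no weights
            continue
        ato = sum(counts.values()) / len(counts)
        weights.append({term: c / ato for term, c in counts.items()})
    return weights


def centroid(weights):
    """Mean weight of every term over all documents (absent terms count as 0)."""
    n = len(weights)
    totals = Counter()
    for doc_weights in weights:
        for term, w in doc_weights.items():
            totals[term] += w
    return {term: total / n for term, total in totals.items()}


def discriminate(weights):
    """ASSUMED rule: drop any weight that does not exceed the centroid value."""
    c = centroid(weights)
    return [
        {term: w for term, w in doc_weights.items() if w > c[term]}
        for doc_weights in weights
    ]


if __name__ == "__main__":
    docs = [["a", "a", "b"], ["a", "c"]]  # toy tokenized documents
    weighted = tf_ato(docs)
    print(weighted)
    print(discriminate(weighted))
```

Note that neither function uses relevance judgements: both work from term counts alone, which matches the property the abstract highlights for the discriminative approach.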
