Effects of central tendency measures on term weighting in textual information retrieval

Term weighting is known to have a significant effect on the retrieval of relevant documents, and many weighting methods have been proposed. The open question is which weighting method is best. This paper shows that properly aggregating the weights generated by carefully selected basic weighting methods improves retrieval of the documents relevant to the user’s needs. In particular, even simple central tendency measures such as the average, median, or mid-range, applied over an appropriate subset of basic weighting methods, yield term weights that not only outperform each basic weighting method used alone but are also more effective than recently proposed, more complicated weighting methods. By applying the proposed method to various datasets, we study the effects of normalizing the basic weights, normalizing the vector lengths, using different components in the term frequency factor, and related choices. The results reveal criteria for selecting an appropriate subset of basic weighting methods to feed to the aggregator in order to achieve higher retrieval precision.
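The aggregation idea can be illustrated with a minimal Python sketch. It is not the paper's implementation: the three basic weighting schemes (raw, logarithmic, and square-root term frequency, each multiplied by a plain logarithmic IDF), the function names, and the sample counts are all assumptions chosen for illustration. The abstract does not say which basic methods the authors selected or how they normalized them, so normalization is left out here.

```python
import math
from statistics import mean, median


def mid_range(values):
    """Mid-range: the midpoint between the smallest and largest value."""
    return (min(values) + max(values)) / 2


def basic_weights(tf, df, n_docs):
    """Return weights from three hypothetical basic schemes for one term.

    tf: term frequency in the document (must be > 0)
    df: number of documents containing the term (must be > 0)
    n_docs: total number of documents in the collection
    """
    idf = math.log(n_docs / df)
    return [
        tf * idf,                  # raw tf x idf
        (1 + math.log(tf)) * idf,  # logarithmic tf x idf
        math.sqrt(tf) * idf,       # square-root tf x idf
    ]


def aggregate(weights, measure=mean):
    """Combine basic weights into one weight with a central tendency measure."""
    return measure(weights)


# Example: a term occurring 4 times in a document, found in 10 of 1000 documents.
weights = basic_weights(tf=4, df=10, n_docs=1000)
for measure in (mean, median, mid_range):
    print(measure.__name__, aggregate(weights, measure))
```

Passing the measure as a function keeps the aggregator independent of the basic schemes, so a different subset of schemes or a different measure can be swapped in without changing the rest of the code.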
