Combinatoric Models of Information Retrieval Ranking Methods and Performance Measures for Weakly-Ordered Document Collections

LEWIS CHURCH: Combinatoric Models of Information Retrieval Ranking Methods and Performance Measures for Weakly-Ordered Document Collections (Under the direction of Robert M. Losee)

This dissertation answers three research questions: (1) What are the characteristics of a combinatoric measure, based on the Average Search Length (ASL), that performs the same as a probabilistic version of the ASL? (2) Does the combinatoric ASL measure produce the same performance result as the one obtained by ranking a collection of documents and calculating the ASL empirically? (3) When do the ASL and any one of the Expected Search Length, MZ-based E, or Mean Reciprocal Rank measures both imply that one document ranking is better than another?

Concepts and techniques from enumerative combinatorics and other branches of mathematics were used to develop combinatoric models and equations for several information retrieval ranking methods and performance measures. These models and equations were validated by empirical, statistical, and simulation means.

The document cut-off variants of the performance measure equations developed in this dissertation can be used for performance prediction and to help study any vector V of ranked documents, at arbitrary document cut-off points, provided that (1) relevance is binary and (2) the following information can be determined from the ranked output: the document equivalence classes and their relative sequence, the number of documents in each equivalence class, and the number of relevant documents that each class contains. The performance measure equations yielded correct values for both strongly- and weakly-ordered document collections.
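The kind of calculation the abstract describes can be illustrated with a minimal sketch. It is not the dissertation's own set of equations. It assumes that the ASL is taken as the mean rank position of the relevant documents, and that documents within an equivalence class (a tie) are ordered uniformly at random, so each document in a class of size n that follows `offset` earlier documents has expected position offset + (n + 1)/2. The function names and the `(n_docs, n_relevant)` input layout are hypothetical, chosen only for this illustration.

```python
from fractions import Fraction
from itertools import permutations, product


def asl_from_classes(classes):
    """Closed-form ASL for a weakly ordered ranking (illustrative assumption).

    classes: list of (n_docs, n_relevant) pairs, one per equivalence class,
    listed in ranked order. Relevance is binary.
    """
    total_relevant = sum(r for _, r in classes)
    if total_relevant == 0:
        raise ValueError("ASL is undefined when no document is relevant")
    offset = 0                 # number of documents ranked ahead of this class
    weighted = Fraction(0)     # sum of expected positions of relevant documents
    for n, r in classes:
        if not 0 <= r <= n:
            raise ValueError("each class needs 0 <= n_relevant <= n_docs")
        # Under uniform random tie-breaking, every document in the class
        # has expected position offset + (n + 1) / 2.
        weighted += r * (offset + Fraction(n + 1, 2))
        offset += n
    return weighted / total_relevant


def asl_by_enumeration(classes):
    """Brute-force check: average the ASL over every ordering within the ties.

    Classes with no relevant documents are allowed, but at least one class
    must contain a relevant document. Only practical for very small classes.
    """
    per_class = [list(permutations([1] * r + [0] * (n - r))) for n, r in classes]
    total = Fraction(0)
    count = 0
    for combo in product(*per_class):
        ranking = [flag for block in combo for flag in block]
        positions = [i + 1 for i, flag in enumerate(ranking) if flag]
        total += Fraction(sum(positions), len(positions))
        count += 1
    return total / count


if __name__ == "__main__":
    weak = [(2, 1), (3, 2)]            # two tied classes
    strong = [(1, 1), (1, 0), (1, 1)]  # every class holds a single document
    print(asl_from_classes(weak), asl_by_enumeration(weak))
    print(asl_from_classes(strong), asl_by_enumeration(strong))
```

A strongly ordered ranking is the special case in which every class contains exactly one document, so the same function covers both cases. The enumeration routine plays the role of an empirical check on the closed form for small inputs.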
