Lexical measurements for information retrieval: a quantum approach

The problem of determining whether a document is about a loosely defined topic is at the core of text Information Retrieval (IR). An automatic IR system should be able to determine whether a document is likely to convey information on a topic. In most cases, it has to do so based solely on measurements of the use of terms in the document (lexical measurements). In this work a novel scheme for measuring and representing lexical information from text documents is proposed. This scheme is inspired by the concept of ideal measurement as described by Quantum Theory (QT). We apply it to Information Retrieval through formal analogies between text processing and physical measurements. The main contribution of this work is the development of a complete mathematical scheme to describe lexical measurements. These measurements encompass current ways of representing text, as well as completely new representation schemes. For example, this quantum-like representation includes logical features such as non-Boolean behaviour, which has been suggested to be a fundamental issue when extracting information from natural language text. This scheme also provides a formal unification of logical, probabilistic and geometric approaches to the IR problem. From the concepts and structures in this scheme of lexical measurement, and using the principle of uncertain conditional, an “Aboutness Witness” is defined as a transformation that can detect documents that are relevant to a query. Mathematical properties of the Aboutness Witness are described in detail and related to other concepts from Information Retrieval. A practical application of this concept is also developed for ad hoc retrieval tasks, and is evaluated with standard collections.
Even though the model instantiated here does not lead to substantial performance improvements, it is shown how it can be extended and improved, as well as how it can generate a whole range of radically new models and methodologies. This work opens a number of research possibilities, both theoretical and experimental, such as new representations for documents in Hilbert spaces or other forms, methodologies for term weighting to be used either within the proposed framework or independently, ways to extend existing methodologies, and a new range of operator-based methods for several tasks in IR.
