Reflective Random Indexing and indirect inference: A scalable method for discovery of implicit connections

The discovery of implicit connections between terms that do not occur together in any scientific document underlies the model of literature-based knowledge discovery first proposed by Swanson. Corpus-derived statistical models of semantic distance, such as Latent Semantic Analysis (LSA), have previously been evaluated as methods for discovering such implicit connections. However, LSA depends on a computationally demanding method of dimension reduction to obtain meaningful indirect inference, which limits its ability to scale to large text corpora. In this paper, we evaluate the ability of Random Indexing (RI), a scalable distributional model of word associations, to draw meaningful implicit relationships between terms in general and biomedical language. Proponents of this method have achieved performance comparable to that of LSA on several cognitive tasks while using a simpler and less computationally demanding method of dimension reduction. We demonstrate that the original implementation of RI is ineffective at inferring meaningful indirect connections, and we evaluate Reflective Random Indexing (RRI), an iterative variant of the method that is better able to perform indirect inference. RRI is shown to yield more clearly related indirect connections and to outperform existing RI implementations in predicting future direct co-occurrence in the MEDLINE corpus.
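
The contrast between RI and its reflective variant can be illustrated with a toy sketch. The abstract does not specify the training procedure, so everything below is an assumption made for illustration: the two-document corpus, the dimensionality, the sparse ternary index vectors, the unweighted sums, and the helper names (`index_vector`, `train_terms`, `train_docs`). It shows one plausible reading of "iterative": term vectors built from random document vectors are used to rebuild the document vectors, which are then used to retrain the term vectors.

```python
import math
import random

DIM = 1000     # dimensionality of the reduced space (assumed value)
NONZERO = 10   # number of +1/-1 entries per index vector (assumed value)


def index_vector(rng):
    """Sparse ternary random vector: a few +1/-1 entries, zeros elsewhere."""
    vec = [0.0] * DIM
    for pos in rng.sample(range(DIM), NONZERO):
        vec[pos] = rng.choice((-1.0, 1.0))
    return vec


def normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def cosine(u, v):
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))


def add_all(vectors):
    """Sum the given vectors after normalizing each to unit length."""
    total = [0.0] * DIM
    for vec in vectors:
        for i, x in enumerate(normalize(vec)):
            total[i] += x
    return total


def train_terms(docs, doc_vectors):
    """Term vector = sum of the vectors of the documents containing the term."""
    terms = sorted({t for doc in docs for t in doc})
    return {
        t: add_all([doc_vectors[i] for i, doc in enumerate(docs) if t in doc])
        for t in terms
    }


def train_docs(docs, term_vectors):
    """Document vector = sum of the vectors of the terms it contains."""
    return [add_all([term_vectors[t] for t in doc]) for doc in docs]


# Toy corpus (hypothetical): the two outer terms never co-occur directly,
# but each co-occurs with the same bridging term.
docs = [["raynaud", "blood_viscosity"], ["fish_oil", "blood_viscosity"]]

rng = random.Random(0)
doc_index = [index_vector(rng) for _ in docs]

# Basic RI: term vectors are sums of random document index vectors.
ri_terms = train_terms(docs, doc_index)

# One reflective cycle: rebuild document vectors from the learned term
# vectors, then retrain the term vectors from those document vectors.
reflected_docs = train_docs(docs, ri_terms)
rri_terms = train_terms(docs, reflected_docs)

ri_sim = cosine(ri_terms["raynaud"], ri_terms["fish_oil"])
rri_sim = cosine(rri_terms["raynaud"], rri_terms["fish_oil"])
print("RI  similarity:", round(ri_sim, 3))
print("RRI similarity:", round(rri_sim, 3))
```

In this sketch the basic RI vectors for the two non-co-occurring terms are independent random vectors and therefore close to orthogonal, whereas after one reflective cycle both vectors contain a component contributed by the shared bridging term, so their cosine similarity rises. This is only a schematic of indirect inference; it says nothing about the term weighting, corpus, or parameters used in the evaluation reported above.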
