The Efficiency of Corpus-based Distributional Models for Literature-based Discovery on Large Data Sets

This paper evaluates the efficiency of several popular corpus-based distributional models when performing discovery on very large document sets, including online collections. Literature-based discovery is the process of identifying previously unknown connections in text, often published literature, that could lead to the development of new techniques or technologies. It has attracted growing research interest since Swanson's serendipitous discovery in 1986 of the therapeutic effects of fish oil on Raynaud's disease. The successful application of distributional models to automating the identification of the indirect associations underpinning literature-based discovery has been demonstrated extensively in the medical domain. However, we wish to investigate the computational complexity of distributional models for literature-based discovery on much larger document collections, as they may provide computationally tractable solutions to tasks such as predicting future disruptive innovations. In this paper we perform a computational complexity analysis of four successful corpus-based distributional models to evaluate their suitability for such tasks. Our results indicate that corpus-based distributional models that store their representations in a fixed number of dimensions provide superior efficiency on literature-based discovery tasks.
