Neural embedding-based indices for semantic search

Abstract: Traditional information retrieval techniques, which primarily rely on keyword-based linking of the query and document spaces, face challenges such as the vocabulary mismatch problem, where documents relevant to a given query might not be retrieved simply because they use different terminology to describe the same concepts. Semantic search techniques aim to address these limitations of keyword-based retrieval models by incorporating semantic information from standard knowledge bases such as Freebase and DBpedia. The literature has already shown that while considering semantic information alone might not improve retrieval performance over keyword-based search, it enables the retrieval of relevant documents that keyword-based methods cannot retrieve. Building indices that store and provide access to semantic information during the retrieval process is therefore important. While the process for building and querying keyword-based indices is quite well understood, the incorporation of semantic information within search indices is still an open challenge. Existing work has proposed either building one unified index encompassing both textual and semantic information or building separate yet integrated indices for each information type, but these approaches face limitations such as increased query processing time. In this paper, we propose to use neural embedding-based representations of terms, semantic entities, semantic types and documents within the same embedding space to facilitate the development of a unified search index that consists of these four information types. We perform experiments on standard and widely used document collections, including ClueWeb09-B and Robust04, to evaluate our proposed indexing strategy from both effectiveness and efficiency perspectives.
Based on our experiments, we find that when neural embeddings are used to build inverted indices, thereby relaxing the requirement that the posting list key be explicitly observed in the indexed document: (a) retrieval efficiency increases compared to a standard inverted index, with reduced index size and query processing time, and (b) while retrieval efficiency, which is the main objective of an efficient indexing mechanism, improves with our proposed method, retrieval effectiveness remains competitive with the baseline in terms of retrieving a reasonable number of relevant documents from the indexed corpus.
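To make the idea of an embedding-based inverted index concrete, the following is a minimal sketch and not the implementation evaluated in the paper. It assumes that terms, entities, types and documents already have vectors in one shared space (the 2-dimensional vectors below are invented for illustration), and that each posting list holds the `top_k` documents most similar to its key by cosine similarity. The function names, the `entity:` and `type:` key prefixes, and the `top_k` cutoff are all hypothetical choices; the abstract does not specify how posting lists are populated.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_embedding_index(key_vectors, doc_vectors, top_k=2):
    """Map each key (term, entity or type) to its top_k most similar documents.

    A document can appear in a key's posting list without containing
    that key, because membership is decided by vector similarity.
    """
    index = {}
    for key, kvec in key_vectors.items():
        scored = [(cosine(kvec, dvec), doc_id) for doc_id, dvec in doc_vectors.items()]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first
        index[key] = [(doc_id, score) for score, doc_id in scored[:top_k]]
    return index


def search(index, query_keys):
    """Sum posting-list scores over all query keys and rank documents."""
    totals = defaultdict(float)
    for key in query_keys:
        for doc_id, score in index.get(key, []):
            totals[doc_id] += score
    return sorted(totals.items(), key=lambda pair: (-pair[1], pair[0]))


# Toy vectors (assumed, not learned): keys of three kinds plus documents.
key_vectors = {
    "car": (1.0, 0.1),                # term
    "entity:Automobile": (0.9, 0.2),  # semantic entity
    "type:Vehicle": (0.8, 0.3),       # semantic type
    "banana": (0.1, 1.0),             # unrelated term
}
doc_vectors = {
    "d1": (0.95, 0.15),  # imagine it says "automobile" but never "car"
    "d2": (0.2, 0.9),
    "d3": (0.7, 0.5),
}

index = build_embedding_index(key_vectors, doc_vectors, top_k=2)
results = search(index, ["car", "type:Vehicle"])
print(results)
```

In this sketch the posting list for `car` ranks `d1` first even though `d1` is imagined never to contain that word, which is the relaxation the abstract describes. Truncating each list to `top_k` entries is one plausible way such an index could stay small, but it is only an assumption here.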
