Efficient indexing for semantic search

Abstract

The increasing performance and widespread use of automated semantic annotation and entity linking platforms have made it feasible to use semantic information in information retrieval. While keyword-based retrieval techniques have shown impressive performance, adding semantic information can improve retrieval by allowing for more accurate sense disambiguation, intent determination, and instance identification, among other benefits. Researchers have already explored integrating semantic information into practical search engines using techniques such as graph databases, hybrid indices, and adapted inverted indices. One challenge in efficiently designing a search engine that considers semantic information is that it must index information beyond what is traditionally stored in inverted indices, including entity mentions and type relationships. The objective of this paper is to investigate how different data structure types can be adopted to integrate three types of information: keywords, entities, and types. We systematically compare the performance of the different data structures in scenarios where (i) the same data structure type is adopted for all three types of information, and (ii) different data structure types are combined for storing and retrieving the three information types. We report our findings in terms of the performance of query processing tasks such as Boolean and ranked intersection over the different indices, and discuss which index type is appropriate under different conditions for semantic search.
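
As a rough illustration of the setting described above, the following minimal Python sketch keeps one plain inverted index per information type (keywords, entities, types) and answers a Boolean AND query by intersecting sorted posting lists. It corresponds only to scenario (i) with an ordinary inverted index and is not the paper's implementation; the class `SemanticIndex`, the helpers `intersect` and `boolean_and`, and the sample documents are all hypothetical names and data introduced for this example.

```python
from collections import defaultdict


class SemanticIndex:
    """Toy index: one inverted index per information type (hypothetical design)."""

    KINDS = ("keyword", "entity", "type")

    def __init__(self):
        # kind -> term -> sorted list of document ids
        self.postings = {kind: defaultdict(list) for kind in self.KINDS}

    def add(self, doc_id, keywords=(), entities=(), types=()):
        # Assumption: documents are added in increasing doc_id order,
        # so every posting list stays sorted without extra work.
        for kind, items in zip(self.KINDS, (keywords, entities, types)):
            for item in set(items):
                self.postings[kind][item].append(doc_id)

    def lookup(self, kind, term):
        # .get avoids creating empty posting lists for unseen terms.
        return self.postings[kind].get(term, [])


def intersect(a, b):
    """Merge-style intersection of two sorted posting lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def boolean_and(index, query):
    """query is a list of (kind, term) pairs; returns ids matching all of them."""
    lists = sorted((index.lookup(kind, term) for kind, term in query), key=len)
    if not lists:
        return []
    result = lists[0]  # start from the shortest list to keep intermediates small
    for postings in lists[1:]:
        result = intersect(result, postings)
    return result


index = SemanticIndex()
index.add(1, keywords=["jaguar", "speed"], entities=["Jaguar_Cars"], types=["Company"])
index.add(2, keywords=["jaguar", "habitat"], entities=["Jaguar_(animal)"], types=["Animal"])
index.add(3, keywords=["jaguar", "engine"], entities=["Jaguar_Cars"], types=["Company"])

# Keyword "jaguar" restricted to documents mentioning something of type Company.
company_hits = boolean_and(index, [("keyword", "jaguar"), ("type", "Company")])
print(company_hits)
```

In this sketch the type constraint disambiguates the keyword "jaguar" by filtering out the document about the animal. Swapping the per-type dictionaries for other structures (for example a compressed or tree-based index for one of the three types) would give a toy version of scenario (ii).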
