pest: Fast approximate keyword search in semantic data using eigenvector-based term propagation

We present pest, a novel approach to approximate querying of graph-structured data such as RDF that exploits the structure of the data to propagate term weights between related data items. We focus on data where the application semantics determine what constitutes a meaningful answer, e.g., pages in wikis, persons in social networks, or papers in a research network such as Mendeley. The pest matrix generalizes the Google matrix used in PageRank with a term-weight-dependent leap and accommodates different levels of (semantic) closeness for different relations in the data, e.g., friend vs. co-worker in a social network. Its eigenvectors represent the distribution of a term after propagation. Together, the eigenvectors for all terms form a (vector space) index that takes the structure of the data into account and can be used with standard document retrieval techniques. In extensive experiments, including a user study on a real-life wiki, we show that pest improves ranking quality over a range of existing ranking approaches while achieving query performance comparable to that of a plain vector space index.
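
To make the idea of term propagation concrete, the following is a minimal sketch in the spirit of personalized PageRank, not the pest matrix as defined in the paper. It assumes a random walk that follows relation-weighted edges with probability `alpha` and otherwise leaps to a node with probability proportional to the weight of one term at that node. The function name `propagate_term`, the value of `alpha`, the closeness weights (0.8 for friend, 0.4 for co-worker), and the toy graph are all illustrative assumptions.

```python
def propagate_term(edges, term_weights, alpha=0.7, iters=100, tol=1e-12):
    """Approximate the dominant eigenvector of a PageRank-style chain
    whose leap is biased by the weights of a single term.

    edges:        node -> {neighbor: relation closeness} (hypothetical values)
    term_weights: node -> non-negative weight of the term at that node
    alpha:        probability of following an edge instead of leaping (assumed)
    """
    nodes = sorted(term_weights)
    total = sum(term_weights.values())
    if total == 0:
        raise ValueError("term does not occur in any node")
    # Leap distribution: proportional to the term weight of each node.
    leap = {n: term_weights[n] / total for n in nodes}
    score = dict(leap)
    for _ in range(iters):
        nxt = {n: (1 - alpha) * leap[n] for n in nodes}
        for src in nodes:
            out = edges.get(src, {})
            out_sum = sum(out.values())
            if out_sum == 0:
                # Node without outgoing edges: redistribute along the leap.
                for n in nodes:
                    nxt[n] += alpha * score[src] * leap[n]
            else:
                # Closer relations receive a larger share of the score.
                for dst, w in out.items():
                    nxt[dst] += alpha * score[src] * w / out_sum
        delta = sum(abs(nxt[n] - score[n]) for n in nodes)
        score = nxt
        if delta < tol:
            break
    return score


# Toy social network: alice is a friend of bob and a co-worker of carol.
edges = {
    "alice": {"bob": 0.8, "carol": 0.4},
    "bob": {"alice": 0.8},
    "carol": {"alice": 0.4},
}
# The term occurs only in bob's data item.
term_weights = {"alice": 0.0, "bob": 1.0, "carol": 0.0}
scores = propagate_term(edges, term_weights)
```

In this toy graph the term occurs only at `bob`, yet `alice` and `carol` end up with non-zero scores because weight flows along the relations. Computing one such vector per term and stacking the vectors yields a structure-aware index of the kind the abstract describes, under the simplifying assumptions stated above.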
