Entity-Based Keyword Search in Web Documents

In document search, documents are typically treated as a flat list of keywords. To address the problem of syntactic interoperability, i.e., the use of different keywords to refer to the same real-world entity, entity linkage has been used to replace keywords in the text with a unique identifier of the entity to which they refer. Yet a flat list of entities fails to capture the actual relationships that exist among the entities, information that is significant for more effective document search. In this work we propose to go one step beyond entity linkage in text and model documents as a set of structures that describe the relationships among the entities mentioned in the text. We show that this kind of representation significantly improves the effectiveness of document search. We describe the implementation of this idea in detail and present an extensive set of experimental results that support our claim.
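The difference between a flat list of entities and a set of relationship structures can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the alias table standing in for entity linkage, the entity identifiers, the triple-shaped `Relation` structure, and the subset-based matching are hypothetical and are not taken from the paper, whose actual structures and ranking are not described in this abstract.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    """One relationship between two linked entities (hypothetical structure)."""
    subject: str
    predicate: str
    obj: str


# Hypothetical alias table standing in for a real entity-linkage step.
ALIASES = {
    "acme": "E:Acme_Corp",
    "acme corp": "E:Acme_Corp",
    "globex": "E:Globex",
    "nyc": "E:New_York_City",
    "big apple": "E:New_York_City",
}


def link(mention: str) -> str:
    """Map a surface mention to an entity identifier; unknown mentions pass through."""
    key = mention.lower()
    return ALIASES.get(key, key)


def to_structures(raw_triples):
    """Turn (mention, predicate, mention) triples into a set of linked relations."""
    return {Relation(link(s), p, link(o)) for s, p, o in raw_triples}


def entities_of(structures):
    """Flatten the relations into the plain set of entities they mention."""
    return {e for r in structures for e in (r.subject, r.obj)}


def entity_match(query, doc):
    """Flat-entity view: the document mentions every entity of the query."""
    return entities_of(query) <= entities_of(doc)


def structure_match(query, doc):
    """Structured view: the document contains every relationship of the query."""
    return query <= doc


# Two toy documents that mention the same entities with opposite relationships.
doc_a = to_structures([("Acme", "acquired", "Globex"), ("Acme", "based_in", "NYC")])
doc_b = to_structures([("Globex", "acquired", "Acme Corp"), ("Globex", "based_in", "Big Apple")])
query = to_structures([("acme corp", "acquired", "globex")])

for name, doc in (("doc_a", doc_a), ("doc_b", doc_b)):
    print(name, "entity match:", entity_match(query, doc),
          "| structure match:", structure_match(query, doc))
```

In this toy setting both documents satisfy the flat entity match, since linkage maps the different surface forms to the same identifiers, but only `doc_a` contains the relationship the query asks for. Extracting the relationships from raw text is assumed to happen upstream and is not shown.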
