Toward entity-aware search

As the Web has evolved into a data-rich repository, current search engines, with their standard “page view,” are becoming increasingly inadequate for a wide range of query tasks. While we often search for various data “entities” (e.g., a phone number, a paper PDF, a date), today’s engines only take us indirectly to pages. In this Ph.D. study, we focus on a novel type of Web search that is aware of data entities inside pages, a significant departure from traditional document retrieval. We study the essential aspects of supporting entity-aware Web search. To begin with, we tackle the core challenge of ranking entities by distilling its underlying conceptual model, the Impression Model, and developing a probabilistic ranking framework, EntityRank, that seamlessly integrates both local and global information in ranking. We also report a prototype system built to show the initial promise of the proposal. Then, we distill and abstract the essential computation requirements of entity search. From the dual views of reasoning, entity as input and entity as output, we propose a dual-inversion framework, with two indexing and partition schemes, towards efficient and scalable query processing. Further, to recognize more entity instances, we study the problem of entity synonym discovery through mining query log data. The results obtained so far show clear promise for entity-aware search in its usefulness, effectiveness, efficiency, and scalability.
