Answering Imprecise Structured Search Queries

Humans are increasingly becoming the primary consumers of structured data. As the volume and heterogeneity of data produced in the world increase, the existing paradigm of using an application layer to query and search for information in data is becoming infeasible. The human end-user is overwhelmed by a barrage of diverse query and data models. Because users are unfamiliar with the data sources, the search queries they issue are typically imprecise. To solve this problem, this dissertation introduces the notion of a "queried unit", or qunit, which is the semantic unit of information returned in response to a user's search query. In a qunits-based system, the user arrives with an information need and is guided to the qunit that is an appropriate response to that need. The qunits-based paradigm aids the user by systematically shrinking both the query and result spaces. On one end, the query space is reduced by enriching the user's imprecise information need: the system elicits information from the user during query input by providing schema and data suggestions. On the other end, the result space is reduced by modeling the structured data as a collection of qunits, using qunit derivation methods that draw on various sources of information such as query logs. This dissertation describes the design and implementation of an autocompletion-style system that performs both query and result space reduction by interacting with the user in real time, providing suggestions and pruning candidate qunit results. It enables the user to search through databases without any knowledge of the data, the schema, or the query language.
