Clustering-based Approximate Answering of Query Result in Large and Distributed Databases

Database systems are increasingly used for interactive and exploratory data retrieval. In such retrievals, users' queries often return too many answers, so users waste significant time and effort sifting and sorting through these answers to find the relevant ones. In this thesis, we first propose an efficient and effective algorithm, called the Explore-Select-Rearrange Algorithm (ESRA) and based on the SAINTETIQ model, that quickly provides users with hierarchical clustering schemas of their query results. SAINTETIQ is a domain knowledge-based approach that provides multi-resolution summaries of structured data stored in a database. Each node (or summary) of the hierarchy provided by ESRA describes a subset of the result set in a user-friendly form based on domain knowledge. The user then navigates through this hierarchy in a top-down fashion, exploring the summaries of interest while ignoring the rest. Experimental results show that the ESRA algorithm is efficient and provides well-formed (tight and clearly separated) and well-organized clusters of query results. The ESRA algorithm assumes that the summary hierarchy of the queried data has already been built using SAINTETIQ and is available as input. However, SAINTETIQ requires full access to the data to be summarized. This requirement severely limits the applicability of the ESRA algorithm in a distributed environment, where data is spread across many sites and transmitting it to a central site is not feasible or even desirable. The second contribution of this thesis is therefore a solution for summarizing distributed data without a prior “unification” of the data sources. We assume that the sources maintain their own summary hierarchies (local models), and we propose new algorithms for merging them into a single final one (global model). An experimental study shows that our merging algorithms produce high-quality clustering schemas of the entire distributed data and are very efficient in terms of computational time.
