TOWARD THE SCALABLE INTEGRATION OF INTERNET INFORMATION SOURCES

As the number of databases accessible on the Web grows, the ability to execute queries spanning multiple remote, heterogeneous databases is becoming increasingly important. Two challenges in providing such a capability are (1) discovering the semantic correspondences between schema and data elements across autonomous, heterogeneous information sources, and (2) developing query processing algorithms that work over data arriving in a stream from a remote source rather than data residing on a local disk. To address the first problem, we introduced an automatic schema matching algorithm called "uninterpreted matching." For the second problem, I will present a new optimization framework for continuous queries over unbounded streams, built on a unit-time-basis cost model.
