Querying web metadata: Native score management and text support in databases

In this article, we discuss the issues involved in adding a native score management system to object-relational databases, to be used in querying Web metadata (metadata that describes the semantic content of Web resources). The Web metadata model is based on topics (representing entities), relationships among topics (called metalinks), and importance scores (sideway values) of topics and metalinks. We extend database relations with scoring functions and importance scores. We add score-management clauses with well-defined semantics to SQL, and propose the sideway-value algebra (SVA) to evaluate the extended SQL queries. The SQL extensions and the SVA are illustrated through two Web resources, namely, the DBLP Bibliography and the SIGMOD Anthology.

The SQL extensions include clauses for propagating input tuple importance scores to output tuples during query processing, clauses that specify query stopping conditions, threshold predicates (a type of approximate similarity predicate for text comparisons), and user-defined-function-based predicates. The propagated importance scores are then used to rank the output tuples and return a small number of them. The query stopping conditions are propagated to SVA operators during query processing. We show that our SQL extensions are well-defined, meaning that, given a database and a query Q, the output tuples of Q and their importance scores stay the same under any query processing scheme.

To process the SQL extensions, we discuss two sideway-value algebra operators, namely, the sideway-value algebra join and topic closure, give their implementation algorithms, and report on their experimental evaluation.
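The idea of propagating importance scores through a join and then keeping only a few top-ranked output tuples can be illustrated with a minimal Python sketch. It is not the article's SVA join algorithm or its SQL syntax: the `Scored` tuple type, the `scored_join` function, the choice of multiplication as the scoring function, and the "return the k best" stopping condition are all assumptions made for illustration, and the sample rows are invented.

```python
import heapq
from typing import Callable, Iterable, List, NamedTuple, Tuple


class Scored(NamedTuple):
    """A tuple carrying an importance score (hypothetical layout)."""
    key: str      # join attribute, e.g. a topic identifier
    value: str    # payload, e.g. a topic name or a linked resource
    score: float  # importance score in [0, 1]


def scored_join(
    left: Iterable[Scored],
    right: Iterable[Scored],
    combine: Callable[[float, float], float] = lambda a, b: a * b,
    k: int = 2,
) -> List[Tuple[str, str, str, float]]:
    """Equi-join on `key`, propagate scores with `combine`, keep the k best.

    This naive version materializes the whole join before ranking; it only
    shows what the output should contain, not how to compute it efficiently.
    """
    # Build a hash index on the right input.
    index = {}
    for r in right:
        index.setdefault(r.key, []).append(r)

    # Each output tuple gets a score derived from its two input tuples.
    joined = [
        (l.key, l.value, r.value, combine(l.score, r.score))
        for l in left
        for r in index.get(l.key, [])
    ]

    # Assumed stopping condition: return only the k highest-scored tuples.
    return heapq.nlargest(k, joined, key=lambda t: t[3])


# Invented sample data: two topics and three metalink-like rows.
topics = [Scored("t1", "DBLP", 0.9), Scored("t2", "SIGMOD Anthology", 0.5)]
links = [
    Scored("t1", "paperA", 0.8),
    Scored("t1", "paperB", 0.4),
    Scored("t2", "paperC", 1.0),
]
result = scored_join(topics, links, k=2)
for row in result:
    print(row)
```

Because each output score depends only on the scores of the two joined input tuples, any evaluation order yields the same scored tuples, which is the flavor of the well-definedness property stated above. Ties at the cutoff would need an explicit tie-breaking rule to keep the returned set fixed.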
