Towards a Statistically Semantic Web
Gerhard Weikum et al.
Conceptual Modeling – ER 2004, LNCS 3288, pp. 3–17, Springer-Verlag, 2004

Abstract. The envisioned Semantic Web aims to provide richly annotated and explicitly structured Web pages in XML, RDF, or description logics, based upon underlying ontologies and thesauri. Ideally, this should enable a wealth of query processing and semantic reasoning capabilities using XQuery and logical inference engines. However, we believe that the diversity and uncertainty of terminologies and schema-like annotations will make precise querying on a Web scale extremely elusive if not hopeless, and the same argument holds for large-scale dynamic federations of Deep Web sources. Therefore, ontology-based reasoning and querying need to be enhanced by statistical means, leading to relevance-ranked lists as query results. This paper presents steps towards such a “statistically semantic” Web and outlines technical challenges. We discuss how statistically quantified ontological relations can be exploited in XML retrieval, how statistics can help in making Web-scale search efficient, and how statistical information extracted from users’ query logs and click streams can be leveraged for better search result ranking. We believe these are decisive issues for improving the quality of next-generation search engines for intranets, digital libraries, and the Web, and they are crucial also for peer-to-peer collaborative Web search.

1 The Challenge of “Semantic” Information Search

The age of information explosion poses tremendous challenges regarding the intelligent organization of data and the effective search of relevant information in business and industry (e.g., market analyses, logistic chains), society (e.g., health care), and virtually all sciences that are more and more data-driven (e.g., gene expression data analyses and other areas of bioinformatics). The problems arise in intranets of large organizations, in federations of digital libraries and other information sources, and in the most humongous and amorphous of all data collections, the World Wide Web and its underlying numerous databases that reside behind portal pages. The Web bears the potential of being the world’s largest encyclopedia and knowledge base, but we are very far from being able to exploit this potential.

Database-system and search-engine technologies provide support for organizing and querying information; but all too often they require excessive manual preprocessing, such as designing a schema, cleaning raw data, or manually classifying documents into a taxonomy for a good Web portal, or manual postprocessing such as browsing through large result lists with too many irrelevant items or surfing in the vicinity of promising but not truly satisfactory approximate matches.

The following are a few example queries where current Web and intranet search engines fall short, or where data integration techniques and the use of SQL-like querying face insurmountable difficulties even on structured, but federated and highly heterogeneous databases:

Q1: Which professors from Saarbruecken in Germany teach information retrieval and do research on XML?

Q2: Which gene expression data from Barrett tissue in the esophagus exhibit high levels of gene A01g? And are there any metabolic models for acid reflux that could be related to the gene expression data?

Q3: What are the most important research results on large deviation theory?
Q4: Which drama has a scene in which a woman makes a prophecy to a Scottish nobleman that he will become king?

Q5: Who was the French woman that I met in a program committee meeting where Paolo Atzeni was the PC chair?

Q6: Are there any published theorems that are equivalent to or subsume my latest mathematical conjecture?

Why are these queries difficult (too difficult for Google-style keyword search unless one invests a huge amount of time to manually explore large result lists with mostly irrelevant and some mediocre matches)?

For Q1 no single Web site is a good match; rather one has to look at several pages together within some bounded context: the homepage of a professor with his address, a page with course information linked to by the homepage, and a research project page on semistructured data management that is a few hyperlinks away from the homepage.

Q2 would be easy if asked against a single bioinformatics database with a familiar query interface, but searching for the answer across the entire Web and Deep Web requires discovering all relevant data sources and unifying their query and result representations on the fly.

Q3 is not a query in the traditional sense, but requires gathering a substantial number of key resources with valuable information on the given topic; it would be best served by looking up a well-maintained Yahoo-style topic directory, but highly specific expert topics are not covered there.

Q4 cannot be easily answered because a good match does not necessarily contain the keywords “woman”, “prophecy”, “nobleman”, etc., but may rather say something like “Third witch: All hail, Macbeth, thou shalt be king hereafter!”, and the same document may contain the text “All hail, Macbeth! hail to thee, thane of Glamis!”. So this query requires some background knowledge to recognize that a witch is a woman, that “shalt be” refers to a prophecy, and that thane is a title for a Scottish nobleman.

Q5 is similar to Q4 in the sense that it also requires background knowledge, but it is more difficult because it additionally requires putting together various information fragments: conferences on whose PCs I served, found in my email archive; PC members of conferences, found on Web pages; and detailed information found on researchers’ homepages. And after having identified a candidate like Sophie Cluet from Paris, one needs to infer that Sophie is a typical female first name and that Paris most likely denotes the capital of France rather than the 500-inhabitant town of Paris, Texas, that became known through a movie.

Q6, finally, is what some researchers call “AI-complete”; it will remain a challenge for a long time.

For a human expert who is familiar with the corresponding topics, none of these queries is really difficult. With unlimited time, the expert could easily identify relevant pages and combine semantically related information units into query answers. The challenge is to automate or simulate these intellectual capabilities and implement them so that they can handle billions of Web pages and petabytes of data in structured (but schematically highly diverse) Deep-Web databases.
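To make the Q4 discussion above concrete: recognizing that a witch is a woman is exactly a hyponymy edge in a thesaurus such as WordNet (discussed in the next section). The following sketch is a minimal illustration of thesaurus-based query-term expansion, not a method proposed in this paper; it assumes Python with NLTK and its WordNet corpus installed, and the function name and depth bound are our own illustrative choices.

```python
# Minimal sketch: query-term expansion via WordNet hyponyms (narrower concepts).
# Assumes NLTK with the WordNet corpus installed (nltk.download("wordnet")).
from nltk.corpus import wordnet as wn

def hyponym_expansions(term: str, max_depth: int = 2) -> set[str]:
    """Collect lemma names of hyponyms of `term`, e.g. 'woman' -> {'witch',
    'lady', ...}, following the hyponymy relation up to `max_depth` levels."""
    expansions = set()
    for synset in wn.synsets(term, pos=wn.NOUN):
        # closure() walks the hyponymy relation transitively; depth bounds it.
        for hypo in synset.closure(lambda s: s.hyponyms(), depth=max_depth):
            expansions.update(l.replace("_", " ") for l in hypo.lemma_names())
    return expansions

# A Boolean expansion like this treats all ~50 immediate hyponyms of "woman"
# as equally good substitutes -- precisely the weakness that motivates the
# statistical weighting argued for in the next section.
print(sorted(hyponym_expansions("woman"))[:10])
```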
2 The Need for Statistics

What if all Web pages and all Web-accessible data sources were in XML, RDF, or OWL (a description-logic representation) as envisioned in the Semantic Web research direction [25, 1]? Would this enable a search engine to effectively answer the challenging queries of the previous section? And would such an approach scale to billions of Web pages and be efficient enough for interactive use? Or could we even load and integrate all Web data into one gigantic database and use XQuery for searching it?

XML, RDF, and OWL offer ways of more explicitly structuring and richly annotating Web pages. When viewed as logic formulas or labeled graphs, we may think of the pages as having “semantics”, at least in terms of model theory or graph isomorphisms. In principle, this opens up a wealth of precise querying and logical inferencing opportunities. However, it is extremely unlikely that all pages will use the very same tag or predicate names when they refer to the same semantic properties and relationships. Making such an assumption would be equivalent to assuming a single global schema: this would be arbitrarily difficult to achieve in a large intranet, and it is completely hopeless for billions of Web pages given the Web’s high dynamics, extreme diversity of terminology, and uncertainty of natural language (even if used only for naming tags and predicates). There may be standards (e.g., XML schemas) for certain areas (e.g., for invoices or invoice-processing Web Services), but these will have limited scope and influence. A terminologically unified and logically consistent Semantic Web with billions of pages is hard to imagine. So reasoning about diversely annotated pages is a necessity and a challenge.

Similarly to the ample research on database schema integration and instance matching (see, e.g., [49] and the references given there), knowledge bases [50], lexicons, thesauri [24], or ontologies [58] are considered the key asset to this end. Here an ontology is understood as a collection of concepts with various semantic relationships among them; the formal representation may vary from rigorous logics to natural language. The most important relationship types are hyponymy (specialization into narrower concepts) and hypernymy (generalization into broader concepts). To the best of my knowledge, the most comprehensive publicly available ontology of this kind is the WordNet thesaurus, hand-crafted by cognitive scientists at Princeton [24]. For the concept “woman”, WordNet lists about 50 immediate hyponyms, which include concepts like “witch” and “lady” that could help to answer queries like Q4 from the previous section. However, regardless of whether one represents these hyponymy relationships in a graph-oriented form or as logical formulas, such a rigid “true-or-false” representation could never discriminate these relevant concepts from the other 48 irrelevant and largely exotic hyponyms of “woman”. In information-retrieval (IR) jargon, such an approach would be called Boolean retrieval or Boolean reasoning; and IR almost always favors ranked retrieval with some quantitative relevance assessment. In fact, by simply looking at statistical correlations of using words like “woman” and “lady” together in some text neighborhood within large corpora (e.g., the Web or large digital libraries), one can infer that these two concepts are much more strongly related than “woman” and its many exotic hyponyms.
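The kind of corpus statistics alluded to here can be made concrete in a few lines. The sketch below estimates pointwise mutual information (PMI) from document-level co-occurrence counts; PMI is just one plausible correlation measure, and the toy corpus and the document-level notion of “text neighborhood” are illustrative assumptions, since the paper does not commit to a particular estimator or window definition.

```python
# Minimal sketch: weighting an ontological relation by corpus co-occurrence.
# PMI is one common correlation measure; corpus and window are illustrative.
import math

corpus = [
    "the witch made a prophecy that the thane shall be king",
    "a woman and a lady entered the hall",
    "the lady is a woman of the court",
    "gene expression data from barrett tissue",
]

def doc_freq(term: str) -> int:
    return sum(1 for doc in corpus if term in doc.split())

def co_freq(t1: str, t2: str) -> int:
    return sum(1 for doc in corpus if t1 in doc.split() and t2 in doc.split())

def pmi(t1: str, t2: str) -> float:
    """log2( P(t1, t2) / (P(t1) * P(t2)) ), with document-level probabilities."""
    n = len(corpus)
    p1, p2, p12 = doc_freq(t1) / n, doc_freq(t2) / n, co_freq(t1, t2) / n
    if p12 == 0:
        return float("-inf")  # never co-occur: no statistical evidence
    return math.log2(p12 / (p1 * p2))

# "woman"/"lady" co-occur and score higher than an unrelated pair, turning
# Boolean hyponymy edges into statistically weighted ones.
print(pmi("woman", "lady"), pmi("woman", "tissue"))
```

On a Web-scale corpus the same counts would come from an inverted index rather than a list scan, but the estimator itself is unchanged.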

References

[1] Yoshua Bengio et al. Pattern Recognition and Neural Networks. 1995.

[2] Zoubida Kedad et al. Dealing with Semantic Heterogeneity During Data Integration. ER, 1999.

[3] Diane C. P. Smith et al. Database Abstractions: Aggregation and Generalization. 1989.

[4] N. F. Noy et al. Ontology Development 101: A Guide to Creating Your First Ontology. 2001.

[5] David Hawking et al. Query-independent evidence in home page finding. TOIS, 2003.

[6] Renate Motschnig-Pitrik. A generic framework for the modeling of contexts and its applications. 2000.

[7] Ji-Rong Wen et al. Query clustering using user logs. TOIS, 2002.

[8] Christiane Fellbaum et al. WordNet: An Electronic Lexical Database. Computational Linguistics, 1999.

[9] Wamberto Weber Vasconcelos et al. An Agent-Based Approach to Web Site Maintenance. ICWE, 2004.

[10] Pavel Zezula et al. Processing XML Queries with Tree Signatures. Intelligent Search on XML Data, 2003.

[11] Michael McGill et al. Introduction to Modern Information Retrieval. 1983.

[12] Charles M. Grinstead et al. Introduction to Probability. 1999.

[13] Veda C. Storey et al. Data Abstractions: Why and How? Data & Knowledge Engineering, 1999.

[14] Steffen Staab et al. SEAL – A Framework for Developing SEmantic Web PortALs. BNCOD, 2001.

[15] Djoerd Hiemstra et al. Retrieving Web Pages Using Content, Links, URLs and Anchors. TREC, 2001.

[16] David C. Hay. Data Model Patterns: Conventions of Thought. 1996.

[17] Ralph Johnson et al. Design Patterns: Elements of Reusable Object-Oriented Software. 1994.

[18] Marc J. Rochkind. The Source Code Control System. IEEE Transactions on Software Engineering, 1975.

[19] Keishi Tajima et al. Archiving scientific data. SIGMOD, 2002.

[20] Glenn Shafer. A Mathematical Theory of Evidence. 1976.

[21] Paolo Merialdo et al. The Araneus Web-based management system. SIGMOD, 1998.

[22] Torsten Schlieder et al. Querying and ranking XML documents. JASIST, 2002.

[23] George A. Miller et al. Introduction to WordNet: An On-line Lexical Database. 1990.

[24] Gustavo Rossi et al. The object-oriented hypermedia design model. CACM, 1995.

[25] Craig Larman. Applying UML and Patterns. 1997.

[26] David A. Bell et al. Generalization of the Dempster-Shafer Theory. IJCAI, 1993.

[27] Hui Wang. Contextual probability. 2003.

[28] David J. Spiegelhalter et al. Machine Learning, Neural and Statistical Classification. 1994.