Knowledge management and XML: derivation of synthetic views over semi-structured data

One of the effects of the expansion of the World Wide Web is the production of a huge amount of data of different types, available to a large number of different users. Furthermore, the constant progress of computer hardware technology over the past three decades has led to the availability of powerful computers, data collection equipment, and storage media; this technology gives a great boost to the database and information industry by allowing transaction management, information retrieval, and data analysis over massive amounts of heterogeneous data. Moreover, the explosion of the Internet has increased the availability of data in different formats: structured (e.g. relational), semistructured (e.g. HTML, XML) and unstructured (e.g. plain text, audio/video) data [2]. Thus, new data management systems, able to take advantage of these heterogeneous data, are emerging and will play a vital role in the information industry.

Knowledge Management is concerned with the technological, economic and organizational aspects related to (i) the creation, distribution, diversification and sharing of knowledge in complex organizations and (ii) the management of information flows, processes and interactions with external knowledge [8].

Figure 1 summarizes the steps (each represented on a different level of the pyramid) through which knowledge is typically extracted from basic data. The first three levels regard the management of explicit knowledge (i.e. codified, structured or semistructured, and completely available). In particular, starting from the bottom, the first level is concerned with storing and exchanging "factual" knowledge, essentially corresponding to basic data. Technologies used here comprise Databases [17], Data Repositories, Archive Sharing tools and the emerging Extensible Markup Language (XML) [18].
The second level regards "conceptual knowledge" modeling, i.e. the definition of concepts and of the relationships among them. Such knowledge is typically represented by means of diagram-based formalisms for both information and related processes [9]. The Unified Modeling Language (UML) is currently one of the most promising modeling languages; it is oriented towards the specification, implementation and documentation of complex software systems, but is also used for modeling company processes not strictly related to software.

The third level is concerned with the organization and integration of information represented according to heterogeneous formalisms. The techniques used here are essentially those of Data Warehousing (DW) [10]. Data warehouses are integrated repositories of data extracted from multiple heterogeneous sources, organized under a unified schema and at a single site, in order to facilitate management and decision making. Data Warehousing technologies include data cleaning, data integration, and Online Analytical Processing (OLAP), i.e. analysis techniques based on aggregation and summarization.

The highest level regards Knowledge Discovery, i.e. the uncovering of new, implicit and potentially useful knowledge from large amounts of data. The core phase of knowledge discovery is Data Mining [10], an interactive, iterative, multi-step process, comprising in particular pattern searching and possible refinements on the basis of domain experts' knowledge.

In the context of explicit knowledge management, the Extensible Markup Language finds a natural place. XML, a language for semistructured data [1, 5] defined by the World Wide Web Consortium (W3C) [13], is designed to allow marking up, transferring and reusing information by means of a standard method for defining document structure and format.
Its metalanguage features have been used in knowledge management typically for (i) the semi-automatic production of documents, (ii) the reuse of semistructured information and its integration in heterogeneous systems, and (iii) the creation of knowledge maps for the organization and sharing of information.

The increasing quantity of available semistructured data, and the use of XML for their description and exchange, opens new research themes related to the management of, and knowledge extraction from, XML data. In this scenario, our proposal consists of a system for the synthesization of XML documents that attempts to extract their semantics and to derive synthetic versions of them by means of a multidimensional interpretation [10]. In the context of Knowledge Management, data synthesization can be regarded as a new way of extracting knowledge, by discovering and aggregating (useful) core information and neglecting (useless) details.
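
To make the idea of aggregating core information while neglecting details more concrete, the following is a minimal sketch and not the system proposed here. It assumes a hypothetical XML document of repeated <sale> elements, treats one attribute as a dimension and another as a numeric measure, and rolls the records up in an OLAP-like fashion using Python's standard xml.etree.ElementTree module. The element names, attribute names and the synthesize function are illustrative assumptions only.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical input document: tag and attribute names are illustrative only.
SALES_XML = """
<sales>
  <sale region="North" product="books" amount="120"/>
  <sale region="North" product="music" amount="80"/>
  <sale region="South" product="books" amount="50"/>
  <sale region="South" product="books" amount="70"/>
</sales>
"""


def synthesize(xml_text, record_tag, dimension, measure):
    """Roll up repeated elements along one dimension, summing one measure.

    Attributes other than the chosen dimension and measure are treated
    as details and dropped from the synthetic document.
    """
    root = ET.fromstring(xml_text.strip())
    totals = defaultdict(float)
    counts = defaultdict(int)
    for record in root.iter(record_tag):
        key = record.get(dimension)
        value = record.get(measure)
        if key is None or value is None:
            continue  # skip records lacking the dimension or the measure
        totals[key] += float(value)
        counts[key] += 1

    # Build the synthetic document: one element per dimension value.
    summary = ET.Element(root.tag, {"synthesized-by": dimension})
    for key in sorted(totals):
        ET.SubElement(
            summary,
            record_tag,
            {dimension: key, "count": str(counts[key]), measure: str(totals[key])},
        )
    return summary


if __name__ == "__main__":
    result = synthesize(SALES_XML, "sale", "region", "amount")
    print(ET.tostring(result, encoding="unicode"))
```

Here the four detailed records collapse into one element per region, carrying a record count and the summed amount; the product attribute is the neglected detail. Choosing which attributes act as dimensions and which as measures is exactly the interpretation step that such a sketch leaves to the caller.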

[1]  Dan Suciu, et al. Data on the Web: From Relations to Semistructured Data and XML, 1999.

[2]  Nicolás Marín, et al. Review of Data on the Web: from relational to semistructured data and XML by Serge Abiteboul, Peter Buneman, and Dan Suciu. Morgan Kaufmann 1999, 2003, SGMD.

[3]  Mark A. Roth, et al. Database compression, 1993, SGMD.

[4]  C. M. Sperberg-McQueen, et al. Extensible markup language, 1997.

[5]  M. Fischetti. Working knowledge, 2003, Scientific American.

[6]  Valter Crescenzi, et al. RoadRunner: Towards Automatic Data Extraction from Large Web Sites, 2001, VLDB.

[7]  Mario Cannataro, et al. Semantic Lossy Compression of XML Data, 2001, KRDB.

[8]  Jeffrey D. Ullman. Principles of database and knowledge-base systems, 1989.

[9]  Peter Buneman, et al. Semistructured data, 1997, PODS.

[10]  Dan Suciu, et al. XMill: an efficient compressor for XML data, 2000, SIGMOD '00.

[11]  Jeffrey D. Ullman, et al. Principles of Database and Knowledge-Base Systems, Volume II, 1988, Principles of computer science series.

[12]  Daniel S. Hirschberg, et al. Data compression, 1987, CSUR.

[13]  Valter Crescenzi, et al. RoadRunner: automatic data extraction from data-intensive web sites, 2002, SIGMOD '02.

[14]  Serge Abiteboul, et al. Querying Semi-Structured Data, 1997, Encyclopedia of Database Systems.

[15]  Georg Gottlob, et al. Visual Web Information Extraction with Lixto, 2001, VLDB.

[16]  Francesco Buccafurri, et al. Estimating Range Queries Using Aggregate Data with Integrity Constraints: A Probabilistic Approach, 2001, ICDT.

[17]  Petra Perner, et al. Data Mining - Concepts and Techniques, 2002, Künstliche Intell.