Wrapping XML-Sources to Support Update Awareness

Data warehousing is a generally accepted method of providing corporate decision support. Today, the majority of information in these warehouses originates from sources within a company, although the changes that affect a company often originate outside it. Companies therefore need to look outside their enterprises for valuable information that increases their knowledge of customers, suppliers, competitors, etc. The largest and most frequently accessed information source today is the Web, which holds more and more useful business information. Today, the Web primarily relies on HTML, which makes mechanical extraction of information difficult. In the near future, XML is expected to replace HTML as the language of the Web, bringing more structure and a stronger focus on content. One problem when considering XML-sources in a data warehouse context is their lack of update awareness capabilities, which restricts the data warehouse maintenance policies that can be used. In this work, we wrap XML-sources in order to provide such capabilities. We have implemented a wrapper prototype that provides update awareness capabilities for autonomous XML-sources, especially change awareness, change activeness, and delta awareness. The prototype wrapper complies with recommendations and working drafts proposed by the W3C and is therefore compatible with most off-the-shelf XML tools. In particular, the change information produced by the wrapper is based on methods defined by the DOM, which means that any DOM-compliant software, including most off-the-shelf XML processing tools, can be used to incorporate the changes identified in a source into an older version of it. For the delta awareness capability, we have investigated the possibility of using change detection algorithms proposed for semi-structured data. We have identified similarities and differences between XML and semi-structured data that affect delta awareness for XML-sources. As a result of this effort, we propose an algorithm for change detection in XML-sources. We also propose matching criteria to which XML-documents have to conform in order to be subject to the change awareness extension.
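
To illustrate the idea that changes expressed through DOM methods can be applied to an older version of a source by any DOM-compliant tool, the following is a minimal sketch using Python's standard `xml.dom.minidom`. The delta format (a list of `insert`, `delete`, and `update` operations), the sample document, and the helper names `apply_delta` and `find_by_id` are assumptions made for this illustration only; they are not the format or the algorithm produced by the wrapper prototype.

```python
from xml.dom.minidom import parseString

# Hypothetical older version of an XML-source (illustrative data only).
OLD = "<catalog><item id='a'>10</item><item id='b'>20</item></catalog>"


def find_by_id(root, ident):
    """Return the direct child element of root whose 'id' attribute equals ident."""
    for child in root.childNodes:
        if child.nodeType == child.ELEMENT_NODE and child.getAttribute("id") == ident:
            return child
    raise KeyError(ident)


def apply_delta(doc, delta):
    """Apply a list of (operation, arguments) pairs using only standard DOM methods.

    The operation names and argument layout are assumptions for this sketch.
    """
    root = doc.documentElement
    for op, args in delta:
        if op == "insert":
            # args: (tag name, id value, text content)
            tag, ident, text = args
            node = doc.createElement(tag)
            node.setAttribute("id", ident)
            node.appendChild(doc.createTextNode(text))
            root.appendChild(node)
        elif op == "delete":
            # args: id value of the element to remove
            root.removeChild(find_by_id(root, args))
        elif op == "update":
            # args: (id value, new text content)
            ident, text = args
            find_by_id(root, ident).firstChild.data = text
        else:
            raise ValueError(f"unknown operation: {op}")
    return doc


if __name__ == "__main__":
    delta = [
        ("update", ("a", "11")),
        ("delete", "b"),
        ("insert", ("item", "c", "30")),
    ]
    new_doc = apply_delta(parseString(OLD), delta)
    print(new_doc.documentElement.toxml())
```

Because the sketch relies only on `createElement`, `createTextNode`, `setAttribute`, `appendChild`, and `removeChild`, the same sequence of operations could in principle be replayed by any other DOM implementation against its own copy of the older document.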