Towards creating a knowledge base for World-Wide Web documents

The lack of organization of information on the Web makes information retrieval inefficient, and several approaches to improving it have been suggested. We propose a document knowledge base that holds semantic and structural information about the retrievable documents, extracted from the documents themselves. We show that such a knowledge base offers a number of advantages, including advanced query functionality. We also discuss how such a knowledge base can be created and, in particular, show how structural information can be extracted automatically from HTML documents and added to the document knowledge base.
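
As a rough illustration of what extracting structural information from an HTML document might involve, the following minimal sketch collects a document's title, headings, and outgoing links using Python's standard `html.parser` module. It is not the extraction method described in this paper: the names `StructureExtractor` and `extract_structure`, and the choice of title, headings, and links as the structural features, are assumptions made only for this example.

```python
from html.parser import HTMLParser


class StructureExtractor(HTMLParser):
    """Hypothetical helper: collects title, headings, and link targets."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.headings = []    # list of (level, text) pairs
        self.links = []       # href values of <a> elements
        self._current = None  # tag whose text is being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag == "title" or tag in self.HEADINGS:
            self._current = tag
            self._buffer = []
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._current:
            return
        # Collapse runs of whitespace in the collected text.
        text = " ".join("".join(self._buffer).split())
        if tag == "title":
            self.title = text
        else:
            self.headings.append((int(tag[1]), text))
        self._current = None


def extract_structure(html):
    """Return a dictionary of structural facts (assumed record layout)."""
    parser = StructureExtractor()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title,
        "headings": parser.headings,
        "links": parser.links,
    }
```

A record of this kind could serve as one entry in a document knowledge base, where the heading levels give an outline of the document and the links relate it to other documents. The record layout here is only one possible choice.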