Multi-Databases: Removal of Redundant Information

Abstract: This effort included the partial development of a search engine for multimedia web documents and the complete implementation of a prototype methodology for removing (partially or totally) redundant information from multiple documents in order to synthesize new documents. A typical multimedia document contains free text and images and also has associated well-structured data. An SQL-like query language, WebSSQL, has been used to retrieve these types of documents. WebSSQL differs from other proposed SQL extensions for retrieving web documents in two main ways: it is similarity-based, and it supports conditions on images. This report also describes a software methodology for detecting and removing redundant information (text paragraphs and images) from multiple retrieved documents. Documents reporting the same or related events and stories may contain substantial redundant information. Removing the redundant information and synthesizing these documents into a single document saves both the time a user needs to acquire the information and the storage space needed to archive the data. The methodology reported here consists of techniques for analyzing text paragraphs and images, together with a set of similarity criteria used to detect redundant paragraphs and images. It can either process text paragraphs and images independently or combine both in one synthesized document.
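The abstract does not state which similarity criteria the report uses, so the following is only an illustrative sketch of how similarity-based detection of redundant text paragraphs could work. It assumes a plain term-frequency cosine similarity and a hypothetical threshold of 0.8. The function names (`term_vector`, `cosine_similarity`, `remove_redundant`) are invented for this example and are not taken from the report.

```python
import math
import re
from collections import Counter


def term_vector(paragraph):
    """Return lowercased word counts for one paragraph."""
    return Counter(re.findall(r"[a-z0-9]+", paragraph.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two term-count vectors (0.0 if either is empty)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def remove_redundant(paragraphs, threshold=0.8):
    """Keep a paragraph only if it is not too similar to any paragraph already kept.

    The threshold value is an assumption chosen for illustration.
    """
    kept = []
    kept_vectors = []
    for paragraph in paragraphs:
        vector = term_vector(paragraph)
        if all(cosine_similarity(vector, kv) < threshold for kv in kept_vectors):
            kept.append(paragraph)
            kept_vectors.append(vector)
    return kept


if __name__ == "__main__":
    paragraphs = [
        "The storm hit the coast on Monday.",
        "The storm hit the coast on Monday!",
        "Stock markets rose sharply.",
    ]
    for paragraph in remove_redundant(paragraphs):
        print(paragraph)
```

In this sketch the first paragraph seen is the one retained, so the order of the retrieved documents determines which copy of a redundant paragraph survives. A comparable pass over image feature vectors would be needed to cover the image side of the methodology.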
