Paper information - Combined Content based and Semantic Image Retrieval

Combined Content based and Semantic Image Retrieval

i-score (Image Semantic and COntent based REtrieval system) [1], developed at the Information Processing Laboratory, combines two open source software libraries, Lire [2] and Lucene [3], with the aim of investigating the impact of the images' text descriptions on the quality and effectiveness of image retrieval. In our runs for the ImageCLEF 2009 track, Lucene's default text analysis (stop-word removal and stemming) was performed and Lucene's default scoring function was used to evaluate the queries. In addition, all duplicate image descriptions were removed from the database, and each record was instead given a link referring to a unique text; 39310 unique texts remained in the database. In both tasks, Ad-Hoc and Case-based, semantic retrieval outperformed visual retrieval by far, and consequently mixed retrieval as well. This is reasonable, since at least in our case we used a naive visual retrieval procedure. However, the results give us promising evidence that techniques from textual retrieval can improve image retrieval in both performance and efficiency.

Construction of the Indexes

Two indexes were created automatically: one for the image database, for visual retrieval, and one for the image descriptions, for semantic retrieval. For the image database, the index was created using Lire's DefaultDocumentBuilder, with SimpleAnalyzer as the Analyzer. As a result, the low-level features kept for each image are ScalableColor, ColorLayout and EdgeHistogram, as defined in MPEG-7.

For the text database, the HTML tags were removed first. Then all duplicate texts were removed and each record was instead given a link referring to a unique text, leaving 39310 unique texts in the database. The index was based on the Lucene library, and for each field of the image records that we want to be able to search, the following analysis was performed. The LowerCaseTokenizer was used, which splits the text wherever a character is not a letter. Lucene's standard stop-word list was used and Porter's stemming algorithm was applied to the remaining terms. Finally, a LengthFilter was used to remove terms that are either too short or too long to be passed on in the token stream.
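The preparation of the text index described above (HTML removal, de-duplication with links, and the analysis chain) can be illustrated with a minimal, self-contained Python sketch. It does not use the Lucene API and is not the authors' code. The function names, the short stop-word list, the identity placeholder standing in for Porter's stemmer, and the length bounds `min_len` and `max_len` are all assumptions made for illustration, since the source does not give these values.

```python
import re
from html.parser import HTMLParser

# Illustrative stop-word list (an assumption; not Lucene's actual list).
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to", "with"}


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(raw):
    """Remove HTML tags and normalise whitespace."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def deduplicate(descriptions):
    """Map each image id to the index of a unique text.

    descriptions: dict of image_id -> description text.
    Returns (unique_texts, links), where links[image_id] is the
    position of that image's description in unique_texts.
    """
    unique_texts, index_of, links = [], {}, {}
    for image_id, text in descriptions.items():
        if text not in index_of:
            index_of[text] = len(unique_texts)
            unique_texts.append(text)
        links[image_id] = index_of[text]
    return unique_texts, links


def analyze(text, stem=lambda term: term, min_len=2, max_len=30):
    """Lowercase, split on non-letters, drop stop words, stem, filter by length.

    stem is a placeholder hook: pass a real Porter stemmer to mirror the
    described chain. min_len and max_len are hypothetical bounds.
    """
    tokens = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters only
    tokens = [t for t in tokens if t not in STOP_WORDS]
    tokens = [stem(t) for t in tokens]
    return [t for t in tokens if min_len <= len(t) <= max_len]
```

For example, two image records that share the same description end up linked to a single unique text, and only that text would need to be analysed and indexed once.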

Theodore Kalamboukis | Ioannis Boutsis