Searching 100M Images by Content Similarity

In this paper we present the web user interface of a scalable, distributed system for image retrieval based on visual features and annotated text, developed in the context of the SAPIR project. Its architecture uses Peer-to-Peer networks to achieve scalability and efficiency, allowing the management of huge amounts of data and simultaneous access by a large number of users. By describing the SAPIR web user interface, we want to encourage end users to use SAPIR to search by content similarity, together with the usual text search, on a large image collection (100 million images crawled from Flickr) with realistic response times. On the basis of the statistics collected, it will be possible, for the first time, to study user behavior (e.g., the way users combine text and image content search) in this new realistic environment.
