Object-level Web Information Retrieval

Current Web search engines essentially rank relevance at the document level. However, a large amount of structured information about real-world objects is embedded in static Web pages and online Web databases, and document-level retrieval ranks results for object-oriented queries very inaccurately. In this paper, we propose a paradigm shift that enables searching at the object level. Traditional information retrieval models take the document as the retrieval unit and assume that the content of a document is reliable. This reliability assumption no longer holds in the object retrieval context, where multiple copies of information about the same object usually exist. These copies may be inconsistent because Web sites vary in quality and current information extraction techniques have limited performance. Simply combining the noisy and inaccurate attribute information extracted from different sources does not yield satisfactory retrieval performance. We therefore introduce a probabilistic model that handles the inconsistency problem using source quality information, and our empirical evaluation shows that this object-level model performs significantly better than existing document-level models.
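To make the idea of using source quality concrete, the following is a minimal sketch, not the model proposed in the paper. It assumes each extracted copy of an object comes with a hypothetical accuracy weight in [0, 1], mixes the per-copy term frequencies in proportion to those weights, and smooths the result with a background probability. The names `object_term_prob` and `score`, the smoothing constant, and the background probability are all illustrative assumptions.

```python
import math
from collections import Counter


def object_term_prob(term, records, collection_prob, smoothing=0.5):
    """Accuracy-weighted mixture of per-record term frequencies.

    records: list of (extracted_text, accuracy) pairs for one object.
    The accuracy values are assumed inputs, not derived here.
    """
    total_weight = sum(acc for _, acc in records)
    if total_weight == 0:
        return collection_prob
    mixed = 0.0
    for text, acc in records:
        tokens = text.lower().split()
        if tokens:
            # Relative frequency of the term in this copy, scaled by
            # the copy's share of the total accuracy weight.
            mixed += (acc / total_weight) * Counter(tokens)[term] / len(tokens)
    # Interpolate with a background probability so unseen terms
    # do not produce log(0).
    return (1 - smoothing) * mixed + smoothing * collection_prob


def score(query, records, collection_prob=1e-4):
    """Sum of log term probabilities over the query terms."""
    return sum(
        math.log(object_term_prob(t, records, collection_prob))
        for t in query.lower().split()
    )


# Hypothetical data: the same two copies, with the accuracy weights swapped.
reliable_match = [("probabilistic retrieval model", 0.9), ("garbled text here", 0.1)]
unreliable_match = [("garbled text here", 0.9), ("probabilistic retrieval model", 0.1)]
print(score("retrieval", reliable_match) > score("retrieval", unreliable_match))
```

In this sketch, a query term found in a copy from a high-accuracy source contributes more to the object's score than the same term found only in a low-accuracy copy, which is the intuition behind weighting evidence by source quality.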
