Building enriched web page representations using link paths

Anchor text has a long history of enriching documents for a variety of tasks on the World Wide Web. Anchor texts are useful because they resemble typical Web queries and because they express the document's context. It is therefore common practice for Web search engines to incorporate incoming anchor text into the document's standard textual representation. However, this approach does not suffice for documents with very few inlinks, and it does not capture the document's full context. To remedy these problems, we employ link paths, which contain the anchor texts from paths through the Web that end at the document in question. We propose and study several different ways to aggregate anchor text from link paths, and we show that the information from link paths can be used to (1) improve known-item search in site-specific search, and (2) map Web pages to database records. We rigorously evaluate our proposed approach on several real-world test collections. We find that our approach significantly improves performance over baseline and existing techniques in both tasks.
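
To make the notion of a link path concrete, the following is a minimal sketch, not the method evaluated in the paper. It assumes a toy link graph given as (source, target, anchor text) triples, enumerates the maximal link paths of bounded length that end at a target page, and aggregates their anchor terms with a weight that decays with distance from the target. The data, the function names `link_paths` and `aggregate`, the path-length bound, and the decay weighting are all illustrative assumptions; the abstract does not specify which aggregation schemes are used.

```python
from collections import Counter, defaultdict

# Hypothetical toy link graph: (source_url, target_url, anchor_text).
LINKS = [
    ("/", "/products", "Products"),
    ("/products", "/products/cameras", "Digital cameras"),
    ("/products/cameras", "/products/cameras/x100", "X100 compact camera"),
    ("/", "/deals", "Deals"),
    ("/deals", "/products/cameras/x100", "X100"),
]


def link_paths(links, target, max_len=3):
    """Return anchor-text sequences of maximal paths (<= max_len links) ending at target."""
    inlinks = defaultdict(list)
    for src, dst, anchor in links:
        inlinks[dst].append((src, anchor))

    paths = []

    def walk(node, anchors, visited):
        extended = False
        if len(anchors) < max_len:
            for src, anchor in inlinks.get(node, []):
                if src in visited:  # skip cycles
                    continue
                extended = True
                walk(src, [anchor] + anchors, visited | {src})
        # Record the path only when it cannot be extended further backwards.
        if not extended and anchors:
            paths.append(anchors)

    walk(target, [], {target})
    return paths


def aggregate(paths, decay=0.5):
    """Sum term weights over all paths; anchors farther from the target count less.

    The exponential decay is an assumed weighting chosen for illustration.
    """
    weights = Counter()
    for path in paths:
        for distance, anchor in enumerate(reversed(path)):
            for term in anchor.lower().split():
                weights[term] += decay ** distance
    return weights


paths = link_paths(LINKS, "/products/cameras/x100")
weights = aggregate(paths)
```

In this toy graph the target page has only two direct inlinks, but the paths leading to it also contribute terms such as "digital", "cameras", and "products", which illustrates how link paths can supply context that the immediate anchor text lacks.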
