Mapping web pages to database records via link paths

In this paper we propose a new knowledge management task: mapping Web pages to their corresponding records in a structured database. For example, the DBLP database contains records for many computer scientists, and most of these people have public Web pages; if we can map each database record to the appropriate Web page, the information on that page can be used to enrich the person's database record. To accomplish this goal we employ link paths, which collect the anchor texts along multiple paths through the Web that end at the Web page in question. We hypothesize that the information from these link paths can be used to generate an accurate mapping from Web pages to database records. Experiments on two large, real-world data sets (DBLP paired with computer science faculty members' Web pages, and IMDB paired with official movie homepages) show that our method does provide an accurate mapping. Finally, we conclude by issuing a call for further research on this promising new task.
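
To make the idea of a link path concrete, the following is a minimal sketch and not the method evaluated in the paper. It assumes each link path is simply a list of anchor texts leading to one target page, and it scores each candidate record name by its average token overlap with those paths. The function names (`tokenize`, `score_record`, `best_record`), the overlap measure, and the sample names and paths are all hypothetical choices made for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score_record(record_name, link_paths):
    """Average, over all link paths, of the fraction of the record's
    tokens that appear somewhere in that path's anchor texts.
    This overlap measure is an assumption, not the paper's scoring."""
    record_tokens = set(tokenize(record_name))
    if not record_tokens or not link_paths:
        return 0.0
    total = 0.0
    for path in link_paths:
        path_tokens = set()
        for anchor in path:
            path_tokens.update(tokenize(anchor))
        total += len(record_tokens & path_tokens) / len(record_tokens)
    return total / len(link_paths)


def best_record(records, link_paths):
    """Return the candidate record name with the highest score."""
    return max(records, key=lambda name: score_record(name, link_paths))


# Hypothetical link paths that all end at the same faculty Web page.
link_paths = [
    ["Computer Science Department", "Faculty", "Jane Doe"],
    ["Research Groups", "Databases", "Prof. Jane Doe"],
]

# Hypothetical record names drawn from a structured database.
records = ["Jane Doe", "John Smith", "Jane Roe"]

print(best_record(records, link_paths))
```

A real system would likely weight anchors by their position in the path and use a retrieval model rather than raw token overlap; the sketch only shows how anchor texts gathered from several paths can be pooled as evidence for one record.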
