Anchor Text Extraction for Academic Search

Anchor text plays an especially important role in improving the performance of general Web search, because it is a relatively objective description of a Web page, written by a potentially large number of other Web pages. Academic search provides indexing and search functionality for academic articles, and it is desirable to use anchor text there as well to improve the quality of search results. The main challenge is that no explicit URLs or anchor text are available for academic articles. In this paper we define a pseudo-URL and automatically assign one to each academic article. We then adopt a machine learning approach to extract pseudo-anchor text for academic articles by exploiting the citation relationships between them. The extracted pseudo-anchor text is indexed and used in computing the relevance scores of academic articles. Experiments conducted on 0.9 million research papers show that our approach is able to dramatically improve search performance.
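
The idea of pseudo-URLs and pseudo-anchor text can be illustrated with a minimal Python sketch. It is not the method of the paper: the `paper://` key format, the function names `pseudo_url` and `extract_pseudo_anchors`, and the sample data are all hypothetical, and a fixed one-sentence window around each citation marker stands in for the machine-learned extractor that the abstract describes.

```python
import hashlib
import re
from collections import defaultdict


def pseudo_url(title: str, year: int) -> str:
    """Hypothetical pseudo-URL: a stable key built from a normalised title and year."""
    normalised = re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()
    digest = hashlib.sha1(f"{normalised}|{year}".encode("utf-8")).hexdigest()[:12]
    return f"paper://{digest}"


def extract_pseudo_anchors(citing_text: str, references: dict) -> dict:
    """Collect the sentence around each citation marker such as [3].

    `references` maps a marker number to the pseudo-URL of the cited paper.
    Assumption: a one-sentence window replaces the learned extractor.
    """
    anchors = defaultdict(list)
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", citing_text.strip())
    for sentence in sentences:
        for marker in re.findall(r"\[(\d+)\]", sentence):
            target = references.get(int(marker))
            if target is not None:
                # Drop the markers so only descriptive words remain.
                cleaned = re.sub(r"\s*\[\d+\]", "", sentence).strip()
                anchors[target].append(cleaned)
    return dict(anchors)


# Made-up citing paper with a single reference.
refs = {1: pseudo_url("A Sample Ranking Model", 2001)}
body = "We score documents with a probabilistic model [1]. Results follow."
anchors = extract_pseudo_anchors(body, refs)
```

Here `anchors` maps the cited paper's pseudo-URL to the citing sentences, which could then be indexed alongside the paper's own text, in the same way that Web anchor text is indexed with its target page.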