Author Homepage Discovery in CiteSeerX

Scholarly digital libraries provide access to scientific publications and are useful resources for researchers. CiteSeerX is one such digital library search engine, providing access to more than 10 million academic documents. We propose a novel search-driven approach to build and maintain a large collection of homepages that can be used as seed URLs in any digital library, including CiteSeerX, to crawl scientific documents. Specifically, we integrate Web search and classification in a unified approach to discover new homepages: first, we use publicly available author names and research paper titles as queries to a Web search engine to find relevant content, and then we identify the correct homepages among the search results using a deep learning classifier based on Convolutional Neural Networks. Moreover, we use self-training to reduce the labeling effort and to exploit unlabeled data when training an efficient researcher homepage classifier. Our experiments on a large-scale dataset highlight the effectiveness of our approach and position Web search as an effective method for acquiring authors' homepages. We describe the development and deployment of the proposed approach in CiteSeerX, along with its maintenance requirements.
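The self-training step mentioned above can be illustrated with a minimal sketch. This is not the classifier described in the abstract: a tiny bag-of-words naive Bayes model stands in for the Convolutional Neural Network so that the example runs with the standard library alone. The function names (`train_nb`, `predict_proba`, `self_train`), the confidence threshold, the label names, and the toy token lists are all assumptions made for illustration.

```python
import math
from collections import Counter


def train_nb(docs, labels):
    """Fit a tiny multinomial naive Bayes model on tokenized documents."""
    counts = {c: Counter() for c in sorted(set(labels))}
    for doc, y in zip(docs, labels):
        counts[y].update(doc)
    return {
        "classes": sorted(counts),
        "vocab": {tok for doc in docs for tok in doc},
        "counts": counts,
        "priors": Counter(labels),
        "n": len(labels),
    }


def predict_proba(model, doc):
    """Return {class: probability}, using Laplace smoothing."""
    vocab_size = len(model["vocab"])
    log_scores = {}
    for c in model["classes"]:
        total = sum(model["counts"][c].values())
        score = math.log(model["priors"][c] / model["n"])
        for tok in doc:
            if tok in model["vocab"]:  # tokens never seen in training are skipped
                score += math.log((model["counts"][c][tok] + 1) / (total + vocab_size))
        log_scores[c] = score
    top = max(log_scores.values())
    exp = {c: math.exp(s - top) for c, s in log_scores.items()}
    norm = sum(exp.values())
    return {c: e / norm for c, e in exp.items()}


def self_train(labeled, unlabeled, threshold=0.8, max_rounds=5):
    """Self-training: repeatedly pseudo-label confident unlabeled pages and retrain.

    `threshold` and `max_rounds` are illustrative values, not taken from the paper.
    """
    docs = [d for d, _ in labeled]
    labels = [y for _, y in labeled]
    pool = list(unlabeled)
    model = train_nb(docs, labels)
    for _ in range(max_rounds):
        remaining, added = [], 0
        for doc in pool:
            probs = predict_proba(model, doc)
            label, conf = max(probs.items(), key=lambda kv: kv[1])
            if conf >= threshold:
                docs.append(doc)
                labels.append(label)  # pseudo-label, not a human annotation
                added += 1
            else:
                remaining.append(doc)
        if added == 0:
            break  # nothing confident enough; stop early
        pool = remaining
        model = train_nb(docs, labels)
    return model, len(labels)


# Toy data: token lists standing in for pages returned by a Web search.
labeled = [
    (["homepage", "research", "publications", "cv"], "homepage"),
    (["professor", "teaching", "publications", "students"], "homepage"),
    (["buy", "shop", "price", "cart"], "other"),
    (["news", "sports", "weather", "price"], "other"),
]
unlabeled = [
    ["publications", "cv", "teaching", "lab"],
    ["shop", "cart", "news", "deals"],
]

model, n_training_examples = self_train(labeled, unlabeled)
probs = predict_proba(model, ["research", "publications"])
print(max(probs, key=probs.get), n_training_examples)
```

In a real deployment the pages would come from search results for author-name and paper-title queries, and the threshold would need tuning, since a low value lets wrong pseudo-labels reinforce themselves across rounds.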
