Big Scholarly Data in CiteSeerX: Information Extraction from the Web

We examine CiteSeerX, an intelligent system designed to automatically acquire and organize large-scale collections of scholarly documents from the World Wide Web. From the perspective of automatic information extraction and alternative modes of search, we discuss various functional aspects of this complex system, with an eye towards ongoing and future research developments.
