CiteSeerX: Intelligent Information Extraction and Knowledge Creation from Web-Based Data

To provide convenient access to web-based data, intelligent systems such as CiteSeerX are developed to construct a knowledge base from this unstructured information. CiteSeerX does this autonomously, leveraging utility-based feedback control to minimize computational resource usage and incorporating user input to correct automatically extracted metadata [26]. The rich metadata that CiteSeerX extracts has been used in many data mining projects. CiteSeerX provides free access to over 4 million full-text academic documents and to rarely seen functionalities, e.g., table search.

[1] C. Lee Giles, et al. Disambiguating authors in academic publications using random forests, 2009, JCDL '09.

[2] Wenyi Huang, et al. Towards building a scholarly big data platform: Challenges, lessons and opportunities, 2014, IEEE/ACM Joint Conference on Digital Libraries.

[3] James Ze Wang, et al. Automated analysis of images in documents for intelligent document search, 2009, International Journal on Document Analysis and Recognition (IJDAR).

[4] Zhaohui Wu, et al. Measuring Term Informativeness in Context, 2013, NAACL.

[5] Corinna Cortes, et al. Support-Vector Networks, 1995, Machine Learning.

[6] Cornelia Caragea, et al. CiteSeerX: AI in a Digital Library Search Engine, 2014, AI Mag.

[7] C. Lee Giles, et al. Figure Metadata Extraction from Digital Documents, 2013, 2013 12th International Conference on Document Analysis and Recognition.

[8] Xiaolong Zhang, et al. CollabSeer: a search engine for collaboration discovery, 2011, JCDL '11.

[9] Cornelia Caragea, et al. Specialized Research Datasets in the CiteSeerx Digital Library, 2012, D Lib Mag.

[10] Christopher M. Bishop, et al. Pattern Recognition and Machine Learning (Information Science and Statistics), 2006.

[11] Patrice Lopez, et al. GROBID: Combining Automatic Bibliographic Data Recognition and Term Extraction for Scholarship Publications, 2009, ECDL.

[12] Andrew McCallum, et al. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, 2001, ICML.

[13] C. Lee Giles, et al. Automatic Detection of Pseudocodes in Scholarly Documents Using Machine Learning, 2013, 2013 12th International Conference on Document Analysis and Recognition.

[14] Hung-Hsuan Chen, et al. CSSeer: an expert recommendation system based on CiteseerX, 2013, JCDL '13.

[15] Jian Fan, et al. Layout and Content Extraction for PDF Documents, 2004, Document Analysis Systems.

[16] Cornelia Caragea, et al. Automatic Identification of Research Articles from Crawled Documents, 2014, WSDM 2014.

[17] Jöran Beel, et al. Evaluation of header metadata extraction approaches and tools for scientific PDF documents, 2013, JCDL '13.

[18] Kun Bai, et al. TableSeer: automatic table metadata extraction and searching in digital libraries, 2007, JCDL '07.

[19] C. Lee Giles, et al. A classification scheme for algorithm citation function in scholarly works, 2013, JCDL '13.

[20] Madian Khabsa, et al. Digital commons, 2020, Internet Policy Rev.

[21] Zhaohui Wu, et al. Can back-of-the-book indexes be automatically created?, 2013, CIKM.

[22] Cornelia Caragea, et al. On identifying academic homepages for digital libraries, 2011, JCDL '11.

[23] Zhaohui Wu, et al. Utility-Based Control Feedback in a Digital Library Search Engine: Cases in CiteSeerX, 2014, Feedback Computing.

[24] Madian Khabsa, et al. Scholarly big data information extraction and integration in the CiteSeerχ digital library, 2014, 2014 IEEE 30th International Conference on Data Engineering Workshops.

[25] Edward A. Fox, et al. Automatic document metadata extraction using support vector machines, 2003, 2003 Joint Conference on Digital Libraries, 2003. Proceedings.

[26] C. Lee Improving Algorithm Search Using the Algorithm Co-Citation Network, 2012.

[27] Zhaohui Wu, et al. Table of Contents Recognition and Extraction for Heterogeneous Book Documents, 2013, 2013 12th International Conference on Document Analysis and Recognition.

[28] C. Lee Giles, et al. ParsCit: an Open-source CRF Reference String Parsing Package, 2008, LREC.

[29] Madian Khabsa, et al. AckSeer: a repository and search engine for automatically extracted acknowledgments from digital libraries, 2012, JCDL '12.

[30] Wenyi Huang, et al. Recommending citations: translating papers into references, 2012, CIKM.