Document Type Classification in Online Digital Libraries

Online digital libraries make it easier for researchers to search for scientific information. They have proven to be powerful resources in many data mining, machine learning, and information retrieval applications that require high-quality data. The quality of this data depends heavily on the accuracy of the classifiers that identify the types of documents crawled from the Web (e.g., research papers, slides, books) for appropriate indexing. These classifiers, in turn, depend on the choice of feature representation. We propose novel features that result in high-accuracy classifiers for document type classification. Experimental results on several datasets show that our classifiers outperform models that are employed in current systems.
