CiteSeerX: A Scholarly Big Dataset

The CiteSeerX digital library stores and indexes research articles in Computer Science and related fields. Although its main purpose is to make it easier for researchers to search for scientific information, CiteSeerX has proven to be a powerful resource in many data mining, machine learning, and information retrieval applications that use rich metadata such as titles, abstracts, authors, venues, and reference lists. Metadata extraction in CiteSeerX is done using automated techniques. Although fairly accurate, these techniques still produce noisy metadata. Because the performance of models trained on these data depends heavily on data quality, we propose an approach to CiteSeerX metadata cleaning that incorporates information from an external data source. The result is a subset of CiteSeerX that is substantially cleaner than the entire set. Our goal is to make the new dataset available to the research community to facilitate future work in Information Retrieval.
