A Data Cleaning Method for CiteSeer Dataset

CiteSeer is considered the first academic search engine and has been serving data for almost twenty years. Recently, CiteSeer graciously made all of its data public, including the raw PDF files, the text converted from the PDFs, and the metadata extracted from that text. Numerous efforts have been made to improve the accuracy of metadata extraction, but the problem is inherently challenging and errors remain abundant. In this paper, we propose an innovative record-linkage-based method for data cleaning, which uses two new matching algorithms to significantly improve cleaning performance on the CiteSeer dataset. One is an enhanced matching algorithm for local datasets; the other is developed for online datasets. Experimental results show that our method corrects 48.1% of the wrong metadata entries in total, an improvement of more than 539% over existing state-of-the-art data cleaning methods.
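To make the general idea of record-linkage-based cleaning concrete, the following is a minimal sketch, not the matching algorithms proposed in the paper. It assumes that each record is a dictionary with a `title` field, that a trusted reference dataset is available locally, and that token-level Jaccard similarity with a threshold of 0.6 is an acceptable matching criterion. The function names, the threshold, and the toy records are all illustrative assumptions.

```python
import re


def tokens(title):
    """Lowercase a title and return its set of alphanumeric word tokens."""
    return frozenset(re.findall(r"[a-z0-9]+", title.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 if either set is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def link_records(noisy, reference, threshold=0.6):
    """Map each noisy record id to its best-matching reference record.

    A link is kept only if the similarity reaches the (assumed) threshold.
    This brute-force comparison is quadratic; it is meant for illustration.
    """
    ref_tokens = [(ref, tokens(ref["title"])) for ref in reference]
    links = {}
    for rec in noisy:
        rec_tokens = tokens(rec["title"])
        best, best_score = None, 0.0
        for ref, ref_tok in ref_tokens:
            score = jaccard(rec_tokens, ref_tok)
            if score > best_score:
                best, best_score = ref, score
        if best is not None and best_score >= threshold:
            links[rec["id"]] = best
    return links


def clean(noisy, reference, threshold=0.6):
    """Overwrite the fields of each linked noisy record with reference values."""
    links = link_records(noisy, reference, threshold)
    cleaned = []
    for rec in noisy:
        fixed = dict(rec)
        if rec["id"] in links:
            fixed.update(links[rec["id"]])
        cleaned.append(fixed)
    return cleaned


# Toy records, invented for illustration only.
noisy = [
    {"id": "n1", "title": "Data Cleaning for Digital Libraries Abstract", "year": None},
    {"id": "n2", "title": "Unrelated title about something else", "year": 2001},
]
reference = [
    {"title": "Data Cleaning for Digital Libraries", "year": 2015},
    {"title": "Matching Names in Bibliographic Records", "year": 2010},
]

cleaned = clean(noisy, reference)
for record in cleaned:
    print(record)
```

In this sketch the first toy record is linked to a reference entry and inherits its title and year, while the second has no sufficiently similar counterpart and is left unchanged. A real system would likely need blocking or indexing to avoid comparing every pair of records.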
