A Supervised Learning Approach To Entity Matching Between Scholarly Big Datasets

Bibliographic metadata in scientific documents is essential for indexing and retrieving scholarly big data, both in production search engines and in bibliometric research. Crawl-based digital library search engines can harvest millions of documents efficiently, but the metadata produced by automatic extractors is often noisy, incomplete, or corrupted by parsing errors. Such metadata can be cleaned against a reference database. In this work, we develop a supervised machine learning approach that matches entities in a target database to those in a reference database; the matches can then be used to clean metadata in the target database. The approach leverages a number of features derived from the header fields produced by automatic extraction. After tuning hyper-parameters and sampling strategies, the best Support Vector Machine, Logistic Regression, Random Forest, and Naïve Bayes models perform comparably, with an F1-measure of about 90% under cross-validation, outperforming a method based on information retrieval alone by about 14%.
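
To illustrate the kind of pairwise features such a matcher might consume, here is a minimal sketch of turning one (target, reference) record pair into a numeric vector. The abstract does not list the actual features, so the three used here (title token overlap, author last-name overlap, year agreement), the field names `title`, `authors`, and `year`, and all function names are assumptions made for illustration only. The sample records are invented.

```python
import re


def tokens(text):
    """Lowercase a string and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def last_names(authors):
    """Lowercased last whitespace-separated token of each non-empty name."""
    return {name.split()[-1].lower() for name in authors if name.strip()}


def pair_features(target, reference):
    """Build a feature vector for one candidate (target, reference) pair.

    Hypothetical features, not taken from the paper:
      [title token Jaccard, author last-name Jaccard, exact year match]
    """
    title_sim = jaccard(tokens(target["title"]), tokens(reference["title"]))
    author_sim = jaccard(last_names(target["authors"]),
                         last_names(reference["authors"]))
    same_year = 1.0 if target["year"] == reference["year"] else 0.0
    return [title_sim, author_sim, same_year]


# Invented example records: a noisy extracted header and a clean reference.
extracted = {"title": "A Supervised Learning Approach to Entity Matching",
             "authors": ["J. Smith", "A. Lee"], "year": 2018}
clean = {"title": "Supervised learning approach to entity matching",
         "authors": ["Jane Smith", "Alice Lee"], "year": 2018}

features = pair_features(extracted, clean)
```

Vectors like `features`, labeled as match or non-match, could then be used to train any of the classifiers named in the abstract. Candidate pairs would typically come from a cheaper retrieval step, so the classifier only scores a shortlist rather than every possible pair.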
