Cleaning Noisy and Heterogeneous Metadata for Record Linking Across Scholarly Big Datasets

Metadata automatically extracted from scholarly documents in PDF format is usually noisy and heterogeneous, often containing incomplete fields and erroneous values. One common way of cleaning such metadata is to match it against a bibliographic reference dataset. The challenge is to match records between corpora with high precision. The existing solution, which is based on information retrieval and string similarity on titles, works well only if the titles are clean. We introduce a system designed to match scholarly document entities with noisy metadata against a reference dataset. The blocking function uses the classic BM25 algorithm to find matching candidates in the reference data, which has been indexed by Elasticsearch. The core components use supervised methods that combine features extracted from all available metadata fields. The system also leverages available citation information to match entities. The combination of metadata and citation information achieves high accuracy and significantly outperforms the baseline method on the same test dataset. We apply this system to match the CiteSeerX database against Web of Science, PubMed, and DBLP. This method will be deployed in the CiteSeerX system to clean metadata and link records to other scholarly big datasets.
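To make the blocking step concrete, the following is a minimal sketch of BM25 candidate retrieval over reference titles. It is not the system's implementation: the system described above relies on an Elasticsearch index, whereas this sketch scores a small in-memory list in pure Python. The function names (`tokenize`, `bm25_rank`), the whitespace tokenizer, the parameter values `k1=1.2` and `b=0.75`, the particular IDF variant, and the sample titles are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (an assumed, deliberately simple tokenizer)."""
    return text.lower().split()


def bm25_rank(query, docs, k1=1.2, b=0.75, top_k=3):
    """Return up to top_k (doc_index, score) pairs ranked by BM25 score.

    k1 and b are assumed, commonly used values; documents that share
    no term with the query are dropped from the result.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs

    # Document frequency: number of documents containing each term.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))

    query_terms = set(tokenize(query))
    scored = []
    for idx, toks in enumerate(tokenized):
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # One common non-negative IDF variant.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        if score > 0:
            scored.append((idx, score))

    # Highest score first; ties broken by original position.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_k]


# Hypothetical reference titles and a noisy extracted title.
reference_titles = [
    "CiteSeer: an automatic citation indexing system",
    "Cleaning noisy and heterogeneous metadata for record linking across scholarly big datasets",
    "Entity matching across heterogeneous sources",
]
noisy_title = "cleaning noisy heterogeneous metadata record linking"

candidates = bm25_rank(noisy_title, reference_titles, top_k=2)
for idx, score in candidates:
    print(f"{score:.3f}  {reference_titles[idx]}")
```

In a pipeline of the kind described, the retrieved candidates would then be passed to a supervised matcher that compares the remaining metadata fields; that stage is not shown here.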
