Developing a Temporal Bibliographic Data Set for Entity Resolution

Entity resolution is the process of identifying groups of records, within or across data sets, where each group represents a single real-world entity. Novel techniques that use temporal features to improve the quality of entity resolution have recently attracted significant attention. However, no large data sets are currently available that contain both temporal information and ground truth information for evaluating temporal entity resolution approaches. In this paper, we describe the preparation of a temporal data set based on author profiles extracted from the Digital Bibliography and Library Project (DBLP). We completed missing links between publications and author profiles in the DBLP data set using the DBLP public API. We then used the Microsoft Academic Graph (MAG) to link temporal affiliation information to DBLP authors. Using information in DBLP such as alternative author names and personal web profiles, we selected around 80K (1%) of the author profiles, covering 2 million (50%) publications, to improve the reliability of the resulting ground truth while keeping the data set challenging for temporal entity resolution research.
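As a rough illustration of the kind of record such a data set might hold, the following Python sketch models an author profile with time-stamped affiliations and a simple reliability filter. The class, the field names, and the rule "keep a profile if it has an alternative name or a homepage" are assumptions made for this example; they are not the DBLP or MAG schema, nor necessarily the selection criteria used in the paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class AuthorProfile:
    # Hypothetical fields for illustration only (not the DBLP/MAG schema).
    profile_id: str
    name: str
    alt_names: List[str] = field(default_factory=list)
    homepage: Optional[str] = None
    # (year, affiliation) pairs: the temporal information attached to a profile.
    affiliations: List[Tuple[int, str]] = field(default_factory=list)


def is_reliable(profile: AuthorProfile) -> bool:
    """Assumed rule: keep profiles with an alternative name or a homepage."""
    return bool(profile.alt_names) or profile.homepage is not None


def affiliation_timeline(profile: AuthorProfile) -> List[Tuple[int, str]]:
    """Return the affiliations ordered by year."""
    return sorted(profile.affiliations)


profiles = [
    AuthorProfile("p1", "A. Example", alt_names=["Alice Example"],
                  affiliations=[(2015, "Univ B"), (2010, "Univ A")]),
    AuthorProfile("p2", "B. Sample"),
]
selected = [p for p in profiles if is_reliable(p)]
```

In this sketch only `p1` passes the filter, and `affiliation_timeline(selected[0])` lists its affiliations in chronological order.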
