Personal Names Popularity Estimation and Its Application to Record Linkage

In this study, we investigate several statistical techniques for estimating the popularity of personal names and run a record linkage experiment guided by the resulting estimates. The results show that name popularity estimates can improve personal name matching in databases and may be of interest in many other domains.
