Disambiguation and co-authorship networks of the U.S. patent inventor database (1975–2010)

Research into invention, innovation policy, and technology strategy can benefit greatly from an accurate understanding of inventor careers. The United States Patent and Trademark Office does not provide unique inventor identifiers, however, which makes large-scale studies challenging. Many scholars of innovation have implemented ad hoc disambiguation methods based on string-similarity thresholds and string matching; such methods have been shown to be vulnerable to a number of problems that can adversely affect research results. The authors address this issue by contributing (1) an application of the Author-ity disambiguation approach (Torvik et al., 2005; Torvik and Smalheiser, 2009) to the U.S. utility patent database, (2) a new iterative blocking scheme that expands the match space of this algorithm while maintaining scalability, (3) a public posting of the algorithm and code, and (4) a public posting of the results of the algorithm in the form of a database of inventors and their associated patents. The paper provides an overview of the disambiguation method, assesses its accuracy, and calculates network measures based on co-authorship and collaboration variables. It illustrates the potential for large-scale innovation studies across time and space with visualizations of inventor mobility across the United States. The complete input and results data from the original disambiguation are available at http://dvn.iq.harvard.edu/dvn/dv/patent; the revised data described here are at http://funglab.berkeley.edu/pub/disamb_no_postpolishing.csv; the original and revised code is available at https://github.com/funginstitute/disambiguator; and visualizations of inventor mobility are at http://funglab.berkeley.edu/mobility/.
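To make the idea of iterative blocking concrete, the following minimal Python sketch groups inventor records into blocks by a key, compares records only within each block, and then repeats with a looser key so that more candidate pairs are considered. It is an illustration only, not the authors' implementation: the toy records, the blocking keys, and the `same_inventor` rule are all assumptions made for this example, and the rule is a simple stand-in for the probabilistic Author-ity comparison named in the abstract.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy records: (record_id, first_name, last_name, city).
RECORDS = [
    (1, "John", "Smith", "Boston"),
    (2, "Jon", "Smith", "Boston"),
    (3, "John", "Smith", "Austin"),
    (4, "Jane", "Smyth", "Boston"),
]


def find(parent, x):
    """Return the cluster representative of x (union-find with path halving)."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def union(parent, a, b):
    """Merge the clusters containing a and b."""
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[max(ra, rb)] = min(ra, rb)


def same_inventor(a, b):
    """Placeholder match rule (an assumption, not the paper's model):
    same last name, same city, and same first initial."""
    return a[2] == b[2] and a[3] == b[3] and a[1][0] == b[1][0]


def iterative_blocking(records, key_funcs):
    """Run one blocking round per key, from strict to loose, and return
    the resulting clusters as sorted lists of record ids."""
    parent = {r[0]: r[0] for r in records}
    for key in key_funcs:
        blocks = defaultdict(list)
        for r in records:
            blocks[key(r)].append(r)
        # Pairs are compared only inside a block, which limits the
        # number of comparisons relative to an all-pairs scan.
        for block in blocks.values():
            for a, b in combinations(block, 2):
                if same_inventor(a, b):
                    union(parent, a[0], b[0])
    clusters = defaultdict(list)
    for r in records:
        clusters[find(parent, r[0])].append(r[0])
    return sorted(clusters.values())


# Hypothetical blocking keys, ordered from strict to loose.
KEYS = [
    lambda r: (r[2], r[1]),     # exact last name and exact first name
    lambda r: (r[2], r[1][0]),  # exact last name and first initial
]

if __name__ == "__main__":
    print(iterative_blocking(RECORDS, KEYS))
```

In this toy data the strict key never places "John Smith" and "Jon Smith" in the same block, so they can only be compared, and merged, once the looser second key is applied. That is the sense in which relaxing the key expands the match space while each round still avoids comparing every record with every other.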

[1]  Vetle I. Torvik, et al. Has Large-Scale Named-Entity Network Analysis Been Resting on a Flawed Assumption?, 2013, PloS one.

[2]  B. Kogut, et al. Localization of Knowledge and the Mobility of Engineers in Regional Networks, 1999.

[3]  Matthew Crosby, et al. Association for the Advancement of Artificial Intelligence, 2014.

[4]  Duncan J. Watts, et al. Collective dynamics of ‘small-world’ networks, 1998, Nature.

[5]  Robert Tibshirani, et al. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Edition, 2001, Springer Series in Statistics.

[6]  William E. Winkler, et al. Data quality and record linkage techniques, 2007.

[7]  Byung-Won On, et al. Comparative study of name disambiguation problem using a scalable blocking-based framework, 2005, Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '05).

[8]  Jasjit Singh, et al. Collaborative Networks as Determinants of Knowledge Diffusion Patterns, 2005, Manag. Sci.

[9]  Matt Marx, et al. Mobility, Skills, and the Michigan Noncompete Experiment, 2008.

[10]  Thomas A. Stewart. For Strategy, the Readiness Is All, 2004.

[11]  Jean-Raymond Abrial, et al. On B, 1998, B.

[12]  Raymond J. Mooney, et al. Adaptive Blocking: Learning to Scale Up Record Linkage, 2006, Sixth International Conference on Data Mining (ICDM'06).

[13]  Kathleen M. Carley, et al. He says, she says. Pat says, Tricia says. How much reference resolution matters for entity extraction, relation extraction, and social network analysis, 2009, 2009 IEEE Symposium on Computational Intelligence for Security and Defense Applications.

[14]  Jasjit Singh, et al. Lone Inventors as Source of Breakthroughs: Myth or Reality?, 2009, Manag. Sci.

[15]  Li Tang, et al. Bibliometric fingerprints: name disambiguation based on approximate structure equivalence of cognitive maps, 2010, Scientometrics.

[16]  Neil R. Smalheiser, et al. A probabilistic similarity metric for Medline records: A model for author name disambiguation, 2005, J. Assoc. Inf. Sci. Technol.

[17]  S. Breschi, et al. Mobility of Skilled Workers and Co-Invention Networks: An Anatomy of Localized Knowledge Flows, 2009.

[18]  Michele Pezzoni, et al. How to kill inventors: testing the Massacrator© algorithm for inventor disambiguation, 2014, Scientometrics.

[19]  Philipp Cimiano, et al. A Systematic Investigation of Blocking Strategies for Real-Time Classification of Social Media Content into Events, 2012, Proceedings of the International AAAI Conference on Web and Social Media.

[20]  Nicolas Carayol, et al. Who's Who in Patents. A Bayesian approach, 2009.

[21]  Julio Raffo, et al. How to play the “Names Game”: Patent retrieval comparing different heuristics, 2009.

[22]  Lee Fleming, et al. A Network of Invention, 2004.

[23]  A. Agrawal, et al. Gone but not Forgotten: Knowledge Flows, Labor Mobility, and Enduring Social Relationships, 2006.

[24]  Manuel Trajtenberg, et al. 'Names Game': Harnessing Inventors Patent Data for Economic Research, 2006.

[25]  Phillip Bonacich, et al. Simultaneous group and individual centralities, 1991.

[26]  Trevor Hastie, et al. The Elements of Statistical Inference, 2001.

[27]  Neil R. Smalheiser, et al. Author name disambiguation in MEDLINE, 2009, TKDD.

[28]  Harry Zhang, et al. The Optimality of Naive Bayes, 2004, FLAIRS.

[29]  L. Fleming, et al. Collaborative Brokerage, Generative Creativity, and Creative Success, 2007.

[30]  David D. Lewis, et al. Naive (Bayes) at Forty: The Independence Assumption in Information Retrieval, 1998, ECML.

[31]  Radford M. Neal. Pattern Recognition and Machine Learning, 2007, Technometrics.

[32]  Irina Rish, et al. An empirical study of the naive Bayes classifier, 2001.

[33]  Neil R. Smalheiser, et al. Author name disambiguation, 2009, Annu. Rev. Inf. Sci. Technol.