Creating Powerful Indicators for Innovation Studies with Approximate Matching Algorithms. A test based on PATSTAT and Amadeus databases

The lack of firm-level data on innovative activities has always constrained the development of empirical studies on innovation. More recently, the availability of large datasets on indicators such as R&D expenditures and patents has relaxed these constraints and spurred a new wave of research. However, measuring innovation remains a difficult task, for reasons linked both to the quality of the available indicators and to the difficulty of integrating innovation indicators with other firm-level data. As regards quality, data on R&D expenditures measure an input but say little about the ‘success’ of innovative activities. Moreover, especially in the case of European firms, data on R&D expenditures are often missing because accounting and fiscal regulations in some countries do not require firms to report them. An increasing number of studies have used patent counts as a measure of inventive output. However, crude patent counts are a biased indicator of inventive output because they do not account for differences in the value of patented inventions. For this reason, innovation scholars have introduced various patent-related indicators as measures of the ‘quality’ of inventive output. Integrating these measures of inventive activity with other firm-level information, such as accounting and financial data, is another challenging task. A major problem in this field is the difficulty of harmonizing information from different data sources. This matters because inaccurate data merging and integration lead to measurement errors and biased results. An important source of measurement error is the inaccurate matching of data on innovators across different datasets. This study reports on a test of company name standardization and matching. Our test is based on two data sources: the PATSTAT patent database and the Amadeus accounting and financial dataset.
Earlier studies have mostly relied on manual, ad hoc methods. More recently, scholars have started experimenting with automatic matching techniques. This paper contributes to this body of research by comparing two approaches: the character-to-character match of standardized company names (perfect matching) and approximate matching based on string similarity functions. Our results show that approximate matching yields substantial gains over perfect matching in terms of the frequency of positive matches, with only a limited loss of precision, i.e., the rates of false matches and false negatives remain low.
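
The contrast between the two approaches can be illustrated with a minimal Python sketch. It is not the procedure used in the study: the legal-form list `LEGAL_FORMS`, the normalized Levenshtein similarity, the 0.85 threshold, and the function names are all assumptions chosen for illustration. Names are first standardized, then compared either character by character (perfect matching) or through a similarity score (approximate matching).

```python
import re

# Hypothetical list of legal-form tokens to strip; the study's own
# standardization rules are not specified here.
LEGAL_FORMS = {"LTD", "LIMITED", "INC", "CORP", "CORPORATION",
               "GMBH", "AG", "SPA", "SRL", "SA", "PLC", "CO"}


def standardize(name: str) -> str:
    """Uppercase, drop punctuation, and remove common legal-form tokens."""
    upper = name.upper().replace(".", "")       # "S.p.A." -> "SPA"
    cleaned = re.sub(r"[^A-Z0-9 ]", " ", upper)  # other punctuation -> space
    tokens = [t for t in cleaned.split() if t not in LEGAL_FORMS]
    return " ".join(tokens)


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def match(patent_name, firm_names, threshold=0.85):
    """Return (best firm name or None, score, match type).

    The threshold is an assumed value, not one taken from the study.
    """
    target = standardize(patent_name)
    best_name, best_score = None, 0.0
    for firm in firm_names:
        score = similarity(target, standardize(firm))
        if score > best_score:
            best_name, best_score = firm, score
    if best_score == 1.0:
        return best_name, best_score, "exact"
    if best_score >= threshold:
        return best_name, best_score, "approximate"
    return None, best_score, "none"


firms = ["Fiat Auto S.p.A.", "Siemens AG"]
print(match("FIAT AUTO SPA", firms))  # identical after standardization
print(match("Siemmens AG", firms))    # misspelling, found only approximately
print(match("Philips", firms))        # no sufficiently similar candidate
```

In this sketch, lowering the threshold raises the number of positive matches but also the risk of false matches, which is the trade-off the comparison above is concerned with.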
