The Impact of Spelling Errors on Patent Search

Searching patent databases is riskier than searching in other domains: a single relevant document that is overlooked during a patent search can prove expensive. While recent research focuses on specialized models and algorithms to improve the effectiveness of patent retrieval, we bring another aspect into focus: the detection and exploitation of patent inconsistencies. In particular, we analyze spelling errors in the assignee field of patents granted by the United States Patent & Trademark Office, and we introduce technology that improves retrieval effectiveness despite the presence of typographical ambiguities. To this end, we (1) quantify spelling errors in terms of edit distance and phonological dissimilarity and (2) cast error detection as a learning problem that combines word dissimilarities with patent meta-features. For the task of finding all patents of a company, our approach improves recall from 96.7% (when using a state-of-the-art patent search engine) to 99.5%, while precision is compromised by only 3.7%.
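To make the two dissimilarity notions concrete, the following is a minimal sketch, not the method used in this work. It assumes plain Levenshtein distance as the edit distance and classic Soundex as a stand-in for phonological dissimilarity; the function names, the `dissimilarity_features` helper, and the example assignee strings are hypothetical and chosen only for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


# Soundex digit classes; vowels, y, h, and w carry no digit.
_SOUNDEX = {ch: digit
            for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                                   ("l", "4"), ("mn", "5"), ("r", "6"))
            for ch in letters}


def soundex(word: str) -> str:
    """Classic four-character Soundex code (assumed stand-in for a phonetic measure)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = _SOUNDEX.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue                    # h and w do not separate equal codes
        code = _SOUNDEX.get(ch, "")
        if code and code != prev:
            result += code
        prev = code                     # a vowel resets prev, allowing a repeat
    return (result + "000")[:4]


def dissimilarity_features(a: str, b: str) -> tuple[float, int]:
    """Hypothetical feature pair: length-normalized edit distance, phonetic mismatch flag."""
    norm = levenshtein(a.lower(), b.lower()) / max(len(a), len(b), 1)
    return norm, int(soundex(a) != soundex(b))


if __name__ == "__main__":
    # Two spellings that differ by a letter swap but sound alike.
    print(dissimilarity_features("Siemens", "Seimens"))
```

In a learning setup of the kind described above, such values would be only part of the input; they would be combined with patent meta-features, which this sketch does not model.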
