Automatic categorisation applications at the European Patent Office

Abstract The first major use of natural language processing techniques at the European Patent Office (EPO) is described. It automates the initial classification of newly filed applications with sufficient accuracy to route them reliably to the examiner(s) working in the appropriate technical areas. Precision levels of the order of 80% are required. To achieve this, several matters are considered: recall levels, the problem of rarely occurring technical fields, the options for 'training material' for the software (using existing fully classified documents), the accuracy of OCR scans of the incoming applications, the use of full texts or just abstracts, and confidence levels for the results. The results are presented in terms of precision and recall at various organisational levels of the EPO, i.e. at the highest (cluster) level, at directorate level, and at technical examiner level. As another measure of applicability, confusion matrices are also presented. The authors also outline some other potential uses of categorisation and linguistic techniques within the work of the EPO, such as routing and partial classification of both patent and non-patent literature, identifying potentially relevant citations, extracting bibliographic data of patents cited in incoming applications, document-relevance ranking systems and the creation of cross-lingual dictionaries.
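As an illustration of how the evaluation measures named above relate to one another, the following minimal Python sketch computes precision, recall and a confusion matrix for routing decisions filtered by a confidence threshold. It is not the EPO system: the function name `evaluate_routing`, the category labels, the confidence scores and the threshold value are all hypothetical, and the definitions used (precision over automatically routed documents, recall over all documents) are assumptions made for the example.

```python
from collections import Counter


def evaluate_routing(gold, predicted, confidences, threshold=0.5):
    """Score routing decisions whose confidence is at or above `threshold`.

    Assumed definitions (illustrative only):
      precision = correct automatic routings / automatic routings made
      recall    = correct automatic routings / all documents
    Returns (precision, recall, confusion), where confusion maps
    (gold_label, predicted_label) pairs to counts.
    """
    confusion = Counter()
    accepted = 0
    correct = 0
    for gold_label, pred_label, conf in zip(gold, predicted, confidences):
        if conf < threshold:
            continue  # low confidence: document is left for manual routing
        accepted += 1
        confusion[(gold_label, pred_label)] += 1
        if gold_label == pred_label:
            correct += 1
    precision = correct / accepted if accepted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall, confusion


# Hypothetical data: five applications and three made-up technical areas.
gold = ["chem", "chem", "elec", "mech", "elec"]
predicted = ["chem", "elec", "elec", "mech", "mech"]
confidences = [0.9, 0.4, 0.8, 0.7, 0.3]

precision, recall, confusion = evaluate_routing(gold, predicted, confidences)
print(f"precision={precision:.2f} recall={recall:.2f}")
for (gold_label, pred_label), count in sorted(confusion.items()):
    print(f"gold={gold_label} predicted={pred_label} count={count}")
```

Raising the threshold in this sketch trades recall for precision: fewer documents are routed automatically, but a larger share of those routed are correct. Off-diagonal entries of the confusion matrix show which technical areas are mistaken for one another.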