Automated categorization in the international patent classification

A new reference collection of patent documents for training and testing automated categorization systems is established and described in detail. This collection is tailored for automating the attribution of International Patent Classification (IPC) codes to patent applications and is made publicly available for future research. We report the results of applying a variety of machine learning algorithms to the automated categorization of English-language patent documents. The task involves a complex hierarchical taxonomy, within which we classify documents into 114 classes and 451 subclasses. Several measures of categorization success are described and evaluated. We investigate how best to resolve the training problems that arise when multiple classification codes are attributed to each patent document.
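To make the class and subclass levels concrete, here is a minimal Python sketch. It relies on the standard IPC symbol layout, where the first three characters (section letter plus two digits) identify the class and the first four identify the subclass. The helper names, the sample symbols, and the "top prediction" hit rate are illustrative assumptions, not the measures or data used in the paper. The hit rate counts a document as correct when its top-ranked predicted category matches any of the codes assigned to it, which is one simple way to handle documents carrying several codes.

```python
from typing import Callable, Sequence


def ipc_class(symbol: str) -> str:
    """Class level: section letter plus two digits, e.g. 'A01'."""
    return symbol.strip().upper()[:3]


def ipc_subclass(symbol: str) -> str:
    """Subclass level: class plus one letter, e.g. 'A01B'."""
    return symbol.strip().upper()[:4]


def top_prediction_hit_rate(
    predictions: Sequence[Sequence[str]],
    assigned: Sequence[Sequence[str]],
    truncate: Callable[[str], str],
) -> float:
    """Fraction of documents whose top-ranked prediction matches any assigned code.

    Hypothetical measure for illustration; codes are compared after
    truncation to the chosen hierarchy level.
    """
    if not predictions:
        return 0.0
    hits = 0
    for ranked, codes in zip(predictions, assigned):
        gold = {truncate(code) for code in codes}
        if ranked and truncate(ranked[0]) in gold:
            hits += 1
    return hits / len(predictions)


# Made-up example: ranked predictions and assigned codes for three documents.
predictions = [
    ["A01B 1/02", "B25J 9/16"],
    ["G06F 17/30"],
    ["H04L 9/00", "G06F 21/00"],
]
assigned = [
    ["A01B 3/00"],
    ["G06N 3/08", "G06F 16/35"],  # a document with multiple codes
    ["H04W 12/06"],
]

class_rate = top_prediction_hit_rate(predictions, assigned, ipc_class)
subclass_rate = top_prediction_hit_rate(predictions, assigned, ipc_subclass)
print(f"class level: {class_rate:.2f}, subclass level: {subclass_rate:.2f}")
```

In this toy data the third document is correct at class level (H04) but wrong at subclass level (H04L versus H04W), which shows why scores typically drop as the evaluation moves deeper into the hierarchy.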
