Automatic Classification Using DDC on the Swedish Union Catalogue

As more digital collections of information resources become available, the challenge of assigning subject index terms and classes from quality knowledge organization systems grows with them. While the ultimate purpose is to understand the value of automatically produced Dewey Decimal Classification (DDC) classes for Swedish digital collections, this paper evaluates the performance of two machine learning algorithms on Swedish catalogue records from the Swedish union catalogue (LIBRIS). The algorithms are tested on the top three hierarchical levels of the DDC. On a data set of 143,838 records, the evaluation shows that a Support Vector Machine with a linear kernel outperforms the Multinomial Naive Bayes algorithm. Using keywords, or combining titles and keywords, also gives better results than using titles alone as input. Class imbalance, where many DDC classes have only a few records, greatly affects classification performance: 81.37% accuracy on the training set is achieved when at least 1,000 records per class are available, and 66.13% when few training records are available. Proposed future research involves exploring the intellectual effort put into creating the DDC, as is commonly done in string matching, to further improve algorithm performance, and testing the best approach on new digital collections that do not have DDC assigned.
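To make the "top three hierarchical levels" and the per-class record threshold concrete, the following is a minimal sketch, not the paper's own preprocessing. It assumes each record is a hypothetical `(text, ddc)` pair whose DDC number has a three-digit base, and the helper names `ddc_levels` and `filter_by_class_size` are illustrative only.

```python
from collections import Counter


def ddc_levels(ddc):
    """Return (main class, division, section) for a DDC number such as '839.738'."""
    base = ddc.split(".")[0].strip()
    if len(base) != 3 or not base.isdigit():
        raise ValueError(f"not a three-digit DDC base number: {ddc!r}")
    # Level 1 keeps the first digit, level 2 the first two, level 3 all three.
    return base[0] + "00", base[:2] + "0", base


def filter_by_class_size(records, level, min_records):
    """Keep records whose class at `level` (0, 1 or 2) has at least `min_records` members.

    `records` is an assumed list of (text, ddc) pairs, not the LIBRIS format.
    """
    counts = Counter(ddc_levels(ddc)[level] for _, ddc in records)
    return [
        (text, ddc)
        for text, ddc in records
        if counts[ddc_levels(ddc)[level]] >= min_records
    ]
```

Calling `filter_by_class_size(records, level=2, min_records=1000)` would mimic the kind of threshold the abstract mentions; the exact filtering used in the study is not stated here.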
