Automatic classification of older electronic texts into the Universal Decimal Classification-UDC

The purpose of this study is to develop a model for the automated classification of old digitised texts into the Universal Decimal Classification (UDC) using machine-learning methods.

The general research approach follows design science research: the problem of assigning UDC to old, digitised texts is addressed by developing a machine-learning classification model. A corpus of 70,000 scholarly texts, fully bibliographically processed by librarians, was used to train and test the model, which was then used to classify old texts in a corpus of 200,000 items. Human experts evaluated the performance of the model.

Results suggest that machine-learning models can correctly assign the UDC at some level for almost any scholarly text. Furthermore, the model can be recommended for the UDC assignment of older texts. Ten librarians corroborated this on 150 randomly selected texts.

The main limitations of this study were the unavailability of labelled older texts and the limited availability of librarians.

The classification model can provide recommendations to librarians during their classification work; furthermore, it can be implemented as an add-on to full-text search in library databases.

The proposed methodology supports librarians by recommending UDC classifiers, thus saving time in their daily work. By automatically classifying older texts, digital libraries can provide a better user experience by enabling structured searches. These benefits contribute to making knowledge more widely available and usable.

These findings contribute to the field of automated classification of bibliographical information using full texts, especially in cases in which the texts are old and unstructured and use archaic language and vocabulary.
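To make the recommendation idea concrete, the following is a minimal sketch of a full-text classifier that ranks top-level UDC classes for a new text. The abstract does not name the algorithm or the features used, so the multinomial Naive Bayes model, the `NaiveBayesUDC` class, the `tokenize` helper and the six toy training sentences are all illustrative assumptions, not the authors' method or data. Only the labels follow real UDC main classes (5: mathematics and natural sciences, 6: applied sciences, medicine and technology, 9: geography and history).

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase, split on whitespace, keep alphabetic tokens (hypothetical helper)."""
    return [tok for tok in text.lower().split() if tok.isalpha()]


class NaiveBayesUDC:
    """Toy multinomial Naive Bayes over bag-of-words features.

    An assumed stand-in for the study's model, which the abstract does not specify.
    """

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # documents per UDC class
        self.word_counts = defaultdict(Counter)  # token counts per UDC class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def scores(self, text):
        """Return a log-probability score for each class (Laplace smoothing)."""
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        tokens = [t for t in tokenize(text) if t in self.vocab]
        result = {}
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            n_words = sum(counts.values())
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / (n_words + vocab_size))
            result[label] = score
        return result

    def recommend(self, text, k=2):
        """Return the k best-scoring UDC classes, best first."""
        sc = self.scores(text)
        return sorted(sc, key=sc.get, reverse=True)[:k]


# Hypothetical labelled examples; labels are top-level UDC classes.
train = [
    ("the theorem follows from the integral of the function", "5"),
    ("species of plants and animals observed in the forest", "5"),
    ("treatment of patients with fever in the hospital", "6"),
    ("steam engine design and machine construction", "6"),
    ("the war and the kingdom in the medieval century", "9"),
    ("history of the town and its rulers", "9"),
]
texts, labels = zip(*train)
model = NaiveBayesUDC().fit(texts, labels)

# Rank candidate classes for an unseen text, as a suggestion for a librarian.
recommendations = model.recommend("an old treatise on the fever of patients", k=2)
print(recommendations)
```

Returning a ranked list rather than a single label mirrors the stated use of the model as a recommendation aid: the librarian still makes the final assignment.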
