Application of Natural Language Processing Algorithms to the Task of Automatic Classification of Russian Scientific Texts

This work studies the applicability of modern machine learning methods to the automatic classification of scientific articles and abstracts. To this end, artificial neural networks, random forests, logistic regression, and support vector machines were examined, taking into account a characteristic feature of scientific texts: the large number of terms specific to individual categories. The stages of data collection and text feature extraction are considered separately. The results are used in developing a decision support system that assigns scientific texts to a department code or abstract journal of the All-Russian Institute of Scientific and Technical Information of the Russian Academy of Sciences.
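To illustrate why category-specific terms matter for this kind of task, the following is a minimal sketch and not the method used in the work. It assumes TF-IDF weighting as the text feature (the abstract does not name the features actually extracted) and uses a nearest-centroid rule as a simple stand-in for the classifiers listed above. The toy English corpus, the labels, and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: whitespace tokenization; real Russian text would need
    # proper tokenization and normalization.
    return text.lower().split()


def fit_idf(docs):
    # Smoothed inverse document frequency: terms found in few documents
    # (typically category-specific terms) receive larger weights.
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def l2_normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: v / norm for t, v in vec.items()}


def tfidf(doc, idf):
    # Terms unseen during fitting are ignored.
    tf = Counter(t for t in tokenize(doc) if t in idf)
    return l2_normalize({t: c * idf[t] for t, c in tf.items()})


def fit_centroids(docs, labels, idf):
    # Sum the TF-IDF vectors of each class, then normalize the sum.
    sums = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        for t, v in tfidf(doc, idf).items():
            sums[label][t] += v
    return {label: l2_normalize(vec) for label, vec in sums.items()}


def predict(doc, idf, centroids):
    # Choose the class whose centroid has the highest cosine similarity.
    vec = tfidf(doc, idf)

    def score(label):
        centroid = centroids[label]
        return sum(v * centroid.get(t, 0.0) for t, v in vec.items())

    return max(centroids, key=score)


# Hypothetical toy data standing in for labeled scientific abstracts.
train_docs = [
    "enzyme protein synthesis in cell membrane",
    "protein folding and enzyme kinetics",
    "quantum field theory of particle scattering",
    "particle collider measures quantum states",
]
train_labels = ["biology", "biology", "physics", "physics"]

idf = fit_idf(train_docs)
centroids = fit_centroids(train_docs, train_labels, idf)

print(predict("enzyme activity in the cell", idf, centroids))
print(predict("quantum particle scattering", idf, centroids))
```

In this sketch, terms such as "enzyme" or "quantum" occur only in one category, so they dominate the similarity score. Any of the models named above could replace the nearest-centroid rule while consuming the same feature vectors.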
