Classification is required in many contexts, including the field of official statistics. In a previous study, we developed a multiclass classifier that classifies short text descriptions with high accuracy. The algorithm borrows the concept of the naive Bayes classifier and is simple enough that its structure is easy to understand. The proposed classifier has two advantages. First, the processing times for both learning and classifying are short enough for practical use. Second, it yields high-accuracy results for a large portion of a dataset. We previously developed an autocoding system for the Family Income and Expenditure Survey in Japan that has a better-performing classifier. The original system was developed in Perl to improve the efficiency of coding short Japanese texts. The proposed system is implemented in the R programming language to explore its versatility, and it has been modified so that it is easily applicable to English text descriptions, in consideration of the increasing number of R users in the field of official statistics. We plan to publish the proposed classifier as an R package. The proposed classifier should be generally applicable to other classification tasks, including coding activities in the field of official statistics, and should contribute to improving their efficiency.
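To make the naive Bayes concept mentioned above concrete, the following is a minimal sketch of a generic multinomial naive Bayes classifier for short texts. It is not the proposed classifier or its R implementation. It is written in Python for illustration only, and the function names (`train`, `classify`), the whitespace tokenisation, the Laplace smoothing, and the toy data are all assumptions introduced here.

```python
import math
from collections import Counter, defaultdict


def train(samples):
    """Count class frequencies and per-class word frequencies.

    `samples` is a list of (text, label) pairs. Tokenisation is a plain
    whitespace split, which is an assumption made for this sketch.
    """
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in samples:
        tokens = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary


def classify(text, model):
    """Return the label with the highest log posterior (Laplace smoothing)."""
    class_counts, word_counts, vocabulary = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        # Log prior of the class.
        score = math.log(count / total)
        # Smoothed denominator: words seen in the class plus vocabulary size.
        denom = sum(word_counts[label].values()) + len(vocabulary)
        for token in text.lower().split():
            score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data, invented for illustration only.
toy_samples = [
    ("fresh apples", "food"),
    ("rice and bread", "food"),
    ("bus ticket", "transport"),
    ("train fare", "transport"),
]
model = train(toy_samples)
predicted = classify("bread and apples", model)
```

Both learning and classifying in this sketch reduce to counting and summing logarithms, which is one reason classifiers of this family tend to be fast and easy to inspect.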