A Neural Network Document Classifier with Linguistic Feature Selection

This article presents a neural network document classifier with linguistic feature selection and multi-category output. The classifier consists of a feature selection unit and a hierarchical neural network classification unit. In the feature selection unit, we extract terms from the original documents by text processing and then analyze the conformity and uniformity of each term using an entropy function that measures the term's significance. Terms with high significance are selected as input features for the neural network document classifiers. To reduce the input dimension, we merge synonyms: based on the uniformity analysis, we derive a term similarity matrix by a fuzzy relation operation and use it to construct a synonym thesaurus. In the hierarchical neural network classification unit, we adopt the well-known back-propagation learning model to build the hierarchical classification units. Our experiments use a product description database from an electronic commerce company. The experimental results show that the classifier is accurate enough to assist human classifiers, and that it can save considerable manpower and working time when classifying a large database.
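
The entropy-based term selection described above can be illustrated with a minimal sketch. The abstract does not give the exact conformity and uniformity formulas, so the scoring below is an assumption: a term's significance is taken to be one minus the normalized Shannon entropy of its document counts across categories, so a term concentrated in a single category scores near 1 and a term spread evenly over all categories scores near 0. The function names (`normalized_entropy`, `term_significance`, `select_features`) and the input layout are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


def normalized_entropy(counts):
    """Shannon entropy of a count vector, scaled to [0, 1] by log(len(counts))."""
    total = sum(counts)
    # With no observations or fewer than two categories, entropy is taken as 0.
    if total == 0 or len(counts) < 2:
        return 0.0
    h = -sum((c / total) * math.log(c / total) for c in counts if c > 0)
    return h / math.log(len(counts))


def term_significance(docs, categories):
    """Score each term by how concentrated it is across categories.

    docs: list of (tokens, category) pairs (assumed input layout).
    Returns {term: score}, where 1.0 means the term occurs in one category only.
    This scoring rule is an illustrative assumption, not the paper's formula.
    """
    per_term = defaultdict(Counter)
    for tokens, category in docs:
        # Count each term at most once per document (document frequency).
        for term in set(tokens):
            per_term[term][category] += 1
    return {
        term: 1.0 - normalized_entropy([counts[c] for c in categories])
        for term, counts in per_term.items()
    }


def select_features(scores, k):
    """Return the k highest-scoring terms, breaking ties alphabetically."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]
```

Given tokenized product descriptions labeled with categories, `term_significance` produces the scores and `select_features` keeps the top `k` terms as network inputs. Synonym merging and the hierarchical back-propagation networks are outside the scope of this sketch.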