This paper compares a neural network approach with an n-gram approach to text categorization and demonstrates that the neural network approach achieves accuracy similar to the n-gram approach with much less judging time. Both methods considered here are aimed at language identification. The presence of particular characters, the presence of particular words, and statistical information on word lengths are used as the feature vector. In an identification experiment with Asian languages, the neural network approach achieved a 98% correct classification rate on 500-byte samples while being five times faster than the n-gram based approach.

Keywords: N-Gram, Neural Network, Language Identification, Text Categorization

I. INTRODUCTION

As Internet services keep growing in popularity, more and more languages are making their way online. In such a trend, there is a need for the rapid organization of ever-expanding collections of electronic documents. A well-trained librarian can easily identify the language of a book or a document, but this is not so easy online: there are so many documents in so many languages that most of them cannot be identified at a glance. Thus an automatic language identification system is needed to take over this task. Because of the sheer volume of documents to be handled, the categorization must be efficient, consuming as little storage and processing time as possible.

Text classification addresses the problem of assigning a given passage of text (or a document) to one or more predefined classes. This is an important area of information retrieval research that has been heavily investigated. Segmentation-based approaches have been compared with non-segmentation-based ones; the n-gram based approach is the most widely accepted and has been shown to perform well. As crucial as accuracy, the speed of classification is also a key factor for a classifier in a high-volume categorization environment; however, most authors provide no information on classification speed.

In the following sections, the performance of categorization algorithms using neural networks and the n-gram based approach is compared. It is demonstrated that the identification rate of the neural network approach is similar to that of the corresponding n-gram approach, but with much less judging time.
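As a rough illustration of the feature vector described above (presence of particular characters, presence of particular words, and word-length statistics), the following Python sketch shows one way such a vector could be assembled before being fed to a neural network classifier. The marker characters and words, the maximum word length, and the function name are invented placeholders for illustration and are not taken from the paper.

```python
# Minimal sketch of the kind of feature vector described in the abstract:
# character-presence flags, word-presence flags, and a word-length histogram.
from collections import Counter

# Assumed marker sets (illustrative only, not the paper's actual lists).
MARKER_CHARS = ["ä", "ß", "ñ", "の", "는"]
MARKER_WORDS = ["the", "der", "le", "el"]

def extract_features(text, max_word_len=15):
    """Return a fixed-length feature vector for one document."""
    words = text.split()

    # 1. Presence of particular characters (binary features).
    char_feats = [1.0 if ch in text else 0.0 for ch in MARKER_CHARS]

    # 2. Presence of particular words (binary features).
    word_set = {w.lower() for w in words}
    word_feats = [1.0 if w in word_set else 0.0 for w in MARKER_WORDS]

    # 3. Word-length statistics: normalized histogram of word lengths,
    #    with lengths above max_word_len collapsed into the last bin.
    lengths = Counter(min(len(w), max_word_len) for w in words)
    total = max(len(words), 1)
    length_feats = [lengths.get(n, 0) / total for n in range(1, max_word_len + 1)]

    return char_feats + word_feats + length_feats

# Example usage: the resulting vector would be the input to the neural network.
vec = extract_features("Das ist ein kurzer Beispieltext für die Klassifikation.")
print(len(vec), vec[:10])
```

Because all three feature groups are cheap to compute in a single pass over the text, a classifier operating on such a vector can keep the per-document judging time low, which is the property the comparison in this paper focuses on.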