Text classification is an important research area because it enables computers to process unstructured data intelligently. Unstructured data is a rich source of information for industry, and most of this opinion-rich data (more than 85%) is in text format. In this work we observe the effect of different machine learning algorithms, including Naive Bayes, on text data. Our main focus is on improving the classification efficiency of Naive Bayes by using its complemented version, which has lower sensitivity. The results show that combining the feature selection procedure from our previous work with these algorithms significantly improves classification efficiency and reduces over-fitting compared to the previous work.

In many cases our decisions are influenced by the opinions of others. Before internet use became widespread, many of us asked friends or neighbors for their opinion of an electronic product or a movie before buying it. With the growing availability and popularity of opinion-rich resources such as online review websites and personal blogs, new opportunities and challenges arise, as people now can, and do, actively use information technologies to seek out and understand the opinions of others. Unfortunately, 85% of these opinion-rich resources are available only in unstructured format. This has encouraged analysts to develop intelligent systems that can automatically categorize or classify such text documents.

A lot of research has been carried out, and each work follows one of two approaches. Unsupervised approaches manually derive or impose rules on the data to extract useful information. Supervised machine learning approaches use statistical models such as Naive Bayes, SVM and Bayesian networks. The approach proposed in this paper is supervised.
Here, a classifier (model) is trained on a series of text documents or reviews that have previously been categorized manually. The trained classifier is later used to categorize new (unclassified) documents. The work documented here is an extension of [2]. Previously, the emphasis was on pre-processing, i.e. converting the unstructured training documents into structured form; here the focus is mainly on classification. The movie-review dataset [1] is used for the experiments. A sequence of pre-processing steps converts the documents into a structured format, the Term-by-Document matrix (TbyD), because most machine learning algorithms are designed to work on structured rather than unstructured data. A set of four classifiers, including simple and complemented Naive Bayes, Support Vector Machines (SVM), Bayesian Networks, and Discriminative Frequency Estimate with Bayesian Networks, is selected for validation.

The remainder of this work is organized as follows. Section 2 covers related work on the problem. It is followed by the proposed approach, where the pre-processing, feature selection and classifiers employed in this work are explained briefly. Sections 4 and 5 cover the methodology and the results and observations, respectively.
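The supervised pipeline described above (labelled documents, a Term-by-Document matrix, then a simple or complemented Naive Bayes classifier) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the four toy reviews, whitespace tokenization in place of the real pre-processing, Laplace smoothing with `alpha=1.0`, and the hypothetical function names `term_by_document`, `train` and `classify`. The complemented variant here estimates each class's term weights from the documents of all other classes and picks the class whose complement matches the text worst.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real pre-processing)."""
    return text.lower().split()


def term_by_document(docs):
    """Return (vocab, matrix); matrix[i][j] counts term i in document j."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[t] for c in counts] for t in vocab]
    return vocab, matrix


def train(docs, labels, complement=False, alpha=1.0):
    """Estimate smoothed log term weights per class.

    complement=False: counts come from the class's own documents.
    complement=True:  counts come from documents of all other classes.
    """
    vocab, matrix = term_by_document(docs)
    classes = sorted(set(labels))
    weights = {}
    for c in classes:
        # Columns (documents) that feed this class's estimate.
        cols = [j for j, y in enumerate(labels) if (y != c) == complement]
        term_counts = [sum(row[j] for j in cols) for row in matrix]
        total = sum(term_counts) + alpha * len(vocab)
        weights[c] = {t: math.log((n + alpha) / total)
                      for t, n in zip(vocab, term_counts)}
    priors = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    return {"weights": weights, "priors": priors, "complement": complement}


def classify(model, text):
    """Return the highest-scoring class; terms outside the vocabulary are ignored."""
    # For the complemented model, a good fit to the complement counts against the class.
    sign = -1.0 if model["complement"] else 1.0
    tf = Counter(tokenize(text))
    scores = {}
    for c, w in model["weights"].items():
        evidence = sum(n * w[t] for t, n in tf.items() if t in w)
        scores[c] = model["priors"][c] + sign * evidence
    return max(scores, key=scores.get)


# Toy training data (invented for this sketch).
docs = [
    "great acting and a great plot",
    "boring plot and weak acting",
    "a great and moving film",
    "weak boring film",
]
labels = ["pos", "neg", "pos", "neg"]

simple_nb = train(docs, labels, complement=False)
complement_nb = train(docs, labels, complement=True)

for review in ["great film", "boring weak plot"]:
    print(review, classify(simple_nb, review), classify(complement_nb, review))
```

On data this small the two variants agree; any difference between them would only be expected to show up on larger or unevenly distributed training sets, which this sketch does not attempt to reproduce.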
[1] Nir Friedman, et al. Bayesian Network Classifiers, 1997, Machine Learning.
[2] Thorsten Joachims, et al. Text Categorization with Support Vector Machines: Learning with Many Relevant Features, 1998, ECML.
[3] Bo Pang, et al. Thumbs up? Sentiment Classification using Machine Learning Techniques, 2002, EMNLP.
[4] Ioan Pop, et al. An approach of the Naive Bayes classifier for the document classification 1, 2006.
[5] Durvasula V. L. N. Somayajulu, et al. Sentiment Classification of text reviews using novel feature selection with reduced over-fitting, 2010, 2010 International Conference for Internet Technology and Secured Transactions.
[6] Fabrizio Sebastiani, et al. Machine learning in automated text categorization, 2001, CSUR.
[7] Stan Matwin, et al. Discriminative parameter learning for Bayesian networks, 2008, ICML '08.
[8] Nigel Collier, et al. Sentiment Analysis using Support Vector Machines with Diverse Information Sources, 2004, EMNLP.
[9] Bo Pang, et al. A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts, 2004, ACL.
[10] Eric Brill, et al. Reducing the human overhead in text categorization, 2006, KDD '06.
[11] Thorsten Joachims, et al. Text categorization with support vector machines, 1999.
[12] Louise A. Francis. Taming Text: An Introduction to Text Mining, 2006.
[13] Rudy Prabowo, et al. Sentiment analysis: A combined approach, 2009, J. Informetrics.
[14] Lillian Lee, et al. Opinion Mining and Sentiment Analysis, 2008, Found. Trends Inf. Retr.
[15] David R. Karger, et al. Tackling the Poor Assumptions of Naive Bayes Text Classifiers, 2003, ICML.
[16] M. F. Porter, et al. An algorithm for suffix stripping, 1997.
[17] Frans Coenen, et al. Statistical Identification of Key Phrases for Text Classification, 2007, MLDM.
[18] Gerard Salton, et al. Term-Weighting Approaches in Automatic Text Retrieval, 1988, Inf. Process. Manag.