Sentiment Analysis in Amazon Reviews Using Probabilistic Machine Learning

Users of the online shopping site Amazon are encouraged to post reviews of the products that they purchase. Amazon makes little attempt to restrict or limit the content of these reviews. The number of reviews varies from product to product, but together the reviews provide plentiful, accessible data that is relatively easy to analyze for a range of applications. This paper seeks to apply and extend current work in natural language processing and sentiment analysis to data retrieved from Amazon. Naive Bayes and decision list classifiers are used to tag a given review as positive or negative. The number of stars a user gives a product serves as the training label for supervised machine learning. A corpus containing 50,000 product reviews from 15 products serves as the dataset of study. Top-selling and most-reviewed books on the site are the primary focus of the experiments, but the features that aid accurate classification of book reviews are compared with those most useful for classifying reviews of other media products. Feature sets such as bag-of-words and bigrams are compared with one another in their effectiveness at correctly tagging reviews. Errors in classification and general difficulties in feature selection are analyzed and discussed.
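
As a rough illustration of the approach summarized above, the following sketch trains a multinomial Naive Bayes classifier on bag-of-words and bigram features, using star ratings as labels. It is not the implementation used in the experiments. The star thresholds (4 and above positive, 2 and below negative, 3 skipped), the whitespace tokenizer, the add-one smoothing, the toy reviews, and all names (`star_label`, `extract_features`, `NaiveBayes`) are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def star_label(stars):
    """Map a star rating to a sentiment label (thresholds are an assumption)."""
    if stars >= 4:
        return "pos"
    if stars <= 2:
        return "neg"
    return None  # 3-star reviews are treated as neutral and skipped


def extract_features(text, use_bigrams=True):
    """Return bag-of-words tokens, optionally followed by adjacent-word bigrams."""
    tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
    features = list(tokens)
    if use_bigrams:
        features += [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return features


class NaiveBayes:
    """Multinomial Naive Bayes with add-alpha (Laplace) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_counts = Counter()               # reviews seen per label
        self.feature_counts = defaultdict(Counter)  # feature counts per label
        self.vocab = set()

    def train(self, reviews):
        """reviews: iterable of (text, stars) pairs."""
        for text, stars in reviews:
            label = star_label(stars)
            if label is None:
                continue
            self.class_counts[label] += 1
            for feature in extract_features(text):
                self.feature_counts[label][feature] += 1
                self.vocab.add(feature)

    def classify(self, text):
        """Return the label with the highest log posterior score."""
        total_reviews = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for label, n_reviews in self.class_counts.items():
            score = math.log(n_reviews / total_reviews)  # log prior
            total = sum(self.feature_counts[label].values())
            denom = total + self.alpha * len(self.vocab)
            for feature in extract_features(text):
                if feature not in self.vocab:
                    continue  # ignore features never seen in training
                count = self.feature_counts[label][feature]
                score += math.log((count + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, invented purely for illustration.
toy_reviews = [
    ("great book loved it", 5),
    ("wonderful story great characters", 4),
    ("terrible plot boring book", 1),
    ("boring and awful waste of money", 2),
    ("it was okay", 3),
]

model = NaiveBayes()
model.train(toy_reviews)
print(model.classify("great story"))
print(model.classify("boring book"))
```

Passing `use_bigrams=False` to `extract_features` in both training and classification restricts the model to bag-of-words features, which is one simple way to compare the two feature sets on the same data.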