Sentiment classification of online Cantonese reviews by supervised machine learning approaches

Cantonese is an important Chinese dialect spoken in parts of Southern China. Local online users often express their opinions and experiences in written Cantonese on the web. Using two supervised machine learning approaches, this paper conducts a series of experiments to explore appropriate methods for automatic sentiment classification in the very noisy domain of online reviews written in Cantonese. The findings indicate that a support vector machine classifier based on a Mandarin Chinese word segmentation tool performs surprisingly well. Overall accuracy, as well as precision and recall for both positive and negative reviews, all exceed 85% when the training corpus contains 5,000 or more reviews.
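The evaluation measures named in the abstract (accuracy, plus precision and recall reported separately for positive and negative reviews) can be illustrated with a minimal sketch. The function name `per_class_metrics`, the label strings `"pos"` and `"neg"`, and the toy labels are assumptions made for illustration. They are not taken from the paper, and the numbers they produce are not the paper's results.

```python
from collections import Counter


def per_class_metrics(gold, predicted, labels=("pos", "neg")):
    """Return overall accuracy plus precision and recall for each label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    # Count how often each (gold label, predicted label) pair occurs.
    pairs = Counter(zip(gold, predicted))

    correct = sum(n for (g, p), n in pairs.items() if g == p)
    report = {"accuracy": correct / len(gold) if gold else 0.0}

    for label in labels:
        true_positives = pairs[(label, label)]
        predicted_as_label = sum(n for (g, p), n in pairs.items() if p == label)
        actually_label = sum(n for (g, p), n in pairs.items() if g == label)
        report[label] = {
            # Precision: of the reviews predicted as `label`, how many were right.
            "precision": true_positives / predicted_as_label if predicted_as_label else 0.0,
            # Recall: of the reviews that truly are `label`, how many were found.
            "recall": true_positives / actually_label if actually_label else 0.0,
        }
    return report


# Toy labels, invented for illustration only (not data from the paper).
gold = ["pos", "pos", "neg", "neg"]
predicted = ["pos", "pos", "pos", "neg"]
report = per_class_metrics(gold, predicted)
```

Reporting precision and recall per class, rather than accuracy alone, shows whether a classifier favors one sentiment class over the other, which a single accuracy figure can hide.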
