Semantic Model Representation For Humans' Preconceived Notions In Arabic Text With Applications To Sentiment Mining

Opinion mining is becoming increasingly important with the growing availability of opinionated data on the Internet and the variety of applications it can serve. Intensive efforts have been made to develop opinion mining systems, particularly for the English language. However, opinion mining in Arabic remains challenging due to the complexity and rich morphology of the language. Previous approaches can be categorized into supervised approaches that use linguistic features to train machine learning classifiers, and unsupervised approaches that rely on sentiment lexicons. Different features have been exploited, such as surface-based, syntactic, morphological, and semantic features. However, semantic extraction remains shallow.

In this paper, we propose to go deeper into the semantics of the text when it is considered for opinion mining. We propose a model inspired by the cognitive process that humans follow to infer sentiment, in which humans rely on a database of preconceived notions developed throughout their life experiences. A key aspect of the proposed approach is the development of a semantic representation of these notions. The model combines a set of textual representations of a notion (Ti) with a corresponding sentiment indicator (Si); thus, the pair (Ti, Si) denotes the representation of a notion. Notions can be constructed at different levels of text granularity, ranging from ideas covered by single words to ideas covered by full documents, and including clauses, phrases, sentences, and paragraphs.

To demonstrate the use of this new semantic model of preconceived notions, we develop the full representation of one-word notions by including the following set of syntactic features for Ti: word surface forms, stems, and lemmas, represented by binary presence and TF-IDF. We also include morphological features such as part-of-speech tags, aspect, person, gender, mood, and number. As for the notion sentiment indicator Si, we create a new set of features that indicate the words' sentiment scores based on an internally developed Arabic sentiment lexicon called ArSenL and on a third-party lexicon called Sifaat. These features are extracted at the word level and are considered raw features. We also investigate the use of additional "engineered" features that reflect the aggregated semantics of a sentence. Such features are derived from word-level information and include the count of subjective words and the average of sentiment scores per sentence.

Experiments are conducted on a benchmark dataset collected from the Penn Arabic Treebank (PATB) and already annotated with sentiment labels. Results reveal that raw word-level features do not achieve satisfactory performance in sentiment classification. Feature reduction was also explored to evaluate the relative importance of the raw features, and the results showed low correlations between individual raw features and sentiment labels. On the other hand, the inclusion of engineered features had a significant impact on classification accuracy. The outcome of these experiments is a comprehensive set of features that reflects the one-word notion, or idea, representation in a human mind. The results from one-word notions also show promise for extending the model to higher-level context with multi-word notions.
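
To make the one-word notion representation concrete, the following is a minimal, hypothetical Python sketch (not the system described in the paper) of how a (Ti, Si) pair could be encoded and how sentence-level engineered features, such as the count of subjective words and the average sentiment scores, could be aggregated from the raw word-level features. The toy lexicon, subjectivity threshold, transliterated words, and helper names are illustrative placeholders rather than actual entries or interfaces from ArSenL or Sifaat, and the stem, lemma, and morphological fields are assumed to come from an external Arabic morphological analyzer that is not invoked here.

    # Illustrative sketch only: encodes a one-word notion as a (Ti, Si) pair
    # and aggregates sentence-level "engineered" features from raw features.
    from dataclasses import dataclass, field
    from typing import Dict, List, Tuple

    # Toy lexicon: lemma -> (positive, negative) scores; real scores would be
    # looked up in a lexicon such as ArSenL or Sifaat.
    TOY_LEXICON: Dict[str, Tuple[float, float]] = {
        "jamiyl": (0.9, 0.0),   # transliterated placeholder for a positive word
        "sayyi":  (0.0, 0.8),   # transliterated placeholder for a negative word
    }

    @dataclass
    class Notion:
        """One-word notion: textual representation Ti plus sentiment indicator Si."""
        surface: str
        stem: str
        lemma: str
        pos: str                                              # part-of-speech tag
        morph: Dict[str, str] = field(default_factory=dict)   # aspect, gender, number, ...
        pos_score: float = 0.0                                 # Si: positive score
        neg_score: float = 0.0                                 # Si: negative score

    def make_notion(surface: str, stem: str, lemma: str, pos: str,
                    morph: Dict[str, str] = None) -> Notion:
        """Build the (Ti, Si) pair for a single word using the toy lexicon."""
        p, n = TOY_LEXICON.get(lemma, (0.0, 0.0))
        return Notion(surface, stem, lemma, pos, morph or {}, p, n)

    def engineered_features(notions: List[Notion], threshold: float = 0.5) -> Dict[str, float]:
        """Aggregate raw word-level features into sentence-level features:
        count of subjective words and average sentiment scores."""
        subjective = [w for w in notions if max(w.pos_score, w.neg_score) >= threshold]
        n = len(notions) or 1
        return {
            "subjective_count": len(subjective),
            "avg_pos_score": sum(w.pos_score for w in notions) / n,
            "avg_neg_score": sum(w.neg_score for w in notions) / n,
        }

    # Example: a two-word "sentence" built from transliterated placeholders.
    sentence = [
        make_notion("jamiylun", "jamiyl", "jamiyl", "ADJ", {"gender": "m", "number": "s"}),
        make_notion("sayyi'un", "sayyi", "sayyi", "ADJ", {"gender": "m", "number": "s"}),
    ]
    print(engineered_features(sentence))

In this sketch, the raw features correspond to the per-word fields of Notion, while engineered_features mirrors the kind of sentence-level aggregation described above; the actual feature set and aggregation used in the experiments are defined in the body of the paper.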