Portuguese personal story analysis and detection in blogs

Diary-like content expressing authors' personal experiences and sentiments on a variety of topics is generated every day and made available on the Internet. This rich content can be used in several ways for psychological analysis and knowledge discovery on human-related issues. This paper presents the creation of a Brazilian Portuguese corpus of blog posts for personal story analysis and detection. We analyze psycholinguistic categories across personal story and non-story posts, discussing their similarities and differences, and study the use of these categories as classification features. We then evaluate several machine learning approaches and describe the process of applying them to identify personal stories in our dataset. Finally, we investigate the main topic-related polarity of personal narrative posts.
