A Study on Using Semantic Word Associations to Predict the Success of a Novel

Many new books are published every year, and only a fraction of them become popular among readers. Predicting a book’s success can therefore help publishers make more reliable decisions. This article studies semantic word associations for book success prediction, using word embeddings of book content for a set of Roget’s Thesaurus concepts. We describe a method that represents a book as a spectrum of concepts, based on the association score between its content embedding and a global embedding (i.e., fastText) for a set of semantically linked word clusters. We show that semantic word associations outperform previous methods for book success prediction. In addition, we show that they give better results than features such as the frequency of word groups in Roget’s Thesaurus, LIWC (a popular tool for linguistic inquiry and word count), NRC (a word-emotion association lexicon), and part of speech (PoS). Concept associations based on Roget’s Thesaurus, computed from the word embedding of each individual novel, achieve state-of-the-art performance with an average weighted F1-score of 0.89 for book success prediction. Finally, we present a set of dominant themes that contribute to the popularity of a book within a specific genre.
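
The idea of representing a book as a spectrum of concepts can be illustrated with a minimal sketch. The abstract does not define the association score, so everything below is an assumption made for illustration: the score is taken to be the difference between how tightly a concept's words cluster in the book's own embedding and how tightly they cluster in the global embedding (mean pairwise cosine similarity in each space). The function names, the toy two-dimensional vectors, and the concept labels are hypothetical and are not taken from the paper or from Roget’s Thesaurus.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_cohesion(embedding, words):
    """Mean pairwise cosine similarity among the cluster words found in `embedding`."""
    present = [w for w in words if w in embedding]
    pairs = list(combinations(present, 2))
    if not pairs:
        return 0.0
    return sum(cosine(embedding[a], embedding[b]) for a, b in pairs) / len(pairs)


def concept_spectrum(book_emb, global_emb, concepts):
    """Assumed association score per concept: book cohesion minus global cohesion."""
    return {
        name: cluster_cohesion(book_emb, words) - cluster_cohesion(global_emb, words)
        for name, words in concepts.items()
    }


# Toy data (hypothetical): word -> vector, for one book and for a global model.
book_emb = {"love": [1.0, 0.0], "affection": [1.0, 0.0],
            "war": [0.0, 1.0], "battle": [1.0, 0.0]}
global_emb = {"love": [1.0, 0.0], "affection": [0.0, 1.0],
              "war": [0.0, 1.0], "battle": [0.0, 1.0]}
concepts = {"LOVE": ["love", "affection"], "WARFARE": ["war", "battle"]}

spectrum = concept_spectrum(book_emb, global_emb, concepts)
print(spectrum)
```

Under these assumptions, the resulting dictionary is a fixed-length feature vector (one entry per concept) that could be passed to any standard classifier. Comparing similarity structure within each space, rather than vectors across spaces, sidesteps the need to align the book embedding with the global embedding; whether the paper does this is not stated in the abstract.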
