Mining contentious documents

This work proposes an unsupervised method intended to improve the quality of opinion mining in contentious text. It presents a Joint Topic Viewpoint (JTV) probabilistic model for analyzing the divergent arguing expressions that may be present in a collection of contentious documents. JTV extends the original Latent Dirichlet Allocation, which makes it domain and thesaurus independent; for example, it does not rely on WordNet coverage. JTV has the potential to automatically extract the associated terms that denote an arguing expression, according to the hidden topics it discusses and the embedded viewpoint it voices. Furthermore, JTV’s structure enables the unsupervised grouping of the obtained arguing expressions according to their viewpoints, using a constrained clustering approach. Experiments are conducted on three types of contentious documents: polls, online debates and editorials. Qualitative and quantitative analyses of the experimental results show the effectiveness of the model in handling six different contentious issues when compared to a state-of-the-art method. Moreover, the model is shown to automatically generate distinctive and informative patterns of arguing expressions. The coherence of these arguing expressions is also shown to be high when evaluated with a recently introduced automatic coherence measure.
