Mining Subjective Properties on the Web

Despite recent advances in answering Web search queries from structured data, search engines are still limited to queries with an objective answer, such as EUROPEAN CAPITALS or WOODY ALLEN MOVIES. However, many queries are subjective, such as SAFE CITIES or CUTE ANIMALS. The knowledge bases underlying search engines do not contain answers to these queries because such queries have no ground truth. We describe the Surveyor system, which mines the dominant opinion held by authors of Web content about whether a subjective property applies to a given entity. The evidence on which Surveyor relies consists of statements extracted from Web text that either support the property or claim its negation. The key challenge that Surveyor faces is that simply counting the number of positive and negative statements does not suffice, because content on the Web tends to be authored with multiple hidden biases. Surveyor therefore employs a probabilistic model of how content is authored on the Web. As one example, this model accounts for correlations between the subjective property and the frequency with which it is mentioned on the Web. The parameters of the model are specialized to each property and entity type. Surveyor was able to process a large Web snapshot within a few hours, resulting in opinions for over 4 billion entity-property combinations. We selected a subset of 500 entity-property combinations and compared our results to the dominant opinion of a large number of Amazon Mechanical Turk (AMT) workers. The predictions of Surveyor match the results from AMT in 77% of all cases (and 87% for test cases where inter-worker agreement is high), significantly outperforming competing approaches.
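To illustrate why raw counting can mislead, the following is a minimal sketch, not the model from the paper. It assumes, purely for illustration, that the numbers of positive and negative statements about an entity are independent Poisson counts whose rates depend on whether the property applies. The names `Rates`, `majority_vote`, and `posterior_applies`, and all numeric rates, are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Rates:
    """Assumed expected statement counts per entity (hypothetical values)."""
    pos: float  # expected number of positive statements
    neg: float  # expected number of negative statements


def log_poisson(k: int, lam: float) -> float:
    """Log of the Poisson probability mass function at k with rate lam > 0."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def majority_vote(pos: int, neg: int) -> bool:
    """Baseline: the property applies if positive statements outnumber negative ones."""
    return pos > neg


def posterior_applies(pos: int, neg: int, if_true: Rates, if_false: Rates,
                      prior: float = 0.5) -> float:
    """Posterior probability that the property applies, by Bayes' rule.

    Assumption: counts are independent Poisson draws whose rates depend on
    whether the property truly applies to the entity.
    """
    log_true = (math.log(prior)
                + log_poisson(pos, if_true.pos)
                + log_poisson(neg, if_true.neg))
    log_false = (math.log(1.0 - prior)
                 + log_poisson(pos, if_false.pos)
                 + log_poisson(neg, if_false.neg))
    diff = log_false - log_true
    # Numerically stable logistic function of -diff.
    if diff > 0:
        e = math.exp(-diff)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(diff))


# Hypothetical rates for one property and entity type: entities that have the
# property are mentioned often; negative statements are rare in both cases.
applies = Rates(pos=20.0, neg=0.5)
does_not_apply = Rates(pos=1.0, neg=0.3)

rarely_mentioned = posterior_applies(2, 0, applies, does_not_apply)
often_mentioned = posterior_applies(25, 1, applies, does_not_apply)
```

Under these assumed rates, an entity with two positive and no negative statements wins the majority vote, yet its posterior is low, because so few mentions are more typical of entities that lack the property. This mirrors the correlation between a property and its mention frequency described above. In practice the rates would have to be estimated per property and entity type rather than fixed by hand.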
