Understanding the Massive Online Reviews: A Novel Representative Subset Extraction Method

With the widespread use of e-commerce, the explosive growth of online reviews often overwhelms consumers and hinders their purchase decisions. To address the information overload caused by large-scale online reviews, this study focuses on the representative review extraction problem and formulates it as an optimization model with a submodularity property. To capture topic information in the original review collection, the Latent Dirichlet Allocation (LDA) topic model is adopted to measure the semantic similarity between each pair of reviews for the optimization model. Furthermore, a greedy extraction method named RR, with a satisfactory error bound, is proposed to extract a representative subset of reviews from the original collection based on the optimization model. Experiments on real data and a user study are conducted. The results demonstrate that the proposed method is efficient and scalable and outperforms benchmark methods in terms of coverage, indicating that it can help online consumers better capture the main ideas of the whole set of original reviews in a limited time.
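The pipeline summarized above (topic distributions per review, pairwise similarity, greedy selection under a submodular objective) can be illustrated with a minimal sketch. It is not the paper's RR method: the abstract does not give the objective or the similarity formula, so the sketch assumes (i) each review is already represented by a topic distribution such as LDA would produce, (ii) similarity is one minus the base-2 Jensen-Shannon divergence, and (iii) the objective is a facility-location style coverage score, which is monotone and submodular. All function names and the toy data are hypothetical.

```python
import math


def js_divergence(p, q):
    """Base-2 Jensen-Shannon divergence between two discrete distributions (in [0, 1])."""
    def kl(a, b):
        # Terms with a zero numerator contribute nothing to the sum.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def similarity(p, q):
    """Assumed similarity measure: 1 minus the Jensen-Shannon divergence."""
    return 1.0 - js_divergence(p, q)


def greedy_representatives(topic_dists, k):
    """Greedily pick up to k reviews maximizing an assumed coverage objective.

    coverage(S) = sum over reviews i of max over j in S of similarity(i, j).
    Returns the selected indices and the coverage score they achieve.
    """
    n = len(topic_dists)
    sim = [[similarity(p, q) for q in topic_dists] for p in topic_dists]
    selected = []
    best_cover = [0.0] * n  # how well each review is covered so far
    for _ in range(min(k, n)):
        best_gain, best_j = -1.0, None
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain of adding review j to the current subset.
            gain = sum(max(sim[i][j] - best_cover[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_gain, best_j = gain, j
        selected.append(best_j)
        best_cover = [max(best_cover[i], sim[i][best_j]) for i in range(n)]
    return selected, sum(best_cover)


# Toy topic distributions over three topics (illustrative values, not real data).
reviews = [
    [0.90, 0.05, 0.05],
    [0.85, 0.10, 0.05],
    [0.05, 0.90, 0.05],
    [0.10, 0.85, 0.05],
    [0.05, 0.05, 0.90],
]
chosen, score = greedy_representatives(reviews, k=2)
print(chosen, round(score, 3))
```

Under these assumptions the objective is monotone submodular, so the classical greedy analysis gives a solution within a factor of 1 - 1/e of the optimum for a fixed subset size; whether RR uses this objective or this bound is not stated in the abstract.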
