Application of a Probability-Based Algorithm to Extraction of Product Features from Online Reviews

Prior research has demonstrated the viability of automatically extracting product features from online reviews. In this paper, I present a probability-based algorithm and compare it to an existing support-based approach. Specifically, I used each algorithm to extract features from seven Amazon.com product categories and then asked end users to rate how helpful the features were for choosing products. The end users preferred the features identified by the probability-based algorithm. The algorithm can identify features that comprise either a single noun or two successive nouns, and end users rated the two-noun features as more helpful than the single-noun ones. Even for collections of tens of thousands of reviews, it runs fast enough for practical use, at around 1 ms per review.
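The abstract does not spell out either algorithm, so the following is only a minimal sketch of the general idea, not the method evaluated in the paper. It assumes reviews arrive already part-of-speech tagged as lists of (word, tag) pairs with a hypothetical "NOUN" tag. It collects single nouns and pairs of successive nouns as candidate features, ranks them by support (the fraction of reviews mentioning them), and, as one possible probability-based alternative, scores how surprising an observed count is under an assumed Poisson background rate. The function names, the tag set, and the Poisson model are all illustrative assumptions.

```python
import math
from collections import Counter


def candidates(tagged_review):
    """Return the single nouns and successive-noun pairs found in one review.

    tagged_review is a list of (word, tag) pairs; the "NOUN" tag is an
    assumption of this sketch, not something the paper specifies.
    """
    found = set()
    for i, (word, tag) in enumerate(tagged_review):
        if tag != "NOUN":
            continue
        found.add(word.lower())
        # Two successive nouns form a two-word candidate feature.
        if i + 1 < len(tagged_review) and tagged_review[i + 1][1] == "NOUN":
            found.add(word.lower() + " " + tagged_review[i + 1][0].lower())
    return found


def support(tagged_reviews):
    """Fraction of reviews that mention each candidate (support-style ranking)."""
    counts = Counter()
    for review in tagged_reviews:
        counts.update(candidates(review))
    total = len(tagged_reviews)
    return {cand: n / total for cand, n in counts.items()}


def poisson_surprise(observed, expected):
    """Return -log P(X >= observed) for X ~ Poisson(expected).

    Larger values mean the candidate occurs more often than the assumed
    background rate would predict. The Poisson model is an illustrative
    assumption, not the paper's stated formula.
    """
    term = math.exp(-expected)  # P(X = 0)
    cdf = 0.0
    for i in range(observed):
        cdf += term
        term *= expected / (i + 1)  # step from P(X = i) to P(X = i + 1)
    return -math.log(max(1.0 - cdf, 1e-300))


reviews = [
    [("The", "DET"), ("battery", "NOUN"), ("life", "NOUN"),
     ("is", "VERB"), ("great", "ADJ")],
    [("Battery", "NOUN"), ("died", "VERB")],
]
scores = support(reviews)
```

Here `scores` maps both "battery" and "battery life" to their support, so the two-noun candidate is available alongside the single nouns. A probability-style ranking would instead compare each candidate's observed count with an expected count taken from some general-English frequency list, which this sketch leaves to the caller.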
