The concept of the crowd: mining perceptual attributes from rating data on the social web

With its huge amount of information, the World Wide Web mirrors all aspects of our everyday life. In recent years, various efforts have been undertaken to refine this unstructured information into a structured form. In particular, this would enable relational database systems to process the information available on the Web in a well-proven fashion, allowing database users to use this information efficiently. Particularly challenging is the discovery of so-called perceptual concepts. In contrast to “hard” facts, which currently are extracted mostly by analyzing textual information, perceptual concepts primarily concern the common perception of people. This perception is characterized by the fact that many properties of real-world objects cannot easily be described in explicit form. Examples of perceptual concepts are the “sportiness” of a car, the “suspense” of a movie, and the “creativity” of a restaurant. This cumulative doctoral thesis develops one of the first approaches to extracting an entity’s perceptual properties from Social Web data. The thesis focuses on ratings of the type “user X rates item Y a Z out of 10,” which can now be found on a variety of Web sites. Through an extensive analysis of users’ rating behavior, the thesis demonstrates that groups of objects that users perceive as similar with respect to one or more perceptual properties can be identified automatically. By linking the resulting perceptual spaces (abstract models describing perceptual similarity) with external reference information, detailed structured descriptions of objects can be derived automatically and then used in database systems. This work investigates the proposed methods for analyzing rating data in detail and derives innovative application scenarios from them.
Particularly important is the combination of the methods developed with a novel technique from the area of crowdsourcing. Among other things, it is shown that the recently introduced crowd-enabled databases can benefit massively from the approach developed in this thesis: the application of crowdsourcing becomes easier, data quality increases, and costs are reduced significantly. In total, this cumulative doctoral thesis is based on eight peer-reviewed publications.
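
The idea that perceptual similarity can be read off ratings of the type “user X rates item Y a Z out of 10” can be illustrated with a minimal sketch. It is not the method developed in the thesis: it assumes a simple stand-in technique (cosine similarity between items after subtracting each user's mean rating), and all user names, item names, and ratings are hypothetical toy data.

```python
from math import sqrt

# Hypothetical toy data (not from the thesis): (user, item, rating out of 10).
RATINGS = [
    ("user_1", "thriller_a", 9), ("user_1", "thriller_b", 8), ("user_1", "comedy_c", 3),
    ("user_2", "thriller_a", 8), ("user_2", "thriller_b", 9), ("user_2", "comedy_c", 2),
    ("user_3", "thriller_a", 2), ("user_3", "thriller_b", 3), ("user_3", "comedy_c", 9),
]


def centered_item_vectors(ratings):
    """Map each item to {user: rating minus that user's mean rating}."""
    by_user = {}
    for user, item, score in ratings:
        by_user.setdefault(user, {})[item] = score
    vectors = {}
    for user, scores in by_user.items():
        mean = sum(scores.values()) / len(scores)
        for item, score in scores.items():
            vectors.setdefault(item, {})[user] = score - mean
    return vectors


def similarity(a, b):
    """Cosine similarity over users who rated both items (0.0 if undefined)."""
    shared = a.keys() & b.keys()
    dot = sum(a[u] * b[u] for u in shared)
    norm_a = sqrt(sum(a[u] ** 2 for u in shared))
    norm_b = sqrt(sum(b[u] ** 2 for u in shared))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


vectors = centered_item_vectors(RATINGS)
sim_thrillers = similarity(vectors["thriller_a"], vectors["thriller_b"])
sim_mixed = similarity(vectors["thriller_a"], vectors["comedy_c"])
print(f"thriller_a vs thriller_b: {sim_thrillers:.2f}")
print(f"thriller_a vs comedy_c:   {sim_mixed:.2f}")
```

In this toy data the two thrillers are rated alike by every user, so their similarity is clearly positive, while the thriller and the comedy receive opposing ratings and end up with a negative similarity. Real rating data is far sparser and noisier, so a more robust model would be needed in practice.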