Internet-scale collection of human-reviewed data

Enterprise and web data processing and content aggregation systems often require extensive use of human-reviewed data (e.g. for training and monitoring machine learning-based applications). Today these needs are often met by in-house efforts or out-sourced offshore contracting. Emerging applications attempt to provide automated collection of human-reviewed data at Internet-scale. We conduct extensive experiments to study the effectiveness of one such application. We also study the feasibility of using Yahoo! Answers, a general question-answering forum, for human-reviewed data collection.

[1]  David Heckerman,et al.  Empirical Analysis of Predictive Algorithms for Collaborative Filtering , 1998, UAI.

[2]  Andrew McCallum,et al.  Employing EM and Pool-Based Active Learning for Text Classification , 1998, ICML.

[3]  Kamal Nigamyknigam,et al.  Employing Em in Pool-based Active Learning for Text Classiication , 1998 .

[4]  Shlomo Argamon,et al.  Committee-Based Sample Selection for Probabilistic Classifiers , 1999, J. Artif. Intell. Res..

[5]  Erhard Rahm,et al.  Data Cleaning: Problems and Current Approaches , 2000, IEEE Data Eng. Bull..

[6]  Ulrich Güntzer,et al.  Data Quality Mining - Making a Virute of Necessity , 2001, DMKD.

[7]  Dennis Shasha,et al.  Declarative Data Cleaning: Language, Model, and Algorithms , 2001, VLDB.

[8]  Anuradha Bhamidipaty,et al.  Interactive deduplication using active learning , 2002, KDD.

[9]  James Surowiecki The wisdom of crowds: Why the many are smarter than the few and how collective wisdom shapes business, economies, societies, and nations Doubleday Books. , 2004 .

[10]  Laura A. Dabbish,et al.  Labeling images with a computer game , 2004, AAAI Spring Symposium: Knowledge Collection from Volunteer Contributors.

[11]  David A. Cohn,et al.  Improving generalization with active learning , 1994, Machine Learning.

[12]  Craig Gentry,et al.  Secure distributed human computation , 2005, EC '05.

[13]  Jennifer Widom,et al.  Generic Entity Resolution in the SERF Project , 2006, IEEE Data Eng. Bull..

[14]  Luis von Ahn Games with a Purpose , 2006, Computer.

[15]  Edmund A. Mennis The Wisdom of Crowds: Why the Many Are Smarter than the Few and How Collective Wisdom Shapes Business, Economies, Societies, and Nations , 2006 .

[16]  Manuel Blum,et al.  Verbosity: a game for collecting common-sense facts , 2006, CHI.

[17]  Manuel Blum,et al.  Peekaboom: a game for locating objects in images , 2006, CHI.