Collecting paraphrase corpora from volunteer contributors

Extensive and deep paraphrase corpora are important for a variety of natural language processing and user interaction tasks. In this paper, we present an approach that (i) collects multiple paraphrases per item from volunteers and (ii) gives those volunteers an incentive to contribute responsibly. We solicit paraphrases from Web volunteers in two ways: collecting new paraphrases without prompting, and asking contributors to guess partially obfuscated paraphrases. To test the approach, we implemented an online game, 1001 Paraphrases (http://ai-games.org/paraphrase.html), and deployed it to collect 20,944 entries focused on paraphrases of 400 statements. The approach complements existing text extraction methods and has some inherent advantages of its own. We present and motivate our design and share preliminary observations and lessons learned about the performance of the approach.
