Word Clouds for Efficient Document Labeling

In text classification, the amount and quality of training data are crucial to the performance of the classifier. Training data is produced by human labelers, which is tedious and time-consuming work. We propose using condensed representations of text documents instead of the full text to reduce the labeling time per document. These condensed representations, key sentences and key phrases, can be generated in a fully unsupervised way. The key phrases are presented in a layout similar to a tag cloud. In a user study with 37 participants, we evaluated whether human labelers can label documents faster and equally accurately when given these condensed representations. Our evaluation shows that users labeled word clouds twice as fast as full-text documents and just as accurately. While further investigation of other classification tasks is necessary, this finding could reduce the cost of labeling text documents.
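To make the idea of an unsupervised condensed representation concrete, the following is a minimal sketch, not the method evaluated in the study. It assumes plain term frequency as the key-term score and a linear mapping from score to font size, as a tag cloud layout might use. The function names `key_terms` and `font_sizes`, the small stopword list, and the point-size range are hypothetical choices made for illustration only.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real systems use larger lists).
STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "for", "on", "with", "are", "by"}


def key_terms(text, top_k=5):
    """Return the top_k most frequent non-stopword terms with their counts."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    return counts.most_common(top_k)


def font_sizes(terms, min_pt=10, max_pt=32):
    """Map term counts linearly onto a font-size range, as a tag cloud might."""
    if not terms:
        return {}
    hi = max(count for _, count in terms)
    lo = min(count for _, count in terms)
    span = (hi - lo) or 1  # avoid division by zero when all counts are equal
    return {
        term: round(min_pt + (count - lo) / span * (max_pt - min_pt))
        for term, count in terms
    }


if __name__ == "__main__":
    sample = (
        "Active learning reduces labeling cost. Labeling documents is tedious; "
        "labeling word clouds is faster."
    )
    terms = key_terms(sample, top_k=3)
    print(font_sizes(terms))
```

A labeler would then see only the few highest-scoring terms, sized by importance, rather than the whole document. Graph-based rankers such as TextRank could be swapped in for the frequency score without changing the font-size mapping.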
