There’s no Data like More Data? Revisiting the Impact of Data Size on a Classification Task

In the paper we investigate the impact of data size on a Word Sense Disambiguation task (WSD). We question the assumption that the knowledge acquisition bottleneck, which is known as one of the major challenges for WSD, can be solved by simply obtaining more and more training data. Our case study on 1,000 manually annotated instances of the German verb ""drohen"" (threaten) shows that the best performance is not obtained when training on the full data set, but by carefully selecting new training instances with regard to their informativeness for the learning process (Active Learning). We present a thorough evaluation of the impact of different sampling methods on the data sets and propose an improved method for uncertainty sampling which dynamically adapts the selection of new instances to the learning progress of the classifier, resulting in more robust results during the initial stages of learning. A qualitative error analysis identifies problems for automatic WSD and discusses the reasons for the great gap in performance between human annotators and our automatic WSD system.

[1]  JapkowiczNathalie,et al.  The class imbalance problem: A systematic study , 2002 .

[2]  William A. Gale,et al.  A sequential algorithm for training text classifiers , 1994, SIGIR '94.

[3]  Claudia Kunze,et al.  GermaNet - representation, visualization, application , 2002, LREC.

[4]  Sandra Kübler The PaGe 2008 Shared Task on Parsing German , 2008 .

[5]  Dan Klein,et al.  Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network , 2003, NAACL.

[6]  David A. Cohn,et al.  Active Learning with Statistical Models , 1996, NIPS.

[7]  Martha Palmer,et al.  An Empirical Study of the Behavior of Active Learning for Word Sense Disambiguation , 2006, NAACL.

[8]  Dan Klein,et al.  Accurate Unlexicalized Parsing , 2003, ACL.

[9]  Julio Gonzalo,et al.  Automatic Acquisition of Lexical Information and Examples , 2007 .

[10]  John B. Lowe,et al.  The Berkeley FrameNet Project , 1998, ACL.

[11]  Martha Palmer,et al.  Investigations into the role of lexical semantics in word sense disambiguation , 2004 .

[12]  Dan Klein,et al.  Improved Inference for Unlexicalized Parsing , 2007, NAACL.

[13]  Martha Palmer,et al.  Towards Robust High Performance Word Sense Disambiguation of English Verbs Using Rich Linguistic Features , 2005, IJCNLP.

[14]  Joakim Nivre,et al.  MaltParser: A Data-Driven Parser-Generator for Dependency Parsing , 2006, LREC.

[15]  Jingbo Zhu,et al.  Active Learning for Word Sense Disambiguation with Methods for Addressing the Class Imbalance Problem , 2007, EMNLP.

[16]  K. Vijay-Shanker,et al.  A Method for Stopping Active Learning Based on Stabilizing Predictions and the Need for User-Adjustable Stopping , 2009, CoNLL.

[17]  Nathalie Japkowicz,et al.  The class imbalance problem: A systematic study , 2002, Intell. Data Anal..

[18]  Ines Rehbein Josef Ruppenhofer Jonas Sunde MaJo-A Toolkit for Supervised Word Sense Disambiguation and Active Learning , 2009 .

[19]  Hwee Tou Ng,et al.  Domain Adaptation with Active Learning for Word Sense Disambiguation , 2007, ACL.

[20]  C. Lee Giles,et al.  Learning on the border: active learning in imbalanced data classification , 2007, CIKM '07.

[21]  David Yarowsky,et al.  Unsupervised Word Sense Disambiguation Rivaling Supervised Methods , 1995, ACL.

[22]  Katrin Erk,et al.  The SALSA Corpus: a German Corpus Resource for Lexical Semantics , 2006, LREC.

[23]  Andreas Vlachos,et al.  A stopping criterion for active learning , 2008, Computer Speech and Language.