Active Learning for Word Sense Disambiguation with Methods for Addressing the Class Imbalance Problem

In this paper, we analyze the effect of resampling techniques, including under-sampling and over-sampling, when used in active learning for word sense disambiguation (WSD). Experimental results show that under-sampling has a negative effect on active learning, whereas over-sampling is a relatively good choice. To alleviate the within-class imbalance problem of over-sampling, we propose a bootstrap-based over-sampling (BootOS) method that works better than ordinary over-sampling in active learning for WSD. Finally, we investigate when to stop active learning and adopt two strategies, max-confidence and min-error, as stopping conditions. Based on the experimental results, we suggest a prediction solution that treats max-confidence as the upper bound and min-error as the lower bound for the stopping conditions.
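
As a rough illustration of the ideas named above, the following Python sketch shows ordinary random over-sampling and one plausible reading of the two stopping conditions. The abstract does not define these procedures, so everything here is an assumption: the function names (`random_oversample`, `max_confidence_reached`, `min_error_reached`), the use of entropy as the uncertainty measure, and the threshold values are all hypothetical. BootOS itself is not shown, because its procedure is not described in this text.

```python
import math
import random
from collections import Counter


def random_oversample(examples, labels, seed=0):
    """Ordinary over-sampling: duplicate randomly chosen examples of each
    smaller class until every class is as large as the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(examples), list(labels)
    for cls, n in counts.items():
        members = [x for x, y in zip(examples, labels) if y == cls]
        for _ in range(target - n):
            out_x.append(rng.choice(members))
            out_y.append(cls)
    return out_x, out_y


def entropy(dist):
    """Shannon entropy (natural log) of a predicted sense distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)


def max_confidence_reached(pool_dists, threshold=0.01):
    """Assumed reading of max-confidence: stop once the classifier is
    confident (low entropy) on every remaining unlabeled example."""
    return all(entropy(d) <= threshold for d in pool_dists)


def min_error_reached(predicted, oracle, threshold=0.9):
    """Assumed reading of min-error: stop once the classifier's predictions
    on the newly selected batch agree with the annotator's labels often
    enough. The threshold is illustrative only."""
    if not predicted:
        return False
    correct = sum(p == o for p, o in zip(predicted, oracle))
    return correct / len(predicted) >= threshold
```

In an active learning loop, one would call `random_oversample` on the labeled set before retraining at each iteration and check both conditions afterwards. Under the reading sketched here, min-error would typically be satisfied earlier and max-confidence later, which is consistent with treating them as lower and upper bounds on when to stop.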
