Top K representative: a method to select representative samples based on K nearest neighbors

Short text categorization relies on supervised learning, which requires a large amount of labeled training data and therefore consumes considerable human labor. Active learning reduces the number of manually labeled samples needed in traditional supervised learning by selecting the most representative samples to stand in for the entire training set. Uncertainty sampling is a common active learning strategy, but it is easily affected by outliers. In this paper, a new sampling method called Top K representative (TKR) is proposed to address the problem caused by outliers. However, the TKR optimization problem is NP-hard (nondeterministic polynomial-time hard), making exact solutions difficult to obtain. To tackle this, we propose a greedy algorithm that computes approximate solutions and thereby achieves high performance. Experiments show that the proposed sampling method outperforms existing methods in terms of efficiency.
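To illustrate the kind of greedy, K-nearest-neighbor-based selection the abstract describes, the sketch below picks representatives by repeatedly choosing the sample whose neighbor set covers the most not-yet-covered training points. This is a minimal sketch under stated assumptions, not the paper's exact algorithm: the function name, the use of scikit-learn's NearestNeighbors, and the coverage-count objective are illustrative choices.

```python
# A minimal sketch of greedy, KNN-based representative selection.
# Assumptions (not from the paper): scikit-learn's NearestNeighbors builds the
# neighbor graph, and a coverage-count objective drives the greedy loop.
import numpy as np
from sklearn.neighbors import NearestNeighbors


def top_k_representatives(X, k_repr=10, n_neighbors=5):
    """Greedily pick k_repr samples whose K-nearest-neighbor sets
    cover as many distinct training samples as possible."""
    nn = NearestNeighbors(n_neighbors=n_neighbors).fit(X)
    _, neighbor_idx = nn.kneighbors(X)                 # shape: (n_samples, n_neighbors)
    neighbor_sets = [set(row) for row in neighbor_idx]

    covered, selected = set(), []
    for _ in range(k_repr):
        # Choose the sample whose neighbors add the most new coverage.
        best, best_gain = None, -1
        for i, nbrs in enumerate(neighbor_sets):
            if i in selected:
                continue
            gain = len(nbrs - covered)
            if gain > best_gain:
                best, best_gain = i, gain
        selected.append(best)
        covered |= neighbor_sets[best]
    return selected


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 20))                     # toy feature matrix
    print(top_k_representatives(X, k_repr=5))
```

Greedy maximization of a coverage objective of this kind is a standard way to approximate such NP-hard subset-selection problems, which is consistent with the approximation strategy the abstract outlines.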
