Using Seed Words to Learn to Categorize Chinese Text

In this paper, we focus on text categorization using unsupervised learning techniques that do not require labeled data. We propose a feature learning bootstrapping algorithm (FLB) that starts from a small number of seed words and automatically learns features for each category from a large amount of unlabeled documents. Using these learned features, we develop a new Naive Bayes classifier named NB_FLB. Experimental results show that the NB_FLB classifier performs better than other Naive Bayes classifiers trained by supervised learning when the number of features is small.
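The abstract does not spell out the FLB procedure, so the following is only a minimal sketch of the general idea (seed words, bootstrapped feature growth, then a Naive Bayes classifier over the learned features), not the authors' algorithm. The function names, the scoring rule (count inside a category minus count in the others), the `rounds` and `grow` parameters, and the toy pre-segmented documents are all assumptions made for illustration.

```python
import math
from collections import Counter

def label_by_features(doc, features):
    """Return the category with the most feature hits in doc, or None on a tie or no hit."""
    hits = {c: sum(1 for w in doc if w in ws) for c, ws in features.items()}
    ranked = sorted(hits.items(), key=lambda kv: kv[1], reverse=True)
    if ranked[0][1] == 0 or (len(ranked) > 1 and ranked[0][1] == ranked[1][1]):
        return None
    return ranked[0][0]

def bootstrap_features(docs, seeds, rounds=3, grow=2):
    """Grow each category's feature set from its seed words (assumed scoring rule)."""
    features = {c: set(ws) for c, ws in seeds.items()}
    for _ in range(rounds):
        # Pseudo-label the unlabeled documents with the current features.
        labels = [label_by_features(d, features) for d in docs]
        counts = {c: Counter() for c in features}
        for d, y in zip(docs, labels):
            if y is not None:
                counts[y].update(d)
        taken = set().union(*features.values())
        for c in features:
            scores = {}
            for w, n in counts[c].items():
                other = sum(counts[o][w] for o in features if o != c)
                if w not in taken and n > other:
                    scores[w] = n - other
            # Add the `grow` best new words; ties are broken alphabetically.
            features[c].update(sorted(scores, key=lambda w: (-scores[w], w))[:grow])
    return features

def train_nb(docs, features):
    """Multinomial Naive Bayes over the learned features, trained on pseudo-labels."""
    vocab = set().union(*features.values())
    labels = [label_by_features(d, features) for d in docs]
    doc_counts = Counter(y for y in labels if y is not None)
    word_counts = {c: Counter() for c in features}
    for d, y in zip(docs, labels):
        if y is not None:
            word_counts[y].update(w for w in d if w in vocab)
    total_docs = sum(doc_counts.values())
    model = {}
    for c in features:
        total = sum(word_counts[c].values())
        # Laplace (add-one) smoothing for both the prior and the word likelihoods.
        log_prior = math.log((doc_counts[c] + 1) / (total_docs + len(features)))
        log_like = {w: math.log((word_counts[c][w] + 1) / (total + len(vocab)))
                    for w in vocab}
        model[c] = (log_prior, log_like)
    return model

def classify(doc, model):
    """Pick the category with the highest log posterior; unknown words are ignored."""
    def score(c):
        log_prior, log_like = model[c]
        return log_prior + sum(log_like[w] for w in doc if w in log_like)
    return max(model, key=score)

# Toy, already-segmented documents (hypothetical data).
docs = [
    ["football", "match", "goal", "team"],
    ["team", "goal", "coach", "match"],
    ["stock", "market", "bank", "shares"],
    ["bank", "shares", "market", "profit"],
    ["coach", "team", "football"],
    ["profit", "stock", "bank"],
]
seeds = {"sports": {"football"}, "finance": {"stock"}}
features = bootstrap_features(docs, seeds)
model = train_nb(docs, features)
```

In this sketch, documents that contain no seed word in the first round are pseudo-labeled only after related words have been added to the feature sets, which is the effect that bootstrapping from seed words is meant to achieve. Chinese text would first need word segmentation; the sketch assumes that step has already been done.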
