Iterative cross‐training: An algorithm for learning from unlabeled Web pages

This article presents a new learning method, iterative cross‐training (ICT), for classifying Web pages, and evaluates it on three classification problems: (1) Thai/non‐Thai Web pages, (2) course/non‐course home pages, and (3) university‐related Web pages. Given domain knowledge or a small set of labeled data, our method combines two classifiers that iteratively train each other, making effective use of unlabeled examples. We compare ICT against three other learning methods: a supervised word segmentation classifier, a supervised naïve Bayes classifier, and a co‐training‐style classifier. The experimental results on all three problems show that ICT outperforms the other classifiers. One advantage of ICT is that it needs only a small set of prelabeled data, or no prelabeled data at all when domain knowledge is available. © 2004 Wiley Periodicals, Inc.
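
The idea of two classifiers training each other on unlabeled pages can be illustrated with a small sketch. This is not the authors' implementation; it only assumes the general loop described in the abstract. The names `TinyNaiveBayes`, `cross_train`, `seed_rule`, `words`, and `char_trigrams`, the choice of features, and the toy pages are all hypothetical. A hand-written rule stands in for domain knowledge and labels one unlabeled pool; the second classifier learns from those labels and labels the other pool; the first classifier learns from that and relabels the first pool; and so on.

```python
import math
from collections import Counter


class TinyNaiveBayes:
    """Multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self, tokenize):
        self.tokenize = tokenize  # maps a page string to a list of features

    def fit(self, pages, labels):
        # Counts are rebuilt from scratch on every call, so refitting
        # replaces the previous model rather than accumulating into it.
        self.class_counts = Counter(labels)
        self.feature_counts = {c: Counter() for c in self.class_counts}
        for page, label in zip(pages, labels):
            self.feature_counts[label].update(self.tokenize(page))
        self.vocab = set()
        for counts in self.feature_counts.values():
            self.vocab.update(counts)
        return self

    def predict(self, page):
        n_pages = sum(self.class_counts.values())
        scores = {}
        for c, n_c in self.class_counts.items():
            total = sum(self.feature_counts[c].values())
            score = math.log(n_c / n_pages)  # log prior
            for f in self.tokenize(page):
                count = self.feature_counts[c][f]
                score += math.log((count + 1) / (total + len(self.vocab)))
            scores[c] = score
        return max(scores, key=scores.get)


def words(page):
    return page.lower().split()


def char_trigrams(page):
    text = page.lower()
    return [text[i:i + 3] for i in range(len(text) - 2)]


def cross_train(seed_rule, pool_a, pool_b, rounds=3):
    """Alternate training between two classifiers on unlabeled pools.

    seed_rule is a hypothetical stand-in for domain knowledge: it gives
    the initial (possibly noisy) labels for pool_b.
    """
    clf_a = TinyNaiveBayes(words)          # assumed feature choice
    clf_b = TinyNaiveBayes(char_trigrams)  # assumed feature choice
    labels_b = [seed_rule(p) for p in pool_b]
    for _ in range(rounds):
        clf_b.fit(pool_b, labels_b)                    # B learns from current labels
        labels_a = [clf_b.predict(p) for p in pool_a]  # B labels A's pool
        clf_a.fit(pool_a, labels_a)                    # A learns from B's labels
        labels_b = [clf_a.predict(p) for p in pool_b]  # A relabels B's pool
    return clf_a, clf_b


# Toy unlabeled pools (made-up pages, not data from the article).
pool_a = [
    "course syllabus homework lecture",
    "lecture notes exam schedule",
    "faculty research publications grants",
    "research lab publications members",
]
pool_b = [
    "course homework exam syllabus",
    "syllabus lecture homework grading",
    "research grants faculty members",
    "lab publications research projects",
]


def seed_rule(page):
    # Hypothetical domain knowledge: course pages mention a syllabus or homework.
    return "course" if "syllabus" in page or "homework" in page else "other"


clf_a, clf_b = cross_train(seed_rule, pool_a, pool_b)
print(clf_a.predict("homework and exam schedule for this course"))
```

No hand-labeled page is used here: the only supervision is `seed_rule`, which mirrors the abstract's claim that prelabeled data can be replaced by domain knowledge. Where a small labeled set is available instead, it could supply the initial labels in place of the rule.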
