Automatic Generation of Background Text to Aid Classification

We show that Web searches can often be used to generate background text for text classification, because the World Wide Web frequently contains many pages relevant to a given classification task. Automatically creating a secondary corpus of unlabeled but related documents in this way can reduce error rates in text categorization problems. Furthermore, if the test corpus is known, this background corpus can be tailored to the particular categorization problem in a transductive approach. Our system uses WHIRL, a tool that combines database functionality with techniques from the information retrieval literature. The method is especially useful when training examples are few, or when obtaining them is expensive or difficult.
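
As a rough illustration of how unlabeled background text can link a test document to training examples with which it shares no words, the sketch below scores each label by direct TF-IDF cosine similarity plus similarity routed through background documents. It is not WHIRL and does not reproduce its query language. The function names (`tfidf_vectors`, `cosine`, `classify`), the weighting scheme, and the toy data are all assumptions made for illustration, and the hard-coded `background` list stands in for pages that a Web search would return.

```python
import math
from collections import Counter, defaultdict


def tfidf_vectors(docs):
    """Return one L2-normalised TF-IDF vector (a dict) per document."""
    tokenized = [d.lower().split() for d in docs]
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = {t: c * math.log(1 + n / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two already-normalised sparse vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def classify(test_doc, train, background):
    """Score labels by direct similarity plus similarity bridged via background text.

    train is a list of (text, label) pairs; background is a list of
    unlabeled texts (assumed here to come from a Web search).
    """
    texts = [text for text, _ in train] + background + [test_doc]
    vecs = tfidf_vectors(texts)
    train_vecs = vecs[:len(train)]
    bg_vecs = vecs[len(train):-1]
    test_vec = vecs[-1]

    scores = defaultdict(float)
    for (_, label), tv in zip(train, train_vecs):
        direct = cosine(tv, test_vec)
        # A background document contributes only if it resembles both
        # the training example and the test document.
        bridged = sum(cosine(tv, bv) * cosine(bv, test_vec) for bv in bg_vecs)
        scores[label] += direct + bridged
    return max(scores, key=scores.get), dict(scores)


# Toy data: the test document shares no words with either training example.
train = [
    ("stock market shares", "finance"),
    ("football match goal", "sports"),
]
background = [
    "the striker scored a goal from a penalty in the football match",
    "shares fell as the stock market reacted to interest rates",
]
label, scores = classify("striker penalty", train, background)
print(label, scores)
```

In this toy setup the direct similarity between the test document and every training example is zero, so any preference for one label comes entirely from the background documents. With few or very short training examples, that bridging effect is the kind of help a related unlabeled corpus can provide.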
