Text Categorization and Sorting of Web Search Results

With the Internet facing a growing problem of information overload, the large volume, weak structure, and noisiness of Web data make it amenable to machine learning techniques. After an overview of several topics in text categorization, including document representation, feature selection, and the choice of classifier, the paper presents experimental results on how different transformations of the bag-of-words document representation, combined with feature selection, affect classification performance on texts extracted from the dmoz Open Directory of Web pages. Finally, the paper describes the primary motivation for the experiments: CatS, a new meta-search engine that uses text categorization to enhance the presentation of search results obtained from a major Web search engine.
