Noise reduction through summarization for Web-page classification

Because Web pages embed a wide variety of noisy information, Web-page classification is much more difficult than pure-text classification. In this paper, we propose to improve Web-page classification performance by removing this noise through summarization techniques. We first give empirical evidence that ideal Web-page summaries generated by human editors can indeed improve the performance of Web-page classification algorithms. We then put forward a new Web-page summarization algorithm based on Web-page layout and evaluate it, along with several other state-of-the-art text summarization algorithms, on the LookSmart Web directory. Experimental results show that classification algorithms (naive Bayes (NB) or support vector machine (SVM)) augmented by any of the summarization approaches improve by more than 5.0% over pure-text-based classification algorithms. We further introduce an ensemble method that combines the different summarization algorithms. The ensemble summarization method achieves an improvement of more than 12.0% over pure-text-based methods.
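The idea of summarizing a page before classifying it can be illustrated with a minimal sketch. It is not the layout-based summarizer or the ensemble method of this paper, whose details the abstract does not give. As stand-ins, it assumes two simple sentence scorers (a word-frequency scorer in the spirit of Luhn's method and a title-overlap scorer) and combines their normalized scores with a weighted sum. All function names, the stopword list, and the weights are hypothetical choices made for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not taken from the paper).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "all"}

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    """Lowercase alphabetic tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOPWORDS]

def frequency_scores(sentences):
    """Score a sentence by the mean page-level frequency of its content words."""
    counts = Counter(w for s in sentences for w in tokenize(s))
    scores = []
    for s in sentences:
        words = tokenize(s)
        scores.append(sum(counts[w] for w in words) / len(words) if words else 0.0)
    return scores

def title_overlap_scores(sentences, title):
    """Score a sentence by the fraction of title words it contains."""
    title_words = set(tokenize(title))
    scores = []
    for s in sentences:
        words = set(tokenize(s))
        scores.append(len(words & title_words) / len(title_words) if title_words else 0.0)
    return scores

def normalize(scores):
    """Rescale scores so the best sentence gets 1.0 (all zeros stay zero)."""
    top = max(scores, default=0.0)
    return [x / top if top > 0 else 0.0 for x in scores]

def ensemble_summary(text, title, keep=2, weights=(0.5, 0.5)):
    """Keep the `keep` highest-scoring sentences, in their original order."""
    sentences = split_sentences(text)
    freq = normalize(frequency_scores(sentences))
    overlap = normalize(title_overlap_scores(sentences, title))
    combined = [weights[0] * f + weights[1] * o for f, o in zip(freq, overlap)]
    ranked = sorted(range(len(sentences)), key=lambda i: combined[i], reverse=True)[:keep]
    return " ".join(sentences[i] for i in sorted(ranked))
```

The returned summary, rather than the full page text, would then be passed to an NB or SVM classifier. On a page whose text mixes navigation and copyright lines with topical sentences, the sketch tends to keep the topical sentences, which is the noise-reduction effect the abstract describes.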
