Web-page classification through summarization

Web-page classification is much more difficult than pure-text classification because of the large variety of noisy information embedded in Web pages. In this paper, we propose a new Web-page classification algorithm based on Web summarization to improve accuracy. We first give empirical evidence that ideal Web-page summaries generated by human editors can indeed improve the performance of Web-page classification algorithms. We then propose a new Web summarization-based classification algorithm and evaluate it along with several other state-of-the-art text summarization algorithms on the LookSmart Web directory. Experimental results show that our proposed summarization-based classification algorithm achieves an improvement of approximately 8.8% over a pure-text-based classification algorithm. We further introduce an ensemble classifier using the improved summarization algorithm and show that it achieves an improvement of about 12.9% over pure-text-based methods.
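The general idea of classifying a page from its summary rather than its full text can be illustrated with a minimal sketch. It is not the algorithm proposed in the paper: the frequency-based sentence scorer, the tiny stop-word list, the Naive Bayes classifier, and the toy pages are all assumptions chosen to keep the example short and self-contained.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical, deliberately tiny stop-word list (an assumption of this sketch).
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is",
              "for", "on", "as", "us", "from"}


def tokenize(text):
    """Return lowercase word tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def summarize(text, max_sentences=2):
    """Keep the sentences whose words are, on average, most frequent in the page.

    A simplified frequency-based extractive summarizer, used here only as a
    stand-in for a real Web-page summarization method.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(tokenize(text))

    def score(sentence):
        words = tokenize(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])  # restore the original sentence order
    return " ".join(sentences[i] for i in keep)


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(tokenize(doc))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            log_p = math.log(n_docs / total_docs)
            for word in tokenize(doc):
                log_p += math.log((counts[word] + 1) / denom)
            if log_p > best_score:
                best_label, best_score = label, log_p
        return best_label


# Toy pages (invented for illustration); each mixes topical text with
# navigation-style noise such as "Contact us." or "Login."
train_pages = [
    ("Football scores and match reports. The football league table is updated daily. "
     "Contact us. Site map.", "sports"),
    ("Tennis match results. Watch tennis highlights from the tournament. "
     "Privacy policy. Login.", "sports"),
    ("Stock market news. The market closed higher as stock prices rose. "
     "Contact us. Site map.", "finance"),
    ("Bank interest rates. Compare bank loans and savings rates. "
     "Privacy policy. Login.", "finance"),
]

# Train on summaries instead of full pages, and summarize at prediction time too.
classifier = NaiveBayes().fit(
    [summarize(page) for page, _ in train_pages],
    [label for _, label in train_pages],
)

new_page = "Latest football match news. The football team won the match. Subscribe now. Login."
print(classifier.predict(summarize(new_page)))
```

The summarizer is applied both when training and when predicting, so the classifier only ever sees text from which low-scoring sentences (here, the navigation-style noise) have been filtered out.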

Wei-Ying Ma | Qiang Yang | Zheng Chen | Dou Shen | Hua-Jun Zeng | Benyu Zhang | Yuchang Lu

[1] Jugal K. Kalita, et al. Summarization as feature selection for text categorization, 2001, CIKM '01.

[2] Fabrizio Sebastiani, et al. Machine learning in automated text categorization, 2001, CSUR.

[3] Charles Elkan, et al. The Foundations of Cost-Sensitive Learning, 2001, IJCAI.

[4] Wei-Ying Ma, et al. Building a web thesaurus from web link structure, 2003, SIGIR.

[5] Vibhu O. Mittal, et al. OCELOT: a system for summarizing Web pages, 2000, SIGIR '00.

[6] Jihoon Yang, et al. Extracting sentence segments for text summarization: a machine learning approach, 2000, SIGIR '00.

[7] A. K. Singh, et al. An Efficient Method of Eliminating Noisy Information in Web Pages for Data Mining, 2004, CIT.

[8] Wai Lam, et al. Automatic Textual Document Categorization Based on Generalized Instance Sets and a Metamodel, 2003, IEEE Trans. Pattern Anal. Mach. Intell.

[9] Hans Peter Luhn, et al. The Automatic Creation of Literature Abstracts, 1958, IBM J. Res. Dev.

[10] Andreas Paepcke, et al. Seeing the whole in parts: text summarization for web browsing on handheld devices, 2001, WWW '01.

[11] Simone Teufel, et al. Sentence extraction as a classification task, 1997.

[12] Gerhard Widmer, et al. Learning in the Presence of Concept Drift and Hidden Contexts, 1996, Machine Learning.

[13] William H. Press, et al. Book-Review - Numerical Recipes in Pascal - the Art of Scientific Computing, 1989.

[14] Susan T. Dumais, et al. Bringing order to the Web: automatically categorizing search results, 2000, CHI.

[15] Piotr Indyk, et al. Enhanced hypertext categorization using hyperlinks, 1998, SIGMOD '98.

[16] William H. Press, et al. The Art of Scientific Computing Second Edition, 1998.

[17] Bianca Zadrozny, et al. Learning and evaluating classifiers under sample selection bias, 2004, ICML.

[18] Yiming Yang, et al. A Comparative Study on Feature Selection in Text Categorization, 1997, ICML.

[19] Bernadette Bouchon-Meunier, et al. Web Document Summarization by Context, 2003, WWW.

[20] David M. Pennock, et al. Using web structure for classifying and describing web pages, 2002, WWW.

[21] Andrew McCallum, et al. A comparison of event models for naive bayes text classification, 1998, AAAI 1998.

[22] Vladimir N. Vapnik, et al. The Nature of Statistical Learning Theory, 2000, Statistics for Engineering and Information Science.

[23] Peter W. Foltz, et al. An introduction to latent semantic analysis, 1998.

[24] Alexander Dekhtyar, et al. Information Retrieval, 2018, Lecture Notes in Computer Science.

[25] Thorsten Joachims, et al. Transductive Inference for Text Classification using Support Vector Machines, 1999, ICML.

[26] Thomas G. Dietterich. What is machine learning?, 2020, Archives of Disease in Childhood.

[27] Thorsten Joachims, et al. Text Categorization with Support Vector Machines: Learning with Many Relevant Features, 1998, ECML.

[28] Corinna Cortes, et al. Support-Vector Networks, 1995, Machine Learning.

[29] Xin Liu, et al. Generic text summarization using relevance measure and latent semantic analysis, 2001, SIGIR '01.

[30] Jinwoo Park, et al. Automatic Text Categorization using the Importance of Sentences, 2002, COLING.

[31] Sur-Jin Ker, et al. A Text Categorization Based on a Summarization Extraction, 2000.

[32] Thomas G. Dietterich, et al. Improving SVM accuracy by training on auxiliary data sources, 2004, ICML.

[33] Francine Chen, et al. A trainable document summarizer, 1995, SIGIR '95.

[34] F. A. Seiler, et al. Numerical Recipes in C: The Art of Scientific Computing, 1989.

[35] Giuseppe Attardi, et al. Automatic Web Page Categorization by Link and Context Analysis, 1999.

[36] Richard A. Harshman, et al. Indexing by Latent Semantic Analysis, 1990, J. Am. Soc. Inf. Sci.

[37] Susan T. Dumais, et al. Using Linear Algebra for Intelligent Information Retrieval, 1995, SIAM Rev.

[38] Baoyao Zhou, et al. Function-based object model towards website adaptation, 2001, WWW '01.