Web page classification: Features and algorithms

Classification of Web page content is essential to many tasks in Web information retrieval, such as maintaining Web directories and focused crawling. The uncontrolled nature of Web content presents challenges for Web page classification beyond those of traditional text classification, but the interconnected nature of hypertext also provides features that can assist the process. As we review work in Web page classification, we note the importance of these Web-specific features and algorithms, describe state-of-the-art practices, and track the underlying assumptions behind the use of information from neighboring pages.
