Modeling A Generic Web Classification System Using Design Patterns

To reduce the time needed to extract specific information from the high volume of data in web documents, this paper proposes an architectural model of a generic web document classification system built with design patterns. Based on the proposed model, this work implements two techniques for classifying Thai web documents, centroid classification and neural network classification, and compares their classification effectiveness empirically. The training data set consists of 500 web documents in five categories (100 documents per category): mobile phone sales, book sales, travel sales, education information, and company profile. Another 250 web documents were then used to test the two classifiers. The experimental results show that the centroid classifier outperforms the neural network classifier in terms of both efficiency and effectiveness.
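As a rough illustration of how a generic classification model can host interchangeable classifiers, the sketch below puts a centroid classifier behind a shared interface. It is not the paper's implementation: the abstract does not name the design patterns used, so the Strategy-style `Classifier` interface, the class and function names, the raw term-frequency weighting, and the cosine similarity measure are all assumptions made for illustration. Documents are assumed to be already tokenized (Thai text would first need word segmentation).

```python
import math
from abc import ABC, abstractmethod
from collections import Counter, defaultdict


class Classifier(ABC):
    """Hypothetical common interface so classifiers can be swapped (Strategy-style)."""

    @abstractmethod
    def train(self, docs, labels):
        """Learn from tokenized documents and their category labels."""

    @abstractmethod
    def predict(self, doc):
        """Return the predicted category for one tokenized document."""


def _unit(vec):
    """Scale a sparse term-weight vector to unit length (empty if all zero)."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


class CentroidClassifier(Classifier):
    """Assumed variant: one normalized centroid per category, cosine similarity."""

    def train(self, docs, labels):
        sums = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            # Add the unit-length term-frequency vector to the category sum.
            sums[label].update(_unit(Counter(tokens)))
        self.centroids = {label: _unit(vec) for label, vec in sums.items()}

    def predict(self, doc):
        query = _unit(Counter(doc))

        def cosine(centroid):
            # Both vectors are unit length, so the dot product is the cosine.
            return sum(w * centroid.get(t, 0.0) for t, w in query.items())

        return max(self.centroids, key=lambda label: cosine(self.centroids[label]))


# Toy usage with made-up English tokens standing in for segmented Thai text.
clf = CentroidClassifier()
clf.train(
    [["mobile", "phone", "price"], ["book", "author", "price"]],
    ["mobile phone sales", "book sales"],
)
print(clf.predict(["mobile", "phone"]))
```

Under this assumed design, a neural network classifier would implement the same `train` and `predict` methods, so the surrounding pipeline and the evaluation code could stay unchanged when the two techniques are compared.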
