A Review of Machine Learning Algorithms for Text-Documents Classification

With the increasing availability of electronic documents and the rapid growth of the World Wide Web, automatic document categorization has become a key method for organizing information and for knowledge discovery. Proper classification of e-documents, online news, blogs, e-mails, and digital libraries requires text mining, machine learning, and natural language processing techniques to extract meaningful knowledge. The aim of this paper is to highlight the important techniques and methodologies employed in text document classification, while also raising awareness of some of the interesting challenges that remain to be solved, focusing mainly on text representation and machine learning techniques. The paper reviews the theory and methods of document classification and text mining, drawing on the existing literature.
