Representation of Texts into String Vectors for Text Categorization

In this study, we propose a method for encoding documents into string vectors instead of numerical vectors. Traditional approaches to text categorization usually require encoding documents into numerical vectors, which causes two main problems: huge dimensionality and sparse distribution. We therefore modify existing machine learning-based approaches to text categorization, or create new ones, so that they receive string vectors as input instead of numerical vectors. As a result, we can improve text categorization performance by avoiding these two problems.
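To make the contrast between the two encodings concrete, here is a minimal sketch in Python. The abstract does not specify how a string vector is built, so this example assumes one plausible definition: a fixed-length list of a document's words ordered by descending frequency. The function names (`tokenize`, `to_numerical_vector`, `to_string_vector`), the tie-breaking rule, and the padding value are illustrative assumptions, not the authors' method.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def to_numerical_vector(text, vocabulary):
    """Bag-of-words encoding: one count per vocabulary word.

    The vector length equals the vocabulary size, and most entries
    are zero for any single document (huge and sparse).
    """
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


def to_string_vector(text, size, pad=""):
    """Assumed string-vector encoding: the `size` most frequent words.

    Words are ranked by descending frequency, with ties broken
    alphabetically (an assumption made here for determinism).
    Short documents are padded so every vector has the same length.
    """
    counts = Counter(tokenize(text))
    ranked = sorted(counts, key=lambda word: (-counts[word], word))
    top = ranked[:size]
    return top + [pad] * (size - len(top))


docs = [
    "the market fell as the bank cut rates",
    "the team won the match",
]

# The numerical encoding needs the vocabulary of the whole corpus.
vocabulary = sorted({word for doc in docs for word in tokenize(doc)})
numerical = [to_numerical_vector(doc, vocabulary) for doc in docs]

# The string encoding has a small fixed length chosen up front.
strings = [to_string_vector(doc, 3) for doc in docs]

for doc, num, vec in zip(docs, numerical, strings):
    print(doc)
    print("  numerical:", num)
    print("  string:   ", vec)
```

In this sketch the numerical vector grows with the corpus vocabulary, while the string vector keeps the length chosen by the caller. A classifier that consumes string vectors would also need a similarity measure defined over words rather than numbers, which is outside the scope of this example.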
