An error probability estimation of the document classification using Markov model

The document classification problem has been studied with various techniques, such as the vector space model, support vector machines, and random forests. In a different line of work, J. Ziv et al. proposed a document classification method that uses the Ziv-Lempel algorithm to compress the data. Furthermore, the Context-Tree Weighting (CTW) algorithm has been proposed as an outstanding data compression method, and experimental results have been reported for document classification using the CTW algorithm. In this paper, we assume that documents in the same category are generated by a Markov model with the same parameters. Under this assumption, we propose an analysis method for estimating the classification error probability for documents of finite length.
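To make the modeling assumption concrete, the following is a minimal sketch of classification under per-category Markov models. It is an illustration only, not the method analyzed in the paper: the first-order, character-level model, the additive smoothing constant `alpha`, the toy training strings, and the function names `train_markov`, `log_likelihood`, and `classify` are all assumptions introduced here. The sketch also does not estimate the error probability; it only shows the decision rule whose errors such an analysis would concern.

```python
import math
from collections import defaultdict


def train_markov(docs, alphabet, alpha=1.0):
    """Estimate first-order transition probabilities from training documents.

    Additive smoothing (alpha) is an assumption made here so that unseen
    transitions keep a nonzero probability.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for doc in docs:
        for prev, cur in zip(doc, doc[1:]):
            counts[prev][cur] += 1.0
    model = {}
    for a in alphabet:
        total = sum(counts[a].values()) + alpha * len(alphabet)
        model[a] = {b: (counts[a][b] + alpha) / total for b in alphabet}
    return model


def log_likelihood(doc, model):
    """Log-probability of the transitions in doc (first symbol is ignored)."""
    return sum(math.log(model[p][c]) for p, c in zip(doc, doc[1:]))


def classify(doc, models):
    """Return the category whose Markov model gives doc the highest likelihood."""
    return max(models, key=lambda cat: log_likelihood(doc, models[cat]))


# Toy two-category example over a binary alphabet (hypothetical data).
alphabet = "ab"
models = {
    "alternating": train_markov(["abababab", "babababa"], alphabet),
    "repeating": train_markov(["aaaabbbb", "bbbbaaaa"], alphabet),
}
label = classify("ababab", models)
```

With these toy models, a short document such as `"ababab"` is assigned to the category whose transition statistics it matches most closely. For a finite-length document, the likelihoods under two categories can be close, which is the situation in which a classification error becomes likely.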
