To be or not to be IID: Can Zipf's Law help?

Classification is a central problem in machine learning, and improving the effectiveness of classification algorithms has significant applications in both industry and academia. This paper focuses on Higher-Order Naive Bayes (HONB), a relational variant of the well-known Naive Bayes (NB) statistical classification algorithm that has been shown to outperform NB in many cases [1,10]. In particular, HONB has outperformed NB on character n-gram-based feature spaces when the available training data is small [2]. In this paper, a correlation is hypothesized between the performance of HONB on character n-gram feature spaces and how closely the feature space distribution follows Zipf's Law. This hypothesis stems from the overarching goal of understanding HONB and knowing when it will outperform NB. Textual datasets ranging from several thousand instances to nearly 20,000 instances, some containing microtext, were used to generate character n-gram feature spaces. Both HONB and NB were used to model these datasets, with character n-gram sizes ranging from 2 to 7 and dictionary sizes of up to 5000 features. The performances of HONB and NB were then compared, and the results offer potential support for our hypothesis: the hypothesized correlation holds for the Accuracy and Precision metrics. Additionally, a solution is provided for an open problem posed in [1]: an explicit formula for the number of systems of distinct representatives (SDRs) obtainable from k given sets. This formula is connected to counting higher-order paths of arbitrary length, which are important in the learning stage of HONB.
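
The abstract does not specify exactly how "closeness to Zipf's Law" was quantified. The sketch below is one plausible way to measure it, assuming a least-squares power-law fit on the rank-frequency curve of character n-gram counts; the helper names (char_ngram_counts, zipf_fit), the R^2-based closeness measure, and the toy documents are illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch (not the paper's exact procedure): estimate how closely a
# character n-gram feature space follows Zipf's Law by fitting a power law to
# its rank-frequency curve in log-log space.
from collections import Counter
import numpy as np

def char_ngram_counts(documents, n):
    """Count character n-grams across a collection of documents."""
    counts = Counter()
    for doc in documents:
        for i in range(len(doc) - n + 1):
            counts[doc[i:i + n]] += 1
    return counts

def zipf_fit(counts, dictionary_size=5000):
    """Fit log(frequency) = a - s * log(rank) over the top-ranked features.

    Returns the estimated exponent s (Zipf's Law predicts s close to 1) and
    the R^2 of the fit, one rough measure of "closeness" to Zipf's Law.
    """
    freqs = np.array(sorted(counts.values(), reverse=True)[:dictionary_size],
                     dtype=float)
    ranks = np.arange(1, len(freqs) + 1, dtype=float)
    log_r, log_f = np.log(ranks), np.log(freqs)
    slope, intercept = np.polyfit(log_r, log_f, 1)
    predicted = slope * log_r + intercept
    ss_res = np.sum((log_f - predicted) ** 2)
    ss_tot = np.sum((log_f - log_f.mean()) ** 2)
    r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return -slope, r_squared

# Example usage with toy data; real experiments would use full text datasets.
docs = ["to be or not to be", "zipf distributions arise in natural language text"]
for n in range(2, 8):  # n-gram sizes 2-7, as in the experiments described above
    s, r2 = zipf_fit(char_ngram_counts(docs, n))
    print(f"n={n}: exponent={s:.2f}, R^2={r2:.2f}")
```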

[1] C. Nelson et al. Nuclear detection using Higher-Order topic modeling. 2012 IEEE Conference on Technologies for Homeland Security (HST), 2012.

[2] Lyle H. Ungar et al. Statistical Relational Learning for Link Prediction. 2003.

[3] William M. Pottenger et al. A framework for understanding Latent Semantic Indexing (LSI) performance. Information Processing & Management, 2006.

[4] William M. Pottenger et al. Higher Order Naïve Bayes: A Novel Non-IID Approach to Text Classification. IEEE Transactions on Knowledge and Data Engineering, 2011.

[5] William M. Pottenger et al. Modeling Microtext with Higher Order Learning. AAAI Spring Symposium: Analyzing Microtext, 2013.

[6] Michael A. Shepherd et al. An N-Gram Based Approach to Automatically Identifying Web Page Genre. 2009 42nd Hawaii International Conference on System Sciences, 2009.

[7] William M. Pottenger et al. Leveraging Higher Order Dependencies Between Features for Text Classification. 2009.

[8] Laila Khreisat et al. A machine learning approach for Arabic text classification using N-gram frequency statistics. Journal of Informetrics, 2009.

[9] William M. Pottenger et al. A Framework for Understanding LSI Performance. 2004.

[10] Jack Duffy et al. An N-gram Based Approach to Automatically Identifying Web Page Genre. 2009.

[11] Laila Khreisat et al. Arabic Text Classification Using N-Gram Frequency Statistics: A Comparative Study. DMIN, 2006.

[12] C. V. Jawahar et al. Robust Recognition of Degraded Documents Using Character N-Grams. 2012 10th IAPR International Workshop on Document Analysis Systems, 2012.

[13] W. B. Cavnar et al. N-gram-based text categorization. 1994.