Modeling Microtext with Higher Order Learning

Processing data manually is especially problematic during a natural disaster, when aid and response are urgently needed. In real-time scenarios, a difficult yet important problem is obtaining an accurate picture of needs from streaming data in a short time. The problem becomes even more challenging when the streaming data includes microtext, and modeling microtext in real time is especially important in emergency response. Once messages have been classified and/or topics learned, emergency responders can use the predicted categories and/or topics to respond rapidly to needs. In this effort, microtext from social media and text messages sent during the 2010 Haitian earthquake was modeled using two novel machine learning algorithms: Higher-Order Naive Bayes (HONB) and Higher-Order Latent Dirichlet Allocation (HO-LDA). Both illustrate that Higher-Order Learning can be valuable in classifying text data, because it improves model generalization in online or real-time scenarios where only small amounts of data are available for learning. The results are promising: when trained on samples of the training data, the HONB classifier outperformed Naive Bayes by a statistically significant margin in all trials, as measured by accuracy. Promising results were also obtained when comparing HO-LDA with traditional Latent Dirichlet Allocation.
