Linguistic Dumpster Diving: Geographical Classification of Arabic Text

In many text analysis tasks, frequently occurring words are removed as a pre-processing step prior to analysis. They are removed for two reasons: first, because they are unlikely to contribute meaningfully to the results; and second, because removing them can greatly reduce the computation required for the analysis. In the literature, such words have been called 'noise' in the system, 'fluff words', and 'non-significant words'. While removing frequent words is appropriate for many text analysis tasks, it is not appropriate for all of them; in some tasks, frequent words play a crucial role. To cite just one example, Mosteller and Wallace, in their seminal work on stylometry, showed that the frequencies of various function words could distinguish the writings of Alexander Hamilton from those of James Madison.

We use a similar frequent-word technique to classify Arabic news stories geographically. In representing a document, we discard all content words and retain only the most frequent words, so that each document is represented by a vector of common-word frequencies.

Our study used a collection of 4,167 Arabic documents from five newspapers, representing Egypt, Sudan, Libya, Syria, and the U.K. We train a support vector machine (SVM) classifier on this data using a sequential minimal optimization (SMO) algorithm and evaluate the approach with 10-fold cross-validation. Depending on the number of frequent words retained, classification accuracy ranges from 92% to 99.8%.
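To make the pipeline concrete, the sketch below shows one way to implement it with scikit-learn. The paper does not specify its implementation, so the library, the linear kernel, and the choice of 100 frequent words are assumptions; scikit-learn's SVC is backed by libsvm, whose solver is an SMO-style algorithm, which matches the training procedure described above.

```python
# A minimal sketch of the frequent-word classification pipeline, assuming
# scikit-learn; the actual implementation in the study is not specified.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC


def evaluate_frequent_word_classifier(documents, labels, n_frequent=100):
    """Classify documents by their most frequent words and report accuracy.

    documents:  list of raw text strings (here, Arabic news stories)
    labels:     source label for each document (e.g. "Egypt", "Sudan", ...)
    n_frequent: how many of the collection's most frequent words to keep
    """
    # Keep only the n_frequent most common words across the whole collection,
    # discarding all other (content) words; each document becomes a vector of
    # counts over this small shared vocabulary.
    vectorizer = CountVectorizer(max_features=n_frequent)
    X = vectorizer.fit_transform(documents)

    # A linear SVM; libsvm's solver is an SMO-style algorithm.
    clf = SVC(kernel="linear")

    # Evaluate with 10-fold cross-validation, as in the study.
    scores = cross_val_score(clf, X, labels, cv=10)
    return scores.mean()


# Hypothetical usage; loading the 4,167-document corpus is left to the reader.
# documents, labels = load_arabic_corpus()
# print(evaluate_frequent_word_classifier(documents, labels, n_frequent=100))
```

The sketch uses raw counts; normalizing by document length to obtain relative frequencies, as is common in stylometric work, would be a natural variation.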