Distributed classification of Persian News (Case study: Hamshahri News dataset)

News classification identifies the topic that a news article most likely addresses. In this paper, we classify news articles by measuring distances in the vector space model. In this method, we build a weighted frequency vector for each category and for each news article, compute the distance between the article's vector and each category vector, and assign the article to the category at minimum distance. Because of the large volume of news articles on each topic, extracting keywords, building weighted frequency vectors, and computing vector distances are very time-consuming operations. Therefore, to improve performance and calculation accuracy and to reduce execution time, we use MapReduce, a distributed programming model, to implement and run the distributed classification of news articles. This research is the first attempt to classify Persian data in a distributed manner, and its results can be used in other text mining areas in any language. It is worth mentioning that we have successfully implemented our method on the supercomputer of Amirkabir University of Technology.
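The minimum-distance classification described above can be illustrated with a small, single-process sketch. It is not the paper's implementation: the abstract does not state the weighting scheme or the distance measure, so the sketch assumes a smoothed TF-IDF weighting and cosine distance, and it imitates the map and reduce steps with plain Python functions instead of a Hadoop job. All function names and the toy English training data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def map_tokens(category, text):
    """Map step (simulated): emit ((category, term), 1) per whitespace token."""
    for term in text.split():
        yield (category, term), 1


def reduce_counts(pairs):
    """Reduce step (simulated): sum the counts for each (category, term) key."""
    totals = defaultdict(Counter)
    for (category, term), count in pairs:
        totals[category][term] += count
    return totals


def weight(term_counts, doc_freq, n_categories):
    """Assumed weighting: raw count times a smoothed inverse category frequency."""
    return {
        term: count * (math.log((1 + n_categories) / (1 + doc_freq.get(term, 0))) + 1)
        for term, count in term_counts.items()
    }


def cosine_distance(u, v):
    """Assumed distance: 1 - cosine similarity of two sparse dict vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def build_category_vectors(labeled_docs):
    """Build one weighted frequency vector per category from (category, text) pairs."""
    pairs = (p for category, text in labeled_docs for p in map_tokens(category, text))
    counts = reduce_counts(pairs)
    # Number of categories in which each term occurs at least once.
    doc_freq = Counter(term for terms in counts.values() for term in terms)
    n_categories = len(counts)
    vectors = {c: weight(tc, doc_freq, n_categories) for c, tc in counts.items()}
    return vectors, doc_freq, n_categories


def classify(text, vectors, doc_freq, n_categories):
    """Assign the text to the category whose vector is at minimum distance."""
    query = weight(Counter(text.split()), doc_freq, n_categories)
    return min(vectors, key=lambda c: cosine_distance(query, vectors[c]))


# Toy training data (hypothetical; the paper uses the Hamshahri news dataset).
training = [
    ("sport", "football team match goal"),
    ("sport", "team coach match"),
    ("economy", "bank market inflation"),
    ("economy", "market price bank"),
]
vectors, doc_freq, n_categories = build_category_vectors(training)
label = classify("football match result", vectors, doc_freq, n_categories)
```

In a real MapReduce deployment, `map_tokens` and `reduce_counts` would run as distributed tasks over partitions of the corpus, while the distance computation for each article is independent and can be parallelized in the same way.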
