Guiding Practical Text Classification Framework to Optimal State in Multiple Domains

This paper introduces DICE, a Domain-Independent text Classification Engine. DICE is robust, efficient, and domain-independent in both its software and its architecture. Each module of the system is clearly separated and encapsulated for extensibility. This modular architecture allows simple, continuous verification and makes the system easy to change over multiple cycles, even after its major development period is complete. Users can implement their own ideas on this test bed and optimize it for a particular domain simply by adjusting the configuration file. Unlike other publicly available toolkits and development environments, which target general-purpose classification models, DICE specializes in text classification and provides a number of functions specific to that task. This paper focuses on how to locate the optimal state of a practical text classification framework using the adaptation methods the system provides, such as feature selection, lemmatization, and the choice of classification model.
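As an illustration of what locating an optimal state by adjusting configuration could look like, the following minimal Python sketch enumerates every combination of adaptation options and keeps the best-scoring one. It is not DICE's implementation: the option names in EXAMPLE_GRID, the functions iter_configs and find_best_config, and the evaluate callback are all hypothetical, and the text does not specify DICE's actual configuration keys or search procedure.

```python
from itertools import product
from typing import Callable, Dict, Iterator, List, Tuple

Config = Dict[str, object]

# Hypothetical option names and values; DICE's real configuration keys
# are not given in the text.
EXAMPLE_GRID: Dict[str, List[object]] = {
    "feature_selection": ["chi_square", "information_gain", "document_frequency"],
    "lemmatization": [True, False],
    "classifier": ["naive_bayes", "svm", "rocchio"],
}


def iter_configs(grid: Dict[str, List[object]]) -> Iterator[Config]:
    """Yield one configuration per combination of option values."""
    keys = list(grid)
    for values in product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))


def find_best_config(
    grid: Dict[str, List[object]],
    evaluate: Callable[[Config], float],
) -> Tuple[Config, float]:
    """Score every configuration and return the highest-scoring one.

    `evaluate` is supplied by the caller; in practice it would train and
    test a classifier on held-out data from the target domain and return
    a quality measure such as accuracy or F1.
    """
    best_config = None
    best_score = float("-inf")
    for config in iter_configs(grid):
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    if best_config is None:
        raise ValueError("grid produced no configurations")
    return best_config, best_score
```

An exhaustive search like this is practical only while the grid stays small; with many options, each call to evaluate (a full train-and-test run) dominates the cost, so a narrower or staged search may be preferable.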
