Text Classification Using Ensemble Features Selection and Data Mining Techniques

Text categorization is a text mining (text analytics) task that involves extracting useful information from unstructured documents and then assigning those documents to categories. In this paper, we classify the TechTC dataset, which was collected from various Web directories. We employed feature selection methods, namely the Gini index, chi-square, t-statistic, and correlation, which drastically reduced model building time. Neural network models, namely the probabilistic neural network, the group method of data handling, and the multilayer perceptron, yielded higher accuracy than other techniques reported in the literature.
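As an illustration of one of the feature selection methods named above, the following is a minimal sketch of chi-square term scoring for a binary classification task. It assumes a binary document-term matrix and 0/1 class labels; the function names `chi_square` and `select_top_k` and the toy data are hypothetical, and the scoring and ensemble scheme actually used in the paper are not specified in this abstract.

```python
def chi_square(term_present, labels):
    """Chi-square statistic of the 2x2 table: term presence vs. class label."""
    n = len(labels)
    pairs = list(zip(term_present, labels))
    a = sum(1 for t, y in pairs if t and y)          # term present, positive class
    b = sum(1 for t, y in pairs if t and not y)      # term present, negative class
    c = sum(1 for t, y in pairs if not t and y)      # term absent, positive class
    d = n - a - b - c                                # term absent, negative class
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # Term occurs in all or no documents, or only one class exists:
        # the term carries no discriminative information.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def select_top_k(doc_term, labels, k):
    """Return the indices of the k highest-scoring terms and all scores."""
    n_terms = len(doc_term[0])
    scores = [
        chi_square([row[j] for row in doc_term], labels)
        for j in range(n_terms)
    ]
    ranked = sorted(range(n_terms), key=lambda j: scores[j], reverse=True)
    return ranked[:k], scores


# Hypothetical toy data: 4 documents, 3 terms, 2 classes.
docs = [
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 1],
    [0, 1, 0],
]
labels = [1, 1, 0, 0]
top_terms, scores = select_top_k(docs, labels, k=2)
```

Keeping only the top-ranked terms shrinks the input dimension of any downstream classifier, which is one plausible reason feature selection reduces model building time.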
